UniversalEPI: Harnessing Attention Mechanisms to Decode Chromatin Interactions in Rare and Unexplored Cell Types

BoevaLab/UniversalEPI

UniversalEPI

UniversalEPI is an attention-based deep ensemble that predicts enhancer-promoter interactions at distances of up to 2 Mb. It generalizes to unseen cell types using only DNA sequence and chromatin accessibility (ATAC-seq) data as input.

[Figure: UniversalEPI architecture]


Requirements

  • You can install the necessary packages by creating a conda environment using the provided .yml file:
    conda env create -f environment.yml
    
    This will create an environment named "universalepi".
  • Download the data directory, unzip it, and place it in the root directory such that you have ./data.
  • Download the model checkpoints and place each of them in the ./checkpoints directory. Unzip each checkpoint.

Step 1: Data Preprocessing

a. Input data processing

  • The details for processing ATAC-seq data from your raw input (BAM) or processed files (signal p-value bigwig and peaks bed) can be found in preprocessing/atac. This includes normalizing the bigwig signal and deduplicating ATAC-seq peaks.

b. Target data processing (only needed for training and testing)

  • preprocessing/hic contains the details for processing Hi-C data from your raw input (.hic or .cool) or processed files (pairwise interaction files). This includes Hi-C normalization.
  • Combine ATAC-seq and Hi-C to extract targets corresponding to ATAC peaks for each training cell line
    python ./preprocessing/prepare_target_data.py --cell_line gm12878 --atac_bed_path ./data/atac/raw/GM12878_dedup.bed --hic_data_dir ./data/hic/
    
    This also saves the updated ATAC-seq peaks, with 10% pseudopeaks added, at ./data/atac/raw/GM12878_dedup_neg.bed.
  • Combine ATAC-seq and Hi-C to extract targets corresponding to ATAC peaks for each test cell line
    python ./preprocessing/prepare_target_data.py --cell_line hepg2 --atac_bed_path ./data/atac/raw/HEPG2_dedup.bed --hic_data_dir ./data/hic/ --test
    
    By default, the above script runs on all autosomes (chr1-22), assumes a Hi-C resolution of 5 kb, and uses the hg38 genome assembly. These defaults can be changed with the appropriate flags.
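When several cell lines need target extraction, the two commands above can be generated in a loop. The sketch below uses only the flags shown in this README (--cell_line, --atac_bed_path, --hic_data_dir, --test). The helper names build_target_command and run_all and the JOBS list are illustrative assumptions, not part of the repository.

```python
import shlex
import subprocess
import sys

SCRIPT = "./preprocessing/prepare_target_data.py"


def build_target_command(cell_line, atac_bed_path, hic_data_dir="./data/hic/", test=False):
    """Return the argument list for one prepare_target_data.py call (hypothetical helper)."""
    cmd = [
        sys.executable, SCRIPT,
        "--cell_line", cell_line,
        "--atac_bed_path", atac_bed_path,
        "--hic_data_dir", hic_data_dir,
    ]
    if test:
        # Test cell lines are processed with the --test flag, as in the README.
        cmd.append("--test")
    return cmd


# (cell line, deduplicated peak file, is this a test cell line?)
JOBS = [
    ("gm12878", "./data/atac/raw/GM12878_dedup.bed", False),
    ("hepg2", "./data/atac/raw/HEPG2_dedup.bed", True),
]


def run_all(jobs, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cell_line, bed_path, is_test in jobs:
        cmd = build_target_command(cell_line, bed_path, test=is_test)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(JOBS)
```

With dry_run left at True the script only prints the commands, which makes it easy to check paths before starting a long preprocessing run.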

Step 2: Extract Genomic Features from Stage 1

  1. Create a new config file for your cell line or condition in ./Stage1/. See ./Stage1/ for more details.
  2. Store the genomic inputs
    python ./Stage1/store_inputs.py --cell_line imr90
    
    This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at ./data/stage1_outputs/predict_imr90/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
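Before moving on to Stage 2, it can help to confirm that the Stage 1 outputs were actually written. The following is a minimal sketch that assumes only the directory layout described above (./data/stage1_outputs/predict_{cell_line}/ containing parquet files). The function name list_stage1_outputs is hypothetical, and the individual file names inside the directory are not assumed.

```python
from pathlib import Path


def list_stage1_outputs(cell_line, root="./data/stage1_outputs"):
    """Return all parquet files stored for one cell line, sorted by path."""
    out_dir = Path(root) / f"predict_{cell_line}"
    if not out_dir.is_dir():
        raise FileNotFoundError(
            f"{out_dir} not found; run ./Stage1/store_inputs.py for this cell line first"
        )
    # rglob also finds files in sub-directories, in case outputs are nested.
    return sorted(out_dir.rglob("*.parquet"))


if __name__ == "__main__":
    try:
        files = list_stage1_outputs("imr90")
        print(f"Found {len(files)} parquet file(s)")
    except FileNotFoundError as err:
        print(err)
```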

Preprocessed data for the HepG2 cell line can be downloaded here.


Step 3: Generate Hi-C Predictions from Stage 2

  1. Ensure that the atac_path (data/stage1_outputs/) in ./Stage2/configs/configs.yaml is correctly set. Then run
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict imr90
    
    To select a subset of chromosomes for prediction, use
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict imr90 --chroms_predict 2 6 19
    
    This generates ./results/imr90/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
    • chr (chromosome)
    • pos1 (position of ATAC-seq peak 1)
    • pos2 (position of ATAC-seq peak 2)
    • predictions (log Hi-C between peaks 1 and 2)
    • variance (aleatoric uncertainty associated with the prediction)
  2. To obtain epistemic uncertainty, repeat Step 2 for each of the ten model checkpoints and take the variance of the predictions across the runs.
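The variance across checkpoint runs can be computed directly from the saved results.npz files. The sketch below assumes that each run wrote its own results.npz with the keys listed above, that pos1, pos2 and predictions are numeric arrays, and that all runs cover the same peak pairs in the same order. The function names and the way the per-checkpoint paths are collected are illustrative assumptions, not part of the repository.

```python
import numpy as np


def load_predictions(npz_path):
    """Return (pos1, pos2, predictions) from one results.npz file."""
    with np.load(npz_path) as data:
        return data["pos1"], data["pos2"], data["predictions"]


def ensemble_statistics(npz_paths):
    """Return the mean prediction and the across-run variance per peak pair.

    The across-run variance is used here as the epistemic uncertainty estimate.
    """
    ref_pos1, ref_pos2, first = load_predictions(npz_paths[0])
    runs = [first]
    for path in npz_paths[1:]:
        pos1, pos2, preds = load_predictions(path)
        # Every run must describe the same peak pairs in the same order.
        if not (np.array_equal(pos1, ref_pos1) and np.array_equal(pos2, ref_pos2)):
            raise ValueError(f"{path} is not aligned with {npz_paths[0]}")
        runs.append(preds)
    stacked = np.stack(runs, axis=0)  # shape: (n_runs, n_pairs)
    # np.var uses ddof=0 (population variance); pass ddof=1 for the sample variance.
    return stacked.mean(axis=0), stacked.var(axis=0)
```

Usage would look like `mean_pred, epistemic_var = ensemble_statistics(paths)`, where `paths` is a user-supplied list with one results.npz path per checkpoint. The alignment check compares positions only, so it does not distinguish chromosomes; comparing the chr field as well would be a stricter check.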

UniversalEPI Training and Testing

a. Train Stage1. It uses training cell lines defined in ./Stage1/configs/datamodule/validation/cross_cell.yaml.

python ./Stage1/train.py

b. Test Stage1. It uses test cell lines defined in ./Stage1/configs/datamodule/validation/cross_cell.yaml.

python ./Stage1/test.py

c. Train Stage2

d. Test Stage2

  • Ensure that the genomic data path (./data/stage1_outputs/predict_{cell_line_test}) and the test_dir path (if it exists) in ./Stage2/configs/configs.yaml are correct, then run
    python ./Stage2/main.py --config_dir ./Stage2/configs/configs.yaml --mode test
    
    This generates ./results/hepg2/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
    • chr (chromosome)
    • pos1 (position of ATAC-seq peak 1)
    • pos2 (position of ATAC-seq peak 2)
    • predictions (log Hi-C between peaks 1 and 2)
    • variance (aleatoric uncertainty associated with the prediction)
    • targets (log Hi-C)
  • ./Stage2/plot_scores.ipynb can then be used to generate evaluation plots.
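For a quick numerical check before opening the notebook, the predictions and targets arrays can be compared directly. The sketch below computes a Pearson correlation, optionally restricted to peak pairs within a maximum distance. The choice of metric, the function name pearson_within_distance, and the assumption that pos1 and pos2 are numeric coordinates on the same chromosome are illustrative; the notebook may evaluate the model differently.

```python
import numpy as np


def pearson_within_distance(pos1, pos2, predictions, targets, max_distance=None):
    """Pearson correlation between predictions and targets.

    If max_distance is given, only pairs with |pos2 - pos1| <= max_distance are used
    (same units as the stored positions).
    """
    # Cast to signed integers so the subtraction cannot wrap around.
    pos1 = np.asarray(pos1, dtype=np.int64)
    pos2 = np.asarray(pos2, dtype=np.int64)
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)

    keep = np.isfinite(predictions) & np.isfinite(targets)
    if max_distance is not None:
        keep &= np.abs(pos2 - pos1) <= max_distance
    if keep.sum() < 2:
        raise ValueError("need at least two valid peak pairs")
    return float(np.corrcoef(predictions[keep], targets[keep])[0, 1])
```

After loading a file with `data = np.load(path_to_results_npz)`, a call such as `pearson_within_distance(data["pos1"], data["pos2"], data["predictions"], data["targets"])` gives a single summary number for the test cell line.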
