UniversalEPI: Harnessing Attention Mechanisms to Decode Chromatin Interactions in Rare and Unexplored Cell Types
UniversalEPI is an attention-based deep ensemble designed to predict enhancer-promoter interactions up to 2 Mb apart. It can generalize to unseen cell types using only DNA sequence and chromatin accessibility (ATAC-seq) data as input.
- You can install the necessary packages by creating a conda environment using the provided .yml file:
conda env create -f environment.yml
This will create an environment named "universalepi".
- Download the data directory, unzip it, and place it in the root directory such that you have ./data.
- Download the model checkpoints and place each of them in the ./checkpoints directory. Unzip each checkpoint.
a. Input data processing
- The details for processing ATAC-seq data from your raw input (BAM) or processed files (signal p-value bigwig and peaks bed) can be found in preprocessing/atac. This includes normalizing the bigwig and deduplicating the ATAC-seq peaks.
b. Target data processing (only needed for training and testing)
- preprocessing/hic contains the details for processing Hi-C data from your raw input (.hic or .cool) or processed files (pairwise interaction files). This includes Hi-C normalization.
- Combine ATAC-seq and Hi-C to extract targets corresponding to ATAC peaks for each training cell line:
python ./preprocessing/prepare_target_data.py --cell_line gm12878 --atac_bed_path ./data/atac/raw/GM12878_dedup.bed --hic_data_dir ./data/hic/
This also saves the updated ATAC-seq peaks at ./data/atac/raw/GM12878_dedup_neg.bed with 10% pseudopeaks added.
- Combine ATAC-seq and Hi-C to extract targets corresponding to ATAC peaks for each test cell line:
python ./preprocessing/prepare_target_data.py --cell_line hepg2 --atac_bed_path ./data/atac/raw/HEPG2_dedup.bed --hic_data_dir ./data/hic/ --test
The above script will run for all autosomes (chr1-22) by default. The Hi-C resolution is assumed to be 5 kb, and the hg38 genome version is used by default. These can be modified using the appropriate flags.
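When several training and test cell lines need to be processed, the two commands above can be generated programmatically. The following is a minimal sketch, not part of the repository: the helper name `target_data_command` and the cell-line lists are hypothetical, and it assumes that peak files follow the `<CELL_LINE_UPPERCASE>_dedup.bed` naming seen in the GM12878 and HepG2 examples. Only the flags shown above are used.

```python
import shlex

def target_data_command(cell_line, test=False,
                        atac_dir="./data/atac/raw", hic_dir="./data/hic/"):
    """Build the argument list for prepare_target_data.py for one cell line."""
    # Assumption: peak files are named <CELL_LINE_UPPERCASE>_dedup.bed,
    # mirroring the GM12878 and HepG2 examples in this README.
    bed_path = f"{atac_dir}/{cell_line.upper()}_dedup.bed"
    cmd = [
        "python", "./preprocessing/prepare_target_data.py",
        "--cell_line", cell_line,
        "--atac_bed_path", bed_path,
        "--hic_data_dir", hic_dir,
    ]
    if test:
        cmd.append("--test")  # test cell lines use the --test flag
    return cmd

# Hypothetical cell-line lists; replace with your own.
train_cell_lines = ["gm12878"]
test_cell_lines = ["hepg2"]

commands = [target_data_command(c) for c in train_cell_lines]
commands += [target_data_command(c, test=True) for c in test_cell_lines]

for cmd in commands:
    # Print the command; to execute it from the repository root, use
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Printing first makes it easy to check the generated paths before launching any long-running preprocessing job.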
- Create a new config file for your cell line or condition in ./Stage1/. See ./Stage1/ for more details.
- Store the genomic inputs:
python ./Stage1/store_inputs.py --cell_line imr90
This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at ./data/stage1_outputs/predict_imr90/. By default, all chromosomes will be used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
Preprocessed data for the HepG2 cell line can be downloaded here.
- Ensure that the atac_path (data/stage1_outputs/) in ./Stage2/configs/configs.yaml is correctly set. Then run
python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict imr90
To select a subset of chromosomes for prediction, use
python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict imr90 --chroms_predict 2 6 19
This generates ./results/imr90/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
- chr (chromosome)
- pos1 (position of ATAC-seq peak 1)
- pos2 (position of ATAC-seq peak 2)
- predictions (log Hi-C between peaks 1 and 2)
- variance (aleatoric uncertainty associated with the prediction)
- To obtain epistemic uncertainty, repeat Step 2 for each of the ten model checkpoints and take the variance of the predictions across the runs.
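The variance across checkpoints can be computed with NumPy once each run has produced its own results.npz. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every run lists the peak pairs in the same order, so that predictions can be stacked row by row. It uses the population variance across models; pass `ddof=1` to `var` if a sample variance is preferred.

```python
import numpy as np

def load_runs(paths):
    """Stack 'predictions' and 'variance' from one results.npz per checkpoint."""
    preds, variances = [], []
    for path in paths:
        with np.load(path) as run:
            preds.append(run["predictions"])
            variances.append(run["variance"])
    # Assumption: all runs contain the same peak pairs in the same order.
    return np.stack(preds), np.stack(variances)

def combine_ensemble(predictions, variances):
    """Combine arrays of shape (n_models, n_pairs) into per-pair summaries."""
    predictions = np.asarray(predictions, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean_prediction = predictions.mean(axis=0)  # ensemble prediction
    epistemic = predictions.var(axis=0)         # disagreement between models
    aleatoric = variances.mean(axis=0)          # average predicted noise
    return mean_prediction, epistemic, aleatoric

# Synthetic stand-in for two checkpoints and two peak pairs.
demo_predictions = np.array([[1.0, 2.0], [3.0, 2.0]])
demo_variances = np.array([[0.1, 0.2], [0.3, 0.2]])
mean_prediction, epistemic, aleatoric = combine_ensemble(demo_predictions, demo_variances)
print(mean_prediction, epistemic, aleatoric)
```

With real outputs, `load_runs` would be called with the list of results.npz paths produced by the individual checkpoints, and its two return values passed to `combine_ensemble`.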
a. Train Stage1. It uses training cell lines defined in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
python ./Stage1/train.py
b. Test Stage1. It uses test cell lines defined in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
python ./Stage1/test.py
c. Train Stage2
- Ensure that the genomic data (./data/stage1_outputs/predict_{cell_line}) and Hi-C paths (./data/hic/) in ./Stage2/configs/configs.yaml are correct. Then run
python ./Stage2/main.py --config_dir ./Stage2/configs/configs.yaml --mode train
- If npz files are already generated using create_dataset.py and merge_dataset.py, the data paths can be specified in ./Stage2/configs/configs.yaml.
d. Test Stage2
- Ensure that the genomic data (./data/stage1_outputs/predict_{cell_line_test}) and the test_dir path (if it exists) in ./Stage2/configs/configs.yaml are correct. Then run
python ./Stage2/main.py --config_dir ./Stage2/configs/configs.yaml --mode test
This generates ./results/hepg2/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
- chr (chromosome)
- pos1 (position of ATAC-seq peak 1)
- pos2 (position of ATAC-seq peak 2)
- predictions (log Hi-C between peaks 1 and 2)
- variance (aleatoric uncertainty associated with the prediction)
- targets (log Hi-C)
./Stage2/plot_scores.ipynb can then be used to generate evaluation plots.
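For a quick numerical check outside the notebook, the fields of results.npz can be compared directly. The sketch below computes the Pearson correlation between predictions and targets within genomic-distance bins. It is an assumption-laden illustration and not necessarily what plot_scores.ipynb does: the function name and bin edges are hypothetical, and pos1 and pos2 are assumed to be base-pair coordinates on the same chromosome.

```python
import numpy as np

def pearson_by_distance(pos1, pos2, predictions, targets, bin_edges):
    """Pearson correlation of predictions vs. targets per distance bin."""
    # Cast to signed integers so the subtraction cannot wrap around.
    distance = np.abs(np.asarray(pos2, dtype=np.int64) - np.asarray(pos1, dtype=np.int64))
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    scores = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (distance >= lo) & (distance < hi)
        p, t = predictions[mask], targets[mask]
        if p.size < 2 or p.std() == 0 or t.std() == 0:
            scores[(lo, hi)] = float("nan")  # correlation undefined here
        else:
            scores[(lo, hi)] = float(np.corrcoef(p, t)[0, 1])
    return scores

# Synthetic stand-in for the arrays stored in results.npz.
pos1 = [0, 0, 0, 0, 0, 0]
pos2 = [1000, 2000, 3000, 600000, 700000, 800000]
predictions = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
targets = [2.0, 4.0, 6.0, 1.0, 2.0, 3.0]
scores = pearson_by_distance(pos1, pos2, predictions, targets, [0, 500000, 2000000])
print(scores)
```

With real outputs, the arrays would come from `np.load` on the generated results.npz, using the keys pos1, pos2, predictions, and targets listed above. Binning by distance is useful because Hi-C signal decays strongly with distance, which can inflate a single genome-wide correlation.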