Anusri Pampari edited this page Dec 30, 2022 · 15 revisions

(1) How to pick a pre-trained bias model?

Pick a pre-trained bias model whose experimental setting is similar to that of the current dataset. For example, if your current dataset follows an Omni ATAC-seq protocol, use a bias model trained on an Omni ATAC-seq dataset. If multiple pre-trained Omni ATAC-seq bias models exist, pick the one trained on the closest biosample or on the dataset with the highest read depth.

(2) What is the intuition behind choosing a value for the bias_threshold_factor hyperparameter? How to retrain the bias model based on this?

Non-peak regions used in bias model training are filtered based on bias_threshold_factor as follows: regions with total counts greater than 0.1_quantile(total counts in peaks) * bias_threshold_factor are filtered out.

  • If bias_threshold_factor is set to a very low value, you might filter out a lot of non-peak regions, and the resulting training set might have a GC distribution that differs from that of the peak regions. This will result in poor bias model transfer because the model will learn an AT-rich bias.

  • If bias_threshold_factor is set to a very high value, you might include non-peak regions with high counts. High counts are typically caused by Transcription Factor (TF) motifs, so the bias model might capture cell-type-specific TFs. This is not ideal because we want to regress out only the effect of bias motifs, not the effect of cell-type-specific TF motifs.
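The filtering rule described above can be sketched in a few lines of NumPy. This is only an illustration, not the chrombpnet implementation: the function name `filter_nonpeaks` and its arguments are hypothetical, and the sketch assumes that "0.1_quantile" means the 0.1 quantile of the total counts in peaks.

```python
import numpy as np

def filter_nonpeaks(peak_counts, nonpeak_counts, bias_threshold_factor=0.5, quantile=0.1):
    """Return (keep_mask, upper_threshold) for non-peak regions.

    Hypothetical helper: assumes the threshold is the `quantile` quantile of
    the total counts in peaks multiplied by bias_threshold_factor.
    """
    peak_counts = np.asarray(peak_counts, dtype=float)
    nonpeak_counts = np.asarray(nonpeak_counts, dtype=float)

    # Upper bound on the total counts a non-peak region may have.
    upper_threshold = np.quantile(peak_counts, quantile) * bias_threshold_factor

    # Regions with counts greater than the threshold are filtered out.
    keep_mask = nonpeak_counts <= upper_threshold
    return keep_mask, upper_threshold


# Toy total counts per region (made-up numbers for illustration only).
peak_counts = np.arange(100, 1100, 100)
nonpeak_counts = np.array([10, 50, 90, 120, 300])

keep_mask, upper_threshold = filter_nonpeaks(peak_counts, nonpeak_counts, 0.5)
kept_fraction = keep_mask.mean()
print(f"threshold={upper_threshold:.1f}, kept {kept_fraction:.0%} of non-peak regions")
```

Lowering bias_threshold_factor shrinks the threshold and keeps fewer non-peak regions, while raising it admits regions with higher counts, which is the trade-off the two bullets describe.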

When choosing a value for bias_threshold_factor, we recommend starting from 0.5 for ATAC and 0.8 for DNase. If the resulting model captures TF motifs, reduce bias_threshold_factor by 0.1 and repeat until you can train a model that captures only bias motifs (without any TF motifs). If the model has a strongly negative pearsonr (< -0.3) in peak regions but captures no TF motifs, increase bias_threshold_factor by 0.1 and repeat until the pearsonr rises above -0.3 while the model still captures no TF motifs.
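The tuning procedure above can be summarized as a small decision rule. The sketch below is hypothetical and not part of chrombpnet. It assumes you have already retrained the bias model, inspected its motifs by hand, and read off the pearsonr in peak regions. The names `START_FACTOR` and `next_bias_threshold_factor` are made up for illustration.

```python
# Recommended starting values from the text above.
START_FACTOR = {"ATAC": 0.5, "DNASE": 0.8}

def next_bias_threshold_factor(current, captures_tf_motifs, peak_pearsonr,
                               step=0.1, pearsonr_floor=-0.3):
    """Suggest the bias_threshold_factor to try in the next retraining round.

    current            -- factor used for the model just trained
    captures_tf_motifs -- True if the bias model's motifs include TF motifs
    peak_pearsonr      -- pearsonr of the bias model in peak regions
    """
    if captures_tf_motifs:
        # TF motifs leaked in: be stricter about which non-peaks are kept.
        return round(current - step, 10)
    if peak_pearsonr < pearsonr_floor:
        # No TF motifs, but pearsonr is too negative: loosen the filter.
        return round(current + step, 10)
    # Only bias motifs and acceptable pearsonr: keep the current value.
    return current


factor = START_FACTOR["ATAC"]
factor = next_bias_threshold_factor(factor, captures_tf_motifs=True, peak_pearsonr=-0.1)
print(factor)
```

Each call corresponds to one retraining round. Stop once the function returns the value you passed in.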

(3) Why do the quality check reports compute contribution scores and TFModisco motifs only on a subset of peaks?

It is computationally expensive to run these steps on all peaks. Since the goal is a quick sanity check of the models, we subsample 30K peaks from the original peak set for interpretation and TFModisco. Users are encouraged to eventually run the chrombpnet contribs_bw and chrombpnet modisco_motifs steps on all peak regions for interpretation and motif discovery.
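As a rough illustration of the subsampling step, the sketch below draws 30K peaks uniformly at random from an in-memory peak list. It is an assumption about how such a subsample could be taken, not the code used by the quality check reports. The function name `subsample_peaks`, the fixed seed, and the tuple representation of a peak are all hypothetical.

```python
import random

def subsample_peaks(peaks, n=30000, seed=0):
    """Return at most n peaks, sampled without replacement.

    The original order of the peaks is preserved and the seed is fixed
    so that the subsample is reproducible.
    """
    if len(peaks) <= n:
        return list(peaks)
    rng = random.Random(seed)
    # Sample indices rather than peaks so the original order can be kept.
    chosen = sorted(rng.sample(range(len(peaks)), n))
    return [peaks[i] for i in chosen]


# Toy peak set of (chrom, start, end) tuples, made up for illustration.
all_peaks = [("chr1", i * 1000, i * 1000 + 500) for i in range(50000)]
subset = subsample_peaks(all_peaks)
print(len(subset))
```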