Update README.md
update software list and references
add Method template
sreichl authored Mar 10, 2022
1 parent 9a8cdfc commit c9ed8c5
Showing 1 changed file with 34 additions and 1 deletion.
35 changes: 34 additions & 1 deletion README.md
@@ -23,6 +23,7 @@ Table of contents
* [Report](#report)
* [Results](#results)
* [Examples](#examples)
* [Resources](#resources)
* [Tips & FAQs](#tips)

# Authors
@@ -42,6 +43,7 @@ This project wouldn't be possible without the following software
| Bowtie2 | https://doi.org/10.1038/nmeth.1923 |
| CQN | https://doi.org/10.1093/biostatistics/kxr054 |
| deeptools | https://doi.org/10.1093/nar/gkw257 |
| ENCODE | https://doi.org/10.1038/s41598-019-45839-z |
| fastp | https://doi.org/10.1093/bioinformatics/bty560 |
| HOMER | https://doi.org/10.1016/j.molcel.2010.05.004 |
| MACS2 | https://doi.org/10.1186/gb-2008-9-9-r137 |
@@ -51,14 +53,45 @@ This project wouldn't be possible without the following software
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| samblaster | https://doi.org/10.1093/bioinformatics/btu314 |
| samtools | https://doi.org/10.1093/bioinformatics/btp352 |
| SCANPY | https://doi.org/10.1186/s13059-017-1382-0 |
| scikit-learn | http://jmlr.org/papers/v12/pedregosa11a.html |
| seaborn | https://doi.org/10.21105/joss.03021 |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
| TMM | https://doi.org/10.1186/gb-2010-11-3-r25 |
| UMAP | https://doi.org/10.21105/joss.00861 |
| UROPA | https://doi.org/10.1038/s41598-017-02464-y |

# Methods
Method Template coming soon
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) must be read from the respective conda environment specifications (.yaml file) or determined after execution. Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g. [X].

**Processing.**
Sequencing adapters were removed using fastp (ver) [ref]. Bowtie2 (ver) [ref] was used to align the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], and indexed using samtools (ver) [ref]. To detect open chromatin regions, peaks were called on each sample with MACS2 (ver) [ref] using the “--nomodel”, “--keep-dup auto” and “--extsize 147” options. The HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis over the detected open chromatin regions.
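
As a rough illustration of how the options named above map onto command lines, the following sketch only builds argument lists and executes nothing. The helper name `build_processing_commands`, the file naming scheme, and the single-end input are assumptions made for this example; the commands the workflow actually runs may differ.

```python
from typing import Dict, List


def build_processing_commands(sample: str, index_prefix: str) -> Dict[str, List[str]]:
    """Return argv lists for three processing steps (hypothetical helper).

    File names and single-end input are assumptions for illustration only.
    """
    trimmed = f"{sample}.trimmed.fastq.gz"
    return {
        # Adapter removal with fastp.
        "fastp": ["fastp", "-i", f"{sample}.fastq.gz", "-o", trimmed],
        # Alignment with the --very-sensitive preset named in the text.
        "bowtie2": ["bowtie2", "--very-sensitive", "-x", index_prefix,
                    "-U", trimmed, "-S", f"{sample}.sam"],
        # Peak calling with the three MACS2 options named in the text.
        "macs2": ["macs2", "callpeak", "-t", f"{sample}.bam", "-f", "BAM",
                  "-n", sample, "--nomodel", "--keep-dup", "auto",
                  "--extsize", "147"],
    }


if __name__ == "__main__":
    for name, argv in build_processing_commands("sample1", "hg38").items():
        print(name, " ".join(argv))
```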

Quality control metrics were aggregated and reported using MultiQC (ver) [ref]; [X] sample(s) had to be removed.

**Quantification.**
A consensus region set comprising [X] genomic regions was generated by merging the identified peak summits across all samples, each extended by 250 bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref].
Consensus regions were annotated using the annotatePeaks function from HOMER (ver) [ref].
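
The extend-and-merge logic behind the consensus region set can be sketched in plain Python. This is not the bedtools/pybedtools call used by the workflow; it is a minimal stand-in that assumes 0-based half-open coordinates, ignores chromosome end clipping and blacklist filtering, and uses the hypothetical function name `consensus_regions`.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def consensus_regions(summits: List[Tuple[str, int]], flank: int = 250) -> List[Interval]:
    """Extend each 1-bp summit by `flank` bp on both sides, then merge overlaps."""
    # Extension step: clip the start at 0 (chromosome ends are not clipped here).
    extended = sorted(
        (chrom, max(0, pos - flank), pos + 1 + flank) for chrom, pos in summits
    )
    merged: List[Interval] = []
    for chrom, start, end in extended:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or directly adjacent: grow the previous interval.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    print(consensus_regions([("chr1", 1000), ("chr1", 1300), ("chr1", 5000)]))
```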

Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic.
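
The precedence order described above (TSS proximal, gene body, distal, intergenic) can be written as a short decision function. The sketch below is only an illustration of that ordering, not the UROPA configuration itself; the function name, its arguments, and the strand handling are assumptions, and the distance thresholds correspond to the [X] placeholders in the text.

```python
def annotate_region(midpoint: int, tss: int, strand: str,
                    overlaps_tss: bool, overlaps_gene: bool,
                    upstream: int, downstream: int, distal: int) -> str:
    """Classify one region relative to its closest feature (hypothetical helper)."""
    # Signed offset relative to the TSS; negative values lie upstream of the gene.
    offset = midpoint - tss if strand == "+" else tss - midpoint
    if overlaps_tss or -upstream <= offset <= downstream:
        return "TSS proximal"
    if overlaps_gene:
        return "gene body"
    if abs(offset) <= distal:
        return "distal"
    return "intergenic"


if __name__ == "__main__":
    # Thresholds here are arbitrary example values, not recommendations.
    print(annotate_region(900, 1000, "+", False, False, 1000, 500, 10000))
```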

The consensus region set was used to quantify the chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. Generation of the consensus region set and sample-wise quantification of accessibility were performed using bedtools (ver) [ref] and pybedtools (ver) [ref].
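
Conceptually, the quantification counts, for every consensus region, the reads that overlap it by at least one base. The following sketch shows that idea with a simple quadratic loop over in-memory tuples; it is a stand-in for the bedtools-based counting, with the hypothetical function name `count_overlaps` and 0-based half-open coordinates assumed.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def count_overlaps(regions: List[Interval], reads: List[Interval]) -> List[int]:
    """Count reads overlapping each region by at least one base."""
    counts = []
    for chrom, start, end in regions:
        n = sum(
            1
            for r_chrom, r_start, r_end in reads
            if r_chrom == chrom and r_start < end and r_end > start
        )
        counts.append(n)
    return counts


if __name__ == "__main__":
    regions = [("chr1", 100, 200), ("chr1", 500, 600)]
    reads = [("chr1", 90, 110), ("chr1", 150, 160), ("chr1", 590, 640)]
    print(count_overlaps(regions, reads))
```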

**Optional.** We split the data into subsets according to the annotation [X] and performed all downstream analyses for each subset separately.

**Downstream Analysis.**
For all downstream analyses, we filtered the [X] consensus regions to [X] regions which had reads in at least [X] samples, were covered by at least [X] reads (normalized by median library size) in at least [X] proportion of samples of the smallest subsample group with potential signal determined with the annotation [X], and by at least [X] total reads across all samples.
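
One possible reading of the three filter criteria is sketched below. The function name, argument names, and the exact interpretation (counts scaled to the median library size, and a required number of samples equal to the given proportion of the smallest group) are assumptions for illustration; the workflow's implementation may differ in detail.

```python
from statistics import median
from typing import Dict, List


def filter_regions(counts: Dict[str, List[int]], groups: List[str],
                   min_samples: int, min_reads: float,
                   min_proportion: float, min_total: int) -> List[str]:
    """Return the region names passing all three filters (hypothetical helper).

    `counts` maps region name to per-sample read counts; `groups` holds the
    group label of each sample in the same order.
    """
    n_samples = len(groups)
    # Library size per sample = sum of its counts over all regions.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    med = median(lib_sizes)
    # Number of samples required, derived from the smallest group.
    smallest = min(groups.count(g) for g in set(groups))
    needed = min_proportion * smallest
    kept = []
    for region, row in counts.items():
        detected = sum(1 for c in row if c > 0)
        # Scale each count to the median library size.
        normalized = [c * med / lib_sizes[j] if lib_sizes[j] else 0.0
                      for j, c in enumerate(row)]
        covered = sum(1 for v in normalized if v >= min_reads)
        if detected >= min_samples and covered >= needed and sum(row) >= min_total:
            kept.append(region)
    return kept
```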

Next, we determined highly variable regions (HVR) using an adaptation of the SCANPY (ver) [ref] function highly_variable_genes with flavor 'cellranger'. Instead of dispersion = variance/mean, we used dispersion = standard deviation: because the number of regions in ATAC-seq data can be very large, log(cpm) values and their means can be negative, which would result in meaningless negative dispersion values. Using the standard deviation additionally avoids division-by-zero problems. We therefore only employed the binning-by-mean strategy for stabilization, but not a "coefficient of variation" strategy.
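
The idea of binning by mean and normalizing the standard deviation within each bin can be sketched as follows. This is a simplified illustration, not the SCANPY code or the workflow's adaptation: equal-sized rank bins, the population standard deviation, and median/MAD scaling within each bin are assumptions made for this example, as is the function name.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def normalized_dispersion(log_values: Dict[str, List[float]],
                          n_bins: int = 2) -> Dict[str, float]:
    """Bin regions by mean and scale their standard deviation within each bin."""
    if not log_values:
        return {}
    means = {r: mean(v) for r, v in log_values.items()}
    # Dispersion is the standard deviation, not variance / mean.
    disp = {r: pstdev(v) for r, v in log_values.items()}
    ranked = sorted(means, key=means.get)
    size = -(-len(ranked) // n_bins)  # ceiling division: regions per bin
    result: Dict[str, float] = {}
    for i in range(0, len(ranked), size):
        members = ranked[i:i + size]
        med = median(disp[r] for r in members)
        mad = median(abs(disp[r] - med) for r in members)
        for r in members:
            # Regions in a bin without spread get a neutral score of 0.
            result[r] = (disp[r] - med) / mad if mad > 0 else 0.0
    return result
```

Regions with the highest normalized dispersion would then be selected as HVRs; the number to keep is a free parameter of the analysis.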

Conditional quantile normalization using cqn (ver) [ref] was performed on the filtered accessibility matrix with the region length and GC content of the consensus regions as conditions, both quantified using bedtools (ver) [ref] and pybedtools (ver) [ref].

Trimmed mean of M-values (TMM) normalization (ver) [ref] was performed on the filtered accessibility matrix.

**Unsupervised Analysis & Visualization.**
We applied both linear and non-linear unsupervised dimensionality reduction methods to the normalized data to visualize emerging sample-wise patterns in two dimensions. We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach and Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] with the correlation metric as the non-linear approach. matplotlib (ver) [ref] was used for visualization.

The processing and analysis described here were performed using a publicly available Snakemake (ver) [ref] workflow [ref - cite this workflow here].

# Features
- Processing