diff --git a/README.md b/README.md
index 7d49c2d..313bf6c 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 [![DOI](https://zenodo.org/badge/659800258.svg)](https://zenodo.org/badge/latestdoi/659800258)
 
 # Split, Filter, Normalize and Integrate Sequencing Data
-A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to split, filter, normalize, integrate and select highly variable features of count matrices resulting from experiments with sequencing readout (e.g., RNA-seq, ATAC-seq, ChIP-seq, Methyl-seq, miRNA-seq, ...) including diagnostic visualizations documenting the respective data transformations. This often represents the first analysis after signal processing critically influencing all downstream analyses.
+A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to split, filter, normalize, integrate, and select highly variable features of count matrices resulting from experiments with sequencing readout (e.g., RNA-seq, ATAC-seq, ChIP-seq, Methyl-seq, miRNA-seq, ...) including diagnostic visualizations documenting the respective data transformations. This often represents the first analysis after signal processing, critically influencing all downstream analyses.
 
 This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
 
@@ -51,7 +51,7 @@ This project wouldn't be possible without the following software and their depen
 
 # Methods
 
-This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References `[ref]` to the respective publications are curated in the software table above. Versions `(ver)` have to be read out from the respective conda environment specifications (`workflow/envs/*.yaml` file) or post execution in the result directory (`envs/spilterlize_integrate/*.yaml`). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., `[X]`.
+This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References `[ref]` to the respective publications are curated in the software table above. Versions `(ver)` have to be read out from the respective conda environment specifications (`workflow/envs/*.yaml` file) or post-execution in the result directory (`envs/spilterlize_integrate/*.yaml`). Parameters that have to be adapted depending on the data or workflow configurations are denoted in square brackets, e.g., `[X]`.
 
 __Split.__ The input data was split by `[split_by]`, with each split denoted by `[split_by]_{annotation_level}`. The complete data was retained in the "all" split. Sample filtering was achieved by removing sample rows from the annotation file or using `NA` in the respective annotation column. Annotations were also split and provided separately. The data was loaded, split, and saved using the Python library pandas `(ver)[ref]`.
 
@@ -105,6 +105,9 @@ The workflow performs the following steps to produce the outlined results:
   - The data can be integrated using the [reComBat](https://github.com/BorgwardtLab/reComBat) method, which requires log-normalized data.
   - This method adjusts for batch effects and unwanted sources of variation while trying to retain desired sources of variation e.g., biological variability.
   - This is particularly useful when combining data from different experiments or sequencing runs.
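The splitting behavior described in the __Split.__ paragraph above (one split per annotation level, an "all" split retaining everything, and `NA` samples dropped) can be sketched with pandas roughly as follows. The sample names, feature names, and the `condition` column are illustrative, not taken from the workflow:

```python
import pandas as pd

# toy counts (samples x features) and sample annotations; all values illustrative
counts = pd.DataFrame(
    {"featA": [5, 0, 7, 2], "featB": [1, 3, 0, 8]},
    index=["s1", "s2", "s3", "s4"],
)
annot = pd.DataFrame(
    {"condition": ["ctrl", "ctrl", "treat", None]},  # NA excludes s4 from the splits
    index=["s1", "s2", "s3", "s4"],
)

# the complete data is retained in the "all" split
splits = {"all": counts}
# one split per annotation level; samples with NA are excluded automatically,
# because pandas groupby drops missing keys by default
for level, samples in annot.groupby("condition").groups.items():
    splits[f"condition_{level}"] = counts.loc[samples]
```

Each entry of `splits` (e.g., `condition_ctrl`) would then be saved together with the matching subset of the annotation table.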
+  - Note: Due to a [reComBat bug](https://github.com/BorgwardtLab/reComBat/issues/3), a numerical confounder can only be corrected if at least one categorical confounder is also declared.
+  - Using the same variable for both `batch` and `categorical confounder` parameters can cause opposite batch effects.
+  - We recommend addressing the numerical confounder in downstream analyses, such as within a differential analysis model.
 - Highly Variable Feature Selection (`*_HVF.csv`)
   - The top percentage of the most variable features is selected based on the binned normalized dispersion of each feature adapted from [Zheng (2017) Nature Communications](https://doi.org/10.1038/ncomms14049).
   - These HVFs are often the most informative for downstream analyses such as clustering or differential expression, but smaller effects of interest could be lost.
@@ -113,19 +116,19 @@ The workflow performs the following steps to produce the outlined results:
 - All transformed datasets are saved as CSV files and named by the applied methods, respectively.
   - Example: `{split}/normCQN_reComBat_HVF.csv` implies that the respective data `{split}` was filtered, normalized using CQN, integrated with reComBat and subset to its HVFs.
 - Visualizations (`{split}/plots/`)
-  - Next to the method specific visualizations (e.g., for CQN, HVF selection), a **diagnostic figure** is provided for every generated dataset (`*.png`), consisting of the following plots:
+  - Next to the method-specific visualizations (e.g., for CQN, HVF selection), a **diagnostic figure** is provided for every generated dataset (`*.png`), consisting of the following plots:
-    - Mean-Variance relationship of all features as hexagonal heatmap of 2d bin counts.
+    - Mean-Variance relationship of all features as a hexagonal heatmap of 2d bin counts.
     - Densities of log-normalized counts per sample colored by sample or configured annotation column.
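The binned normalized dispersion idea behind the HVF selection above (adapted from Zheng 2017) can be illustrated with a minimal sketch: bin features by mean expression, z-score each feature's dispersion within its bin, and keep the top percentage. `select_hvf` and its parameters are made up for this sketch and do not mirror the workflow's actual implementation:

```python
import numpy as np
import pandas as pd

def select_hvf(data: pd.DataFrame, top_percent: float = 10.0, n_bins: int = 20) -> pd.Index:
    """Rank features by dispersion normalized within mean-expression bins
    and return the names of the top `top_percent` of features."""
    mean = data.mean(axis=0)
    # dispersion as variance/mean; guard against zero-mean features
    dispersion = data.var(axis=0) / mean.replace(0, np.nan)
    # bin features by mean expression, then z-score dispersion within each bin
    bins = pd.cut(mean, bins=n_bins)
    norm_disp = dispersion.groupby(bins, observed=True).transform(
        lambda d: (d - d.mean()) / (d.std() if d.std() > 0 else 1.0)
    )
    n_top = max(1, int(len(norm_disp) * top_percent / 100))
    return norm_disp.nlargest(n_top).index

# toy samples x features matrix; f3 varies most relative to its mean
data = pd.DataFrame({
    "f1": [1, 1, 1, 1],
    "f2": [1, 2, 1, 2],
    "f3": [0, 10, 0, 10],
    "f4": [5, 5, 5, 5],
})
hvf = select_hvf(data, top_percent=50, n_bins=2)  # f3 is among the selected features
```

The within-bin normalization is the key step: it prevents highly expressed features, whose raw variance is large for purely technical reasons, from dominating the selection.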
     - Boxplots of log-normalized counts per sample colored by sample or configured annotation column.
     - Principal Component Analysis (PCA) plots, with samples colored by up to two annotation columns (e.g., batch and treatment).
 - Confounding Factor Analysis to inform integration (`*_CFA.png`)
   - Quantification of statistical association between provided metadata and (up to) the first ten principal components.
   - Categorical metadata association is tested using the non-parametric Kruskal-Wallis test, which is broadly applicable due to relaxed requirements and assumptions.
-  - Numeric metadata association is tested using rank-based Kendall's Tau, which is suitibale for "small" data sets with many ties and is robust to outliers.
+  - Numeric metadata association is tested using rank-based Kendall's Tau, which is suitable for "small" data sets with many ties and is robust to outliers.
   - Statistical associations as `-log10(adjusted p-values)` are visualized using a heatmap with hierarchically clustered rows (metadata).
 - Correlation Heatmaps (`*_heatmap_{clustered|sorted}.png`)
-  - Heatmap of sample-wise Pearson correlation matrix of the respective data split and processing step to quickly assess sample similarities e.g., replicates/conditions should correlate highly but batch shoud not.
+  - Heatmap of sample-wise Pearson correlation matrix of the respective data split and processing step to quickly assess sample similarities, e.g., replicates/conditions should correlate highly but batch should not.
-  - Hierarchicaly clustered using method 'complete' with distance metric 'euclidean' (`*_heatmap_clustered.png`).
+  - Hierarchically clustered using method 'complete' with distance metric 'euclidean' (`*_heatmap_clustered.png`).
   - Alphabetically sorted by sample name (`*_heatmap_sorted.png`).
   - Note: raw and filtered counts are log2(x+1)-normalized for the visualizations.
 - These visualizations should help to assess the quality of the data and the effectiveness of the processing steps (e.g., normalization).
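The Confounding Factor Analysis logic above, Kruskal-Wallis for categorical metadata and Kendall's Tau for numeric metadata against principal components, can be sketched with scipy. The metadata columns, sample sizes, and the use of unadjusted p-values are illustrative assumptions; the workflow reports multiple-testing adjusted p-values:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical inputs: principal components (samples x PCs) and sample metadata
pcs = pd.DataFrame(rng.normal(size=(12, 3)), columns=["PC1", "PC2", "PC3"])
meta = pd.DataFrame({
    "batch": ["A"] * 6 + ["B"] * 6,                      # categorical
    "age": rng.integers(20, 60, size=12).astype(float),  # numeric
})

pvals = pd.DataFrame(index=meta.columns, columns=pcs.columns, dtype=float)
for col in meta.columns:
    for pc in pcs.columns:
        if meta[col].dtype == object:
            # categorical metadata: non-parametric Kruskal-Wallis across groups
            groups = [pcs.loc[meta[col] == g, pc] for g in meta[col].unique()]
            pvals.loc[col, pc] = stats.kruskal(*groups).pvalue
        else:
            # numeric metadata: rank-based Kendall's Tau
            pvals.loc[col, pc] = stats.kendalltau(meta[col], pcs[pc]).pvalue

# -log10 of (adjusted) p-values would then be shown as a clustered heatmap
neg_log10 = -np.log10(pvals.astype(float))
```

A strong `batch`-vs-PC association in such a matrix is exactly the signal that would motivate running the reComBat integration step described above.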