From 75843138f59d8350295b385dbd458a014e09ae9a Mon Sep 17 00:00:00 2001
From: Stephan Reichl <53785552+sreichl@users.noreply.github.com>
Date: Thu, 29 Aug 2024 12:40:31 +0200
Subject: [PATCH] Update README.md

---
 README.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 313bf6c..7dcee93 100644
--- a/README.md
+++ b/README.md
@@ -94,17 +94,25 @@ The workflow performs the following steps to produce the outlined results:
   - ...`splits` by using NA in the respective annotation column.
   - Annotations are also split and provided separately (`{annotation_column}_{annotation_level}/annotation.csv`).
 - Filter (`filtered.csv`)
-  - The features are filtered using the edgeR package's [filterByExpr](https://rdrr.io/bioc/edgeR/man/filterByExpr.html) function, which removes low count features that are unlikely to be informative.
+  - The features are filtered using the edgeR package's [filterByExpr](https://rdrr.io/bioc/edgeR/man/filterByExpr.html) function, which removes low-count features that are unlikely to be informative but likely to be statistically problematic downstream.
+  - The `min.count` parameter has the biggest impact on the filtering process, while `min.total.count` does not.
+  - The desired number of features depends on the data and assay used, below are some examples that provide a ballpark estimate based on previous experiences (feel free to ignore).
+    - Generally, you should filter until the mean-variance plot shows a consistent downward trend, with no upward trend at the low-expression end (left).
+    - RNA-seq, when starting with 55k genes it is not uncommon to end up with ~15k genes or less post-filtering.
+    - ATAC-seq consensus regions scale with the number of samples. Nevertheless, we had good experiences with ~100k genomic regions post-filtering.
 - Normalize (`norm{method}.csv`)
   - The data can be normalized using several methods to correct for technical biases (e.g., differences in library size).
   - All methods supported in edgeR's function [CalcNormFactors](https://rdrr.io/bioc/edgeR/man/calcNormFactors.html) with subequent [CPM/RPKM](https://rdrr.io/bioc/edgeR/man/cpm.html) quantification including method specific parameters can be configured.
-  - [CQN](https://bioconductor.org/packages/release/bioc/html/cqn.html) (Conditional Quantile Normalization) corrects for a covariate (e.g., GC-content) and feature length biases (e.g., gene length). The QR fit of the covariate and feature length are provided as plots (`normCQN_QRfit.png`).
-  - [VOOM](https://rdrr.io/bioc/limma/man/voom.html) (Mean-Variance Modeling at the Observational Level) from the package limma estimates the mean-variance relationship of the log-counts and generates a precision weight for each observation. The Mean-Variance trend plot is provided (`normVOOM_mean_variance_trend.png`).
+  - [CQN](https://bioconductor.org/packages/release/bioc/html/cqn.html) (Conditional Quantile Normalization) corrects for a covariate (e.g., GC-content) and feature-length biases (e.g., gene length). The QR fit of the covariate and feature-length are provided as plots (`normCQN_QRfit.png`).
+  - [VOOM](https://rdrr.io/bioc/limma/man/voom.html) (Mean-Variance Modeling at the Observational Level) from the package limma estimates the mean-variance relationship of the log counts and generates a precision weight for each observation. The Mean-Variance trend plot is provided (`normVOOM_mean_variance_trend.png`).
   - All normalization outputs are log2-normalized.
 - Integrate (`*_reComBat.csv`)
   - The data can be integrated using the [reComBat](https://github.com/BorgwardtLab/reComBat) method, which requires log-normalized data.
   - This method adjusts for batch effects and unwanted sources of variation while trying to retain desired sources of variation e.g., biological variability.
   - This is particularly useful when combining data from different experiments or sequencing runs.
+  - Use as few variables as possible for the (un)wanted parameters, as they often correlate (e.g., sequencing statistics) and can dilute the model's predictive/corrective power across multiple variables.
+    - For unwanted sources of variation, start with the strongest confounder; this is often sufficient.
+    - For wanted sources of variation, combine all relevant metadata into a single column (e.g., `condition`) and use only this.
   - Note: Due to a [reComBat bug](https://github.com/BorgwardtLab/reComBat/issues/3), a numerical confounder can only be corrected if at least one categorical confounder is also declared.
     - Using the same variable for both `batch` and `categorical confounder` parameters can cause opposite batch effects.
     - We recommend addressing the numerical confounder in downstream analyses, such as within a differential analysis model.