This document describes the output produced by the pipeline.
All samba workflow results are stored in a global analysis report available in results/[projectName]/00_report/SAMBA_report.html. This report is based on a Jinja2 template and gives a synthesis of the community profiles and characteristics of your dataset:
- Bioinformatic processes are described with the software versions, the parameters used and the important results for each step.
- Statistical analysis results can be quickly compared for each variable of interest to understand environmental or experimental effects and the similarities and differences between samples.
Here is a sample report produced with the samba pipeline (animated GIF image):
The sections below describe the content of such a report.
The pipeline is built using Nextflow and processes data using the following steps:
- Data integrity - Raw data integrity checking
- Importing raw data - Create QIIME2 objects
- Primers removal - Remove primers from raw reads
- QC and feature table - QC and feature table and counts table
- ASV clustering - [OPTIONAL] Distribution and phylogeny based clustering
- Taxonomic assignation - QIIME2 naive Bayes classifier assignment
- Taxonomy filtering - [OPTIONAL] Filtering ASV table and sequences based on taxonomic assignment
- Samples decontamination - [OPTIONAL] Samples decontamination based on control samples
- Phylogeny - ASV sequences alignment and tree
- Differential abundance - ANCOM analysis
- Functional predictions - [OPTIONAL] PICRUSt2 functional predictions
- Data preparation - Create R-Phyloseq object
- Alpha diversity - [OPTIONAL] Communities intra-specific diversity
- Beta diversity - [OPTIONAL] Communities inter-specific diversity
- Descriptive Comparisons - [OPTIONAL] Based on UpSetR graph
- Global analysis report - Synthesis and results of communities analysis
[OPTIONAL]
A Bash script checks the integrity of the raw sequencing data and metadata files.
- The demultiplexing control checks that the barcodes in the read names are identical within a sample file and match the metadata file.
- The multiple sequencer detection checks that the sequencer names in the read names are identical within a sample file.
- The primer ratio control checks that at least 70% of the raw read sequences within a sample contain the sequencing primer.
- The headers of the metadata file are checked against the QIIME2 metadata requirements.
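The primer ratio and multiple sequencer checks listed above can be sketched in Python as follows. This is only an illustration, not the Bash script used by samba: the function names and toy data are hypothetical, the primer is matched as an exact substring (no degenerate bases), and the sequencer name is assumed to be the first colon-separated field of an Illumina-style read header.

```python
def primer_ratio(reads, primer):
    """Return the fraction of reads that contain the primer sequence."""
    if not reads:
        return 0.0
    hits = sum(1 for seq in reads if primer in seq)
    return hits / len(reads)


def sequencer_names(headers):
    """Collect instrument names (assumed: first ':'-separated header field)."""
    return {h.lstrip("@").split(":")[0] for h in headers}


# Hypothetical toy data, for illustration only.
reads = ["ACGTGGCCTA", "ACGTTTGACA", "TTGACAACGT", "GGGGCCCCAA"]
headers = [
    "@M00176:65:000000000-A41FR:1:1101:10000:1000 1:N:0:ACGT",
    "@M00176:65:000000000-A41FR:1:1101:10001:1001 1:N:0:ACGT",
]

ratio = primer_ratio(reads, primer="ACGT")
print(f"primer ratio: {ratio:.2f}, passes 70% threshold: {ratio >= 0.70}")
print(f"single sequencer: {len(sequencer_names(headers)) == 1}")
```

A sample would fail the primer ratio control in this sketch whenever the returned ratio falls below 0.70.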
Parameters specific to data integrity can be set for custom samba usage.
A data integrity CSV report, data-integrity.csv, is produced in the pipeline output directory: results/[projectName]/steps_data/01_data_integrity
The QIIME2 import step creates:
- a QIIME2 object using the QIIME2 manifest and metadata input files.
- QIIME2 HTML statistics giving an overview of the read counts.
- QIIME2 HTML quality plots of the raw read sequences.
The QIIME2 import report index.html is available in the output directory: results/[projectName]/00_report/import_output
QIIME2 Cutadapt removes primers from the raw sequences and generates quality plots of the cleaned reads and read counts for each sample. The output report contains the same graphs as the ones created in the data import step.
Parameters specific to Cutadapt can be set for custom samba usage.
The QIIME2 Cutadapt report index.html is available in the output directory: results/[projectName]/00_report/trimmed_output
The inference of ASVs (Amplicon Sequence Variants) is performed using the QIIME2 DADA2 algorithm. DADA2 can filter and trim the cleaned reads before running an error-model learning algorithm, which corrects the reads if necessary before the read quality control and feature table are created. Then, the reads of each ASV sequence are merged (when running in paired-end mode) and chimeras are removed.
Parameters specific to DADA2 can be set for custom samba usage.
The output directory results/[projectName]/00_report/dada2_output contains:
- the QIIME2 DADA2 report index.html with the remaining number of sequences and ASVs in each sample.
- the QIIME2 DADA2 report sample-frequency-detail.html with interactive ASV counts for each sample metadata.
- the QIIME2 DADA2 report feature-frequency-detail.html with ASV frequency and observation counts in each sample.
- all ASV sequences in a FASTA file: sequences.fasta
- a BIOM count table: feature-table.biom
[OPTIONAL]
The QIIME2 dbOTU3 plugin clusters ASV sequences based on their distribution across samples and on their phylogeny.
The output directory results/[projectName]/00_report/dbotu3_output contains:
- the QIIME2 dbOTU3 report index.html with sample and feature frequencies.
- all ASV sequences in a FASTA file: sequences.fasta
- a BIOM count table: feature-table.biom
The QIIME2 feature-classifier plugin uses a Naive Bayes classifier that can either be used with a global marker reference database or be trained only on the region of the target sequences. Check the available parameters for this step.
The output directory results/[projectName]/00_report/taxo_output contains:
- the QIIME2 taxonomy report index.html with the ASV list, taxonomic assignments and confidence scores.
- the merged counts and taxonomy for each ASV in a TSV file: ASV_taxonomy.tsv
[OPTIONAL]
The QIIME2 taxa plugin excludes user-defined unwanted taxa from the count table and from the representative ASV sequences.
[OPTIONAL]
The microDecon R package is used to remove from the experimental samples the contamination identified in the control samples. The control samples and the number of samples to decontaminate are specified in the samba parameters.
The output directory results/[projectName]/00_report/microDecon contains:
- the sequences of the ASVs concerned: decontaminated_ASV.fasta
- the decontaminated count table in TSV format: decontaminated_ASV_table.tsv
- the list of removed ASVs: ASV_removed.txt
- the abundance of the removed ASVs: abundance_removed.txt
QIIME2 sequence alignment and phylogeny are performed with the MAFFT and FastTree algorithms.
The output directory results/[projectName]/00_report/tree_export_dir contains:
- the ASV phylogenetic tree in Newick format: tree.nwk
The QIIME2 ANCOM analysis compares the composition of microbiomes and identifies ASVs that differ in abundance. The ANCOM variable can be specified in the samba parameters.
The output directory results/[projectName]/00_report/ancom_output contains:
- the ANCOM analysis report: export_ancom_[ancom_var]/index.html
- the ANCOM analysis report at family level: export_ancom_[ancom_var]_family/index.html
- the ANCOM analysis report at genus level: export_ancom_[ancom_var]_genus/index.html
[OPTIONAL]
The QIIME2 PICRUSt2 plugin is used to obtain EC, KO and MetaCyc pathway predictions based on 16S data. The PICRUSt2 HSP method and NSTI cut-off can be modified in the workflow parameters.
The output directory results/[projectName]/00_report/picrust2_output contains:
- an NMDS for each of the EC, KO and MetaCyc pathway predictions for the selected variable.
- a PICRUSt2 analysis report q2-picrust2_output/pathway_abundance_visu/index.html with pathway frequencies.
To perform the diversity analyses, an R phyloseq object is created from the count table, the sample metadata description and the ASV phylogenetic tree.
The output directory results/[projectName]/00_report/R/DATA contains the phyloseq object and the count table, ready for statistical analysis.
To evaluate the intra-specific diversity of the samples, several diversity indexes are calculated:
- Observed: the sample richness, i.e. the number of different ASVs per sample.
- Chao1: the predicted richness index.
- InvSimpson: the inverse of the probability that two sequences taken at random from a sample belong to the same taxon.
- Shannon: the entropy index, which reflects the specific diversity of the sample. The higher the index, the higher the diversity and equitability.
- Pielou: the community equitability index.
Then, taxonomic barplots from phylum to genus are produced.
Alpha diversity parameters can be specified in the workflow.
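As a rough illustration of the indexes listed above, the following Python sketch computes them from the ASV counts of a single sample. It is not the code samba runs (the workflow relies on R): the function names are hypothetical, natural logarithms are assumed for Shannon, and Chao1 is shown in its bias-corrected form.

```python
import math


def observed(counts):
    """Richness: number of ASVs with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon entropy with natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inv_simpson(counts):
    """Inverse of the probability that two random reads share the same ASV."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


def pielou(counts):
    """Evenness: Shannon entropy divided by its maximum, ln(richness)."""
    s = observed(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


# Hypothetical sample with four equally abundant ASVs.
even_sample = [10, 10, 10, 10]
print(observed(even_sample), inv_simpson(even_sample), pielou(even_sample))
```

For a perfectly even sample, InvSimpson equals the richness and Pielou reaches its maximum, which is why both are read as evenness-sensitive indexes.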
The output directory results/[projectName]/00_report/R/FIGURES/alpha_diversity contains:
- a sample rarefaction curve: rarefaction_curve.png
- the diversity index plots: diversity_index/alpha_div_[VARNAME].png
- the taxonomic barplots, from Phylum to Genus: diversity_barplots/[VARNAME]
Sample distances are evaluated through beta diversity analyses. The ASV count table will be normalized to calculate beta diversity distance matrices.
Four normalization methods are used in samba:
- No normalization: the beta diversity is calculated on raw ASV counts. Warning: we do not recommend using these results for your data interpretation; this option is meant to help you select the normalization method that best fits your dataset.
- Rarefaction: rarefaction consists in reducing the number of sequences in every sample to the size of the smallest sample. This method is recommended if all your samples have roughly the same number of sequences. Beware: if some samples have a low number of sequences and others a high number, you could lose diversity and end up with a misleading interpretation of the results.
- Bioconductor DESeq2 normalization: DESeq2 has been widely used in RNA-seq analysis to detect differential gene expression. This method can also be used to normalize metabarcoding data in order to evaluate whether an ASV is more or less present across samples. Keep in mind that the resulting normalized table will contain positive and negative values and will therefore not be usable as input for further analysis.
- Bioconductor metagenomeSeq CSS: Cumulative Sum Scaling returns a matrix normalized by scaling counts up to and including the pth quantile. The method gives more weight to rare species.
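The rarefaction method described above can be illustrated with a short Python sketch. It is a simplified stand-in for what the R packages do, under stated assumptions: the table layout, sample names and `rarefy` helper are hypothetical, and reads are drawn without replacement down to the depth of the smallest sample.

```python
import random


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's ASV counts."""
    # Expand the counts into one entry per read, then subsample.
    pool = [asv for asv, n in counts.items() for _ in range(n)]
    picked = rng.sample(pool, depth)
    rarefied = {asv: 0 for asv in counts}
    for asv in picked:
        rarefied[asv] += 1
    return rarefied


# Hypothetical count table: sample -> {ASV: count}.
table = {
    "S1": {"ASV1": 50, "ASV2": 30, "ASV3": 20},
    "S2": {"ASV1": 5, "ASV2": 0, "ASV3": 15},
}

# The rarefaction depth is the size of the smallest sample.
depth = min(sum(counts.values()) for counts in table.values())
rng = random.Random(42)  # fixed seed so the subsampling is reproducible
rarefied = {sample: rarefy(counts, depth, rng) for sample, counts in table.items()}

for sample, counts in rarefied.items():
    print(sample, sum(counts.values()), counts)
```

In this toy table S1 loses 80% of its reads, which shows why very uneven sequencing depths can discard rare ASVs from the deeper samples.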
For each normalization method, four distance matrices are calculated:
- Jaccard distance is a qualitative measure based only on whether each ASV is present or not. An ASV is coded as 0 if it is absent from the sample or 1 if it is present, regardless of whether it is rare or abundant.
- Bray-Curtis distance is a quantitative measure based on the abundance of each ASV across the samples. If two samples share the same communities, their Bray-Curtis distance will be equal to 0, whereas it will tend to 1 if the communities differ between the samples.
- UniFrac distance is a qualitative distance based on the phylogenetic tree branches shared by the samples.
- Weighted UniFrac distance is a quantitative distance based on ASV abundance and on the phylogenetic tree branches shared by the samples.
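The two non-phylogenetic distances above can be written in a few lines of Python. This is a minimal sketch with hypothetical function names and toy count vectors, not the R implementation used by the workflow; the UniFrac variants are left out because they require the phylogenetic tree.

```python
def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Bray-Curtis distance: sum of |a_i - b_i| over the sum of all counts."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical ASV count vectors (same ASV order in every sample).
s1 = [10, 0, 5, 3]
s2 = [10, 0, 5, 3]  # identical community
s3 = [0, 8, 0, 0]   # no ASV shared with s1
s4 = [5, 0, 5, 0]   # partially overlapping community

for name, other in (("s2", s2), ("s3", s3), ("s4", s4)):
    print(name, round(jaccard(s1, other), 3), round(bray_curtis(s1, other), 3))
```

Identical samples give 0 for both distances and samples with no shared ASV give 1, matching the behaviour described in the list.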
These distance matrices are represented through PCoA and NMDS ordination plots (including the ADONIS test). A hierarchical clustering of the samples is also provided by samba.
Beta diversity parameters can be specified in the workflow.
The output directory results/[projectName]/00_report/R/FIGURES/beta_diversity_[NORM_METHOD] contains 4 directories and the test result files:
- PCoA, with the PCoA plot images (png and svg formats) for each distance matrix.
- NMDS, with the NMDS plot images (png and svg formats) for each distance matrix.
- Hierachical_Clustering, with the hierarchical clustering plot images (png and svg formats) for each distance matrix, using the clustering method set in the samba parameters.
- ExpVar, with pie chart images (png and svg formats) for each distance matrix, representing the percentage of variance explained by each experiment variable.
- Files variance_signifiance_tests_[NORM_METHOD].txt for each distance (i.e. jaccard, bray, unifrac, wunifrac), with the Adonis test results combining the experiment variables.
Here are some examples of the beta diversity plots available in the samba workflow (example with the DESeq2 normalization method and the Bray-Curtis distance matrix), based on the selected experiment variable transect_name:
- Pie chart: percentage of explained variance for each experiment variable
- PCoA plot
- NMDS plot
- Hierarchical clustering with the Ward.D2 method
This step is based on the UpSetR package and provides an alternative to Venn diagrams for dealing with more than 3 sets.
Descriptive comparison parameters can be specified.
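To make the idea behind such a graph concrete, the Python sketch below counts the ASVs found in each exact combination of sample groups, which is the kind of quantity an UpSet plot displays as bars. It is an illustration only: the group names, ASV identifiers and `exclusive_intersections` helper are hypothetical, and exclusive intersections (ASVs present in the listed groups and in no other) are assumed.

```python
from itertools import combinations


def exclusive_intersections(groups):
    """Count ASVs present in exactly each combination of groups."""
    names = sorted(groups)
    result = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # ASVs shared by every group of the combination...
            inside = set.intersection(*(groups[n] for n in combo))
            # ...minus those also found in any other group.
            outside = set().union(*(groups[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                result[combo] = len(exclusive)
    return result


# Hypothetical ASV sets for four sample groups.
groups = {
    "A": {"ASV1", "ASV2", "ASV3"},
    "B": {"ASV2", "ASV3", "ASV4"},
    "C": {"ASV3", "ASV5"},
    "D": {"ASV3", "ASV6"},
}

for combo, size in exclusive_intersections(groups).items():
    print(" & ".join(combo), size)
```

With four groups a Venn diagram already becomes hard to read, whereas this tabular view scales to any number of sets.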
The output directory results/[projectName]/00_report/R/FIGURES/descriptive_comparison contains the UpSetR graph images in png and svg formats.
In the test dataset, this graph makes it possible to compare the number of ASVs and their abundance between the sample groups of the selected variable, relative to the total number of ASVs per sample group.