Report shows wrong taxonomic classification stats for QIIME with UNITE #652

Closed
d4straub opened this issue Oct 23, 2023 · 3 comments
Labels
bug Something isn't working

d4straub commented Oct 23, 2023

Description of the bug

When using UNITE fungi with QIIME2 for taxonomic classification, the statistics in the summary report (results/summary_report/summary_report.html) show 100% classification for rank "Kingdom" and 0% for all other ranks.

This is because the UNITE database contains taxonomy strings that separate ranks with ";" only (no space after the semicolon), such as

k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Aspergillus;s__Aspergillus_penicillioides
k__Fungi
k__Fungi;p__Ascomycota

while Greengenes 16S (version 13_8) produces taxonomy strings that separate ranks with "; " (semicolon plus space), such as

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Gallionellales; f__Gallionellaceae
k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Rhodoferax; s__

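and the parsing for the report only takes the Greengenes format into account. It counts ranks by counting "; " separators, so every UNITE string is counted as having a single rank: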

# Remove greengenes85 ".__" placeholders
df = as.data.frame(lapply(asv_tax, function(x) gsub(".__", "", x)))
# remove all last, empty ;
df = as.data.frame(lapply(df, function(x) gsub(" ;","",x)))
# remove last remaining, empty ;
df = as.data.frame(lapply(df, function(x) gsub("; $","",x)))
# get maximum amount of taxa levels per ASV
max_taxa <- lengths(regmatches(df$Taxon, gregexpr("; ", df$Taxon)))+1
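
To illustrate the mismatch, here is a minimal sketch of a separator-agnostic rank count. It is not the fix that went into the pipeline. `old_count` and `count_ranks` are hypothetical helper names introduced here, and the sketch assumes rank prefixes always have the form of one lowercase letter followed by two underscores.

```r
# Counting as in the snippet above: number of "; " separators plus one.
# (Hypothetical helper name, for illustration only.)
old_count <- function(taxon) {
  lengths(regmatches(taxon, gregexpr("; ", taxon))) + 1
}

# Hypothetical alternative: split on ";" followed by optional whitespace,
# strip rank prefixes such as "k__", and count the non-empty ranks.
count_ranks <- function(taxon) {
  ranks <- strsplit(taxon, ";\\s*")
  vapply(ranks, function(r) {
    r <- sub("^[a-z]__", "", trimws(r))
    sum(nzchar(r))
  }, integer(1))
}

unite <- c(
  "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Aspergillus;s__Aspergillus_penicillioides",
  "k__Fungi",
  "k__Fungi;p__Ascomycota"
)
greengenes <- c(
  "k__Bacteria; p__Proteobacteria; c__Betaproteobacteria",
  "k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__"
)

old_count(unite)                 # 1 1 1  (every UNITE string looks like Kingdom only)
count_ranks(c(unite, greengenes)) # 7 1 2 3 6
```

With the UNITE strings, the original counting never finds a "; " separator, which matches the 100% Kingdom / 0% elsewhere pattern seen in the report. The sketch also treats an empty trailing placeholder such as "s__" as unassigned.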

The other taxonomic classifications I ran, i.e. DADA2 with UNITE-Fungi, Kraken2, and SINTAX with UNITE-Fungi (see the command below), were reported correctly.

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.7.0 -profile cfc --FW_primer CTTGGTCATTTAGAGGAAGTAA --RV_primer GCTGCGTTCTTCATCGATGC --input_fasta "ASV_seqs.fasta" --min_len_asv 1 --dada_ref_taxonomy "unite-fungi=9.0" --sintax_ref_taxonomy "unite-fungi=9.0" --kraken2_ref_tax_custom "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20231009.tar.gz" --kraken2_assign_taxlevels "D,P,C,O,F,G,S" --qiime_ref_taxonomy "unite-fungi" --outdir reclassification

Relevant files

No response

System information

No response

d4straub added the bug label Oct 23, 2023

d4straub commented Oct 23, 2023

This doesn't really fit here, but the stats of the length filter also seem off:
I used --min_len_asv 1 (as above, just to get the distribution figure in the report) and the report says

Filtering omitted all ASVs with length lower than 1 bp.

The number of ASVs was reduced by 27.5 ( 1.51 %), from 1817.5 to 1790 ASVs.

which isn't right, because the input file already contained 1790 ASVs and no ASV was removed.
The figure itself seems to be fine.


d4straub commented Oct 25, 2023

Documentation issues:

@d4straub d4straub mentioned this issue Nov 9, 2023
10 tasks
d4straub commented

This is in dev now, closing the issue.
