CDCgov/phoenix: Changelog

Below is the list of changes to PHoeNIx (phx) since its initial release. Because a fix can span multiple commits, the linked commit is the point at which the fix was complete. Sometimes additional changes are needed later, so commits give an approximate reference for the fix. Check the commits on the specific file of interest if the commit link seems off.

v1.0.0 (10/12/2022)

🎉First official release.🎉

Full Changelog

v1.1.0 (03/06/2023)

Full Changelog

Implemented Enhancements:

  • Default branch set to main thanks @erinyoung #84.
  • Added emits to allow linking of workflows to close #42 #e32132d.
  • MLST output is now scanned for completeness of profiles: any allele tags are consolidated into the ST column for easier scanning, and known paralog alleles are marked for easier identification. In the CDC_PHOENIX workflow, ST types are consolidated, if applicable, to show concordance between tools.
  • Addition of 🔥🐎🐦🔥 GRiPHin: General Report Pipeline from PHoeNIx output to -entry CDC_PHOENIX #6291e9c. This was implemented to replace a common report generated internally, which is why it is only in -entry CDC_PHOENIX.
  • Relative paths can now be passed for the kraken2 and BUSCO databases rather than requiring full paths #ecb3618 and #d938a64.
  • Phoenix_Output_Report.tsv now has antibiotic genes and plasmid markers filtered to ensure quality #d0fa32c.
    • Plasmid markers require >=60% length and >=98% identity to be reported
    • Antibiotic Genes require >=90% length and >=98% identity to be reported
  • AMRFinder+ point mutations are now included in Phoenix_Output_Report.tsv under the column AMRFinder_Point_Mutations.
  • In determine_taxID.sh, the upper taxonomy lineage now uses the NCBI names and nodes files, which allows nearly all possible taxonomies to be assigned, compared to the very limited options with the previous taxes.csv file.
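
The reporting cut-offs for plasmid markers and antibiotic genes listed above can be sketched as a small filter. This is only an illustration, not the pipeline's code: the function name `passes_report_filter`, the `kind` labels and the example hit values are hypothetical; only the percentages come from the list above.

```python
# Hypothetical sketch of the reporting cut-offs; names are illustrative
# and are not taken from the PHoeNIx source.
THRESHOLDS = {
    "plasmid": {"min_length_pct": 60.0, "min_identity_pct": 98.0},
    "antibiotic": {"min_length_pct": 90.0, "min_identity_pct": 98.0},
}


def passes_report_filter(kind, length_pct, identity_pct):
    """Return True if a hit meets the length and identity cut-offs for its kind."""
    limits = THRESHOLDS[kind]
    return (length_pct >= limits["min_length_pct"]
            and identity_pct >= limits["min_identity_pct"])


# The same hit is reportable as a plasmid marker but not as an antibiotic gene.
print(passes_report_filter("plasmid", 65.0, 98.5))
print(passes_report_filter("antibiotic", 65.0, 98.5))
```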

Output File Changes:

  • Removed spaces in the header of the *_all_genes.tsv file from AMRFinder+ output and replaced them with underscores to allow friendlier parsing #fd048d1.
  • Fixed error causing PROKKA output to not be in Annotation folder #d014aa0.
  • Added headers to 2 files: *.fastANI.txt and *.wtasmbld_summary.txt.
  • Also, added headers to phoenix_line_summary.tsv see wiki for details.
  • MLST final output that includes different headers and organization was renamed to *_combined.tsv, which includes srst2 types, if applicable, paralog tags, and any extra allele/profile tags.
  • Taxonomy file now includes NCBI TaxID at each standard level. An example species line would look like this: "s:287 aeruginosa"
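
A line in that format can be split into its parts with a few lines of Python. This is a minimal sketch that assumes every level follows the `rank:taxID name` pattern of the species example above; `parse_tax_line` is a hypothetical helper, not part of PHoeNIx.

```python
def parse_tax_line(line):
    """Split a line such as 's:287 aeruginosa' into (rank, taxid, name)."""
    rank, rest = line.strip().split(":", 1)
    # The taxID is everything up to the first space; the rest is the name.
    taxid, _, name = rest.partition(" ")
    return rank, taxid, name.strip()


print(parse_tax_line("s:287 aeruginosa"))  # ('s', '287', 'aeruginosa')
```

The taxID is kept as a string so that a missing or non-numeric value does not raise an error.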

Fixed Bugs:

  • Edit to allow nf-tower to work #b21d61f
  • Fixed pipeline failure when prokka throws error for sample names being too long (Error: ID must <= 37 chars long) #e48e01f. Now sample name length doesn't matter.
  • Fixed bug where samples wouldn't end up in the Phoenix_Output_Report.tsv because srst2 didn't find any AR genes, so the file wasn't created. Now a blank file is created and the remaining sample information is in the Phoenix_Output_Report.tsv #2f52edc. This change only occurred in -entry CDC_PHOENIX.
  • Fixed issue where cp error was thrown when relative path was given for output directory #0c0ca55 and #d938a64.
  • MLST PARALOGS for Acinetobacter baumannii are suppressed in the GRiPHin report as they are...

Database Updates:

Container Updates:

  • MLST updated from 2.22.1 to 2.23.0.
  • BBTools updated from 38.96 to 39.01.
  • AMRFinder+ was updated from 3.10.40 to 3.10.45.
  • Scripts that utilize the phoenix_base container were updated to quay.io/jvhagey/phoenix:base_v1.1.0, which had the python library xlsxwriter added to it for GRiPHin.py.

v1.1.1 (03/21/2023)

Full Changelog

Implemented Enhancements:

  • -entry CDC_PHOENIX workflow checks all FASTQ files for corruption and creates a list of the checked files using the FAIry (FASTQ file Assessment of Integrity) tool commit 1111df8. This is a required internal QC check.
  • Expanded MLST lookup of the Citrobacter species complex; commit 43ea24d lists the new species.
  • Increased SPAdes CPUs to 8 and memory to 16GB in base.config.

Fixed Bugs:

  • Fix for issue #99 where the first gene in the ar, plasmid and hypervirulence genes didn't end up in the *_summaryline.tsv. The same error was in Phoenix_summary_line.py, which caused the first sample to not be included in the final report.
  • Fixed tabulation error into *_combined.tsv output files that in some cases would show in GRiPHin_Report.xlsx output as a long singular line as the MLST type.
  • Fix for issue #91 where Klebsiella MLST lookup would not properly match to the correct lookup database.
  • Fixed problem where samples that didn't create scaffolds, but created contigs didn't have species printed out in Phoenix_Output_Report.tsv details in commit c7f7ea5.
  • Fixed problem in -entry CDC_PHOENIX where samples that didn't create scaffolds, but created contigs or samples that failed spades completely didn't have correct columns lining up in Phoenix_Output_Report.tsv details in commit d17bdda.

v2.0.0 (07/14/2023)

Full Changelog

Implemented Enhancements:

  • entry point for scaffolds added using either -entry SCAFFOLDS or -entry CDC_SCAFFOLDS that runs everything post SPAdes step. New input parameters --indir and --scaffold_ext added for functionality of this entry point commit f12da60.
    • Supports scaffold files from shovill, spades and unicycler.
  • entry point for sra added using either -entry SRA or -entry CDC_SRA. These entry points will pull samples from SRA based on what is passed to --input_sra, which is a file with one SRR number per line commit a86ad3f.
  • Check now performed on input samplesheets to confirm the same sample id, forward read and reverse read aren't used multiple times in the samplesheet commit fd6127f.
  • Changed many modules to process_single rather than process_low to reduce resource requirements for these steps.
  • Updates to run PHX on nf-tower with an AWS back-end. Also, updated tower.yml file to have working reports.
  • AMRFinder+ was updated to v3.11.11, which allows point mutation calling for Burkholderia cepacia species complex, Burkholderia pseudomallei species complex, Serratia marcescens and Staphylococcus_pseudintermedius.
  • Argument --coverage added. It can be passed to increase the coverage cut-off below which a sample fails minimum QC standards (default is 30x).
  • Public Kraken2 database is required rather than requesting from sharefile. For PHoeNIx >=2.0.0 you will need to download the public Standard-8 version kraken2 database created on or after March 14th, 2023 from Ben Langmead's github page. You CANNOT use an older version of the public kraken databases on Ben Langmead's github page. We thank @BenLangmead and @jenniferlu717 for taking the time to include an extra file in public kraken databases created after March 14th, 2023 to allow them to work in PHoeNIx!
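
The samplesheet check described above (no sample id, forward read or reverse read used more than once) can be sketched as follows. This is an assumption-laden illustration, not the pipeline's implementation: the column names `sample`, `fastq_1` and `fastq_2` and the function `find_duplicates` are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicates(samplesheet_text, columns=("sample", "fastq_1", "fastq_2")):
    """Return {column: sorted list of values that appear more than once}."""
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    duplicates = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[column] = repeated
    return duplicates


# Assumed column layout; the third row reuses a sample id and a reverse read.
sheet = """sample,fastq_1,fastq_2
A,a_R1.fastq.gz,a_R2.fastq.gz
B,b_R1.fastq.gz,b_R2.fastq.gz
A,c_R1.fastq.gz,b_R2.fastq.gz
"""
print(find_duplicates(sheet))  # {'sample': ['A'], 'fastq_2': ['b_R2.fastq.gz']}
```

An empty dictionary means the samplesheet has no repeated entries in the checked columns.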

Output File Changes:

  • The folder fastqc was changed to fastqc_trimd to clarify it contains results from the trimmed data.
  • PROKKA module now outputs .fsa file (nucleotide file of genes) rather than .fna as the .fna file is really just the assembly file again.
  • Added version for base container information for FAIRY, ASSET_CHECK, FORMAT_ANI, FETCH_FAILED_SUMMARIES, CREATE_SUMMARY_LINE, GATHER_SUMMARY_LINES, and GENERATE_PIPELINE_STATS. This was added to software_versions.yml.
  • Changing the file/folder structure of some files for clarity and to make it less cluttered:
    • Folders Annotation and Assembly were changed to annotation and assembly respectively to keep continuity.
    • Files kraken2_asmbld/*.unclassified.fastq.gz and kraken2_asmbld/*.classified.fastq.gz were changed to kraken2_asmbld/*.unclassified.fasta.gz and kraken2_asmbld/*.classified.fasta.gz as they are actually fasta files.
    • *.fastANI.txt --> moved from ~/ANI/fastANI to ~/ANI.
    • The file *_trimmed_read_counts.txt that was in fastp_trimd was moved to the folder qc_stats.
    • Files *_fastqc.zip and *_fastqc.html in folder fastqc_trimd moved to qc_stats.
    • *.bbduk.log --> moved from ~/removedAdapters to ~/${sample}/qc_stats and removedAdapters is no longer an output folder.
    • raw_stats folder was created and contains ${sample}_raw_read_counts.txt and ${sample}_FAIry_synopsis.txt, previously these were in the folders fastp_trimd and FAIry, respectively.
  • Sample GC% added to *_GC_content_20230504.txt file.
  • *_trimmed_read_counts.txt has Paired_Sequenced_[reads] column added as Total_Sequenced_[reads] is the number of the paired sequences and singletons.
  • Files produced from FastANI, MASH and FORMAT_ANI had the mash database's date appended to the file name for tracking and validation. Files are now named *${sample}_REFSEQ_20230504.ani.txt, ${samplename}_REFSEQ_20230504.fastANI.txt, ${samplename}_REFSEQ_20230504_best_MASH_hits.txt and ${samplename}_REFSEQ_20230504.txt.
  • GRiPHin file updates
    • New columns for WARNINGS, ALERTS, Minimum_QC_Issues, Total_Raw_[reads], Paired_Trimmed_[reads] and GC%.
    • New column Primary_MLST_Source was added to show if the assembly (MLST program) or reads (SRST2) was used for MLST determination.
    • Auto_PassFail and PassFail_Reason were changed to Minimum_QC_Checks and Minimum_QC_Issues, respectively. This was to clarify that these are minimum requirements for QC.
    • The column Total_Sequenced_[bp] was removed from the report for lack of utility.
    • Q30_R1_[%], Q30_R2_[%], and Total_Sequenced_[reads] were relabelled as Raw_Q30_R1_[%], Raw_Q30_R2_[%] and Total_Trimmed_[reads], respectively for clarity.

Fixed Bugs:

  • Added module GET_RAW_STATS to get raw stats. Previously this information was pulled from the FASTP_TRIMD step; however, the input data there was post BBDUK, which removes PhiX reads and adapters. Thus, the previous raw count was slightly off.
  • Fixed python version information not showing up for GET_TAXA_FOR_AMRFINDER and GATHERING_TRIMD_READ_QC_STATS. This was added to software_versions.yml.
  • Fixed issue where sample names with an underscore in them caused incorrect parsing and the contig number not showing up in GRiPHin reported genes commit a0fdff5.
  • Fixed AttributeError: 'DataFrame' object has no attribute 'map' error that came up in the GRiPHin step when your set of samples had both a macrolide and macrolide_lincosamide_streptogramin AR gene commit 460bdbc.
  • Phoenix_Output_Report.tsv was reporting %Coverage for FastANI in the Taxa_Confidence column rather than %ID. Now both are reported when FastANI is successful commit 3b26fec.
  • GRiPHin_Report.xlsx was switched from reporting rounded numbers for coverage/similarity % to reporting the floor, because reporting 100% when the actual number is 99.5% is misleading and doesn't alert the user to SNPs in genes. With the floor, 99.5% is now reported as 99% commit 5477627.
  • Corrected GAMMA modules not printing the right version in the software_versions.yml file commit 5477627.
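
The difference between rounding and taking the floor, as described for GRiPHin_Report.xlsx above, is easy to see in plain Python. This only illustrates the arithmetic and is not the code used by GRiPHin.

```python
import math

percent_identity = 99.5

# Rounding hides the mismatch: the value is reported as a perfect hit.
print(round(percent_identity))       # 100

# The floor keeps the value below 100, signalling that the hit is not exact.
print(math.floor(percent_identity))  # 99
```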

Database Updates:

  • Curated AR gene database was updated on 2023-05-17 (yyyy-mm-dd) which includes:
  • Updated AMRFinder Database used by AMRFinder+ and GAMMA to v2023-04-17.1.
  • The SRST2_MLST and MLST steps now use the mlst_db provided in ~/phoenix/assets/databases. This database is static and no longer pulls updates from PubMLST.org, which keeps the pipeline running when PubMLST.org is down and keeps the schemes from changing if you run the same sample at different times. This was implemented to deal with PubMLST.org being down fairly often and with pipeline validation in mind.

Container Updates:

  • AMRFinder+ was updated from 3.10.45 to 3.11.11.
  • BUSCO was updated from 5.4.3 to 5.4.7.
  • MultiQC was updated from 1.11 to 1.14.
  • MLST was updated from 2.22.1 to 2.23.0.

v2.0.1 (07/14/2023)

Full Changelog

Implemented Enhancements:

  • Updated nextflow tower scheme that describes inputs.

Fixed Bugs:

  • Typo fix and changed branch called in Terra task that caused Terra version to crash.

v2.0.2 (08/03/2023)

Full Changelog

Implemented Enhancements:

  • Added handling for -entry SCAFFOLDS and CDC_SCAFFOLDS to accept assemblies from Trycycler and Flye commit 31cb573.
  • Added tsv version of GRiPHin_Summary.xlsx

Output File Changes:

  • GRiPHin_samplesheet.csv changed to Directory_samplesheet.csv commit b39d8d7
  • In response to feedback from compliance program, "report" is being replaced by "summary" in file names to avoid confusion regarding the difference between public health results (i.e. summary) and diagnostic results (i.e. report) commit b39d8d7
    • GRiPHin_Report.xlsx changed to GRiPHin_Summary.xlsx
    • Phoenix_Output_Report.tsv changed to Phoenix_Summary.tsv
    • quast/${samplename}_report.txt changed to quast/${samplename}_summary.tsv
    • kraken2_trimd/${samplename}.trimd_summary.txt changed to kraken2_asmbld/${samplename}.kraken2_trimd.top_kraken_hit.txt
    • kraken2_asmbld/${samplename}.asmbld_summary.txt changed to kraken2_asmbld/${samplename}.kraken2_asmbld.top_kraken_hit.txt
    • kraken2_asmbld_weighted/${samplename}.wtasmbld_summary.txt changed to kraken2_asmbld/${samplename}.kraken2_wtasmbld.top_kraken_hit.txt
    • kraken2_trimd/${samplename}.kraken2_trimd.report.txt changed to kraken2_trimd/${samplename}.kraken2_trimd.summary.txt
    • kraken2_asmbld/${samplename}.kraken2_asmbld.report.txt changed to kraken2_asmbld/${samplename}.kraken2_asmbld.summary.txt
    • kraken2_asmbld_weighted/${samplename}.kraken2_wtasmbld.report.txt changed to kraken2_asmbld_weighted/${samplename}.kraken2_wtasmbld.summary.txt

Fixed Bugs:

  • For MLST when final alleles were assigned, PHX called 100% match despite 1 allele not being a match.
  • MLST step not using the custom database. A custom MLST container was added with this database included.

Container Updates:

  • MLST version remains the same, but a custom database was added so that it no longer uses the database included in the software. Now hosted on quay.io.
  • Bumped up base container (v2.0.2) to have openpyxl module.

v2.1.0 (02/11/2024)

Full Changelog

Implemented Enhancements:

  • Added handling for "unknown" assemblers in the scaffolds entry point so genomes can be downloaded from NCBI and run through PHoeNIx.
  • For entry points CDC_PHOENIX or PHOENIX you can now use the argument --create_ncbi_sheet to generate partially filled out Excel sheets for uploading to NCBI. You will still need to fill in some lab/sample specific information and review for accuracy, but this should speed up the process. As a reminder, please do not submit raw sequencing data to the CDC HAI-Seq BioProject (531911) that is auto-populated in these sheets unless you are a state public health laboratory, a CDC partner or have been directed to do so by DHQP. The BioProject accession IDs in these files are specifically designated for domestic HAI bacterial pathogen sequencing data, including from the Antimicrobial Resistance Laboratory Network (AR Lab Network), state public health labs, surveillance programs, and outbreaks. For inquiries about the appropriate BioProject location for your data, please contact [email protected].
  • New Terra workflow for combining Phoenix_Summary.tsv, GRiPHin_Summary.tsv and GRiPHin_Summary.xlsx of multiple runs into one file. This workflow will also combine the NCBI excel sheets created when using the --create_ncbi_sheet.
  • software_versions.yml now contains versions for all custom scripts used in the pipeline to streamline its validation process and align it with CLIA requirements, ensuring smoother compliance.
  • MultiQC now contains graphs and data from BBDuk, FastP, Quast and Kraken. BUSCO is also part of MultiQC if the entry point runs it (i.e. CDC_* entries).
  • AMRFinder+ species that are screened for point mutations were updated with Enterobacter asburiae, Vibrio vulnificus and Vibrio parahaemolyticus.
  • A check was added to ensure only SRR numbers are passed to -entry CDC_SRA and SRA.
  • After extensive QC cut-off review, additional warnings and minimum QC cut-offs were added:
    • Minimum PASS/FAIL:
      • > 500 scaffolds
      • FAIry (file integrity check) - see Fixed Bugs section below for details.
    • Warnings:
      • 200-500 scaffolds -> high, but not enough for failure
      • Taxa Quality Checks:
        • FastANI Coverage <90% and Match <95%
        • For entries BUSCO <97%
      • Contamination Checks:
        • <70% of reads/weighted scaffolds assigned to top genus hit.
        • Added weighted scaffold to kraken <30% unclassified check (was just on reads before)
        • Added weighted scaffold to kraken only 1 genera >25% of assigned check (was just reads before)
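
A few of the cut-offs above can be expressed as a small flagging function. This is a rough sketch under stated assumptions: the function `qc_flags`, its parameter names and the flag wording are hypothetical, and the 200 and 500 scaffold boundaries are assumed to fall in the warning range. Only the numeric thresholds come from the list above.

```python
def qc_flags(n_scaffolds, busco_pct=None, top_genus_pct=None):
    """Return a list of QC flags for one sample (illustrative only)."""
    flags = []
    if n_scaffolds > 500:
        flags.append("FAIL: >500 scaffolds")
    elif n_scaffolds >= 200:
        flags.append("WARNING: 200-500 scaffolds")
    # BUSCO is only available for entries that run it, so it is optional here.
    if busco_pct is not None and busco_pct < 97:
        flags.append("WARNING: BUSCO <97%")
    if top_genus_pct is not None and top_genus_pct < 70:
        flags.append("WARNING: <70% assigned to top genus hit")
    return flags


print(qc_flags(n_scaffolds=350, busco_pct=98.2, top_genus_pct=64.0))
```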

Output File Changes:

  • The default outdir phx produces was changed. If the user doesn't pass --outdir, the default was changed from results to phx_output. This was changed in response to feedback from compliance program, to avoid confusion regarding the difference between public health results (i.e. summary) and diagnostic results (i.e. report).
  • The phx_output/FAIry folder will contain a *_summaryline_failure.tsv file for any isolate where file corruption was detected.
  • *.tax file had the NCBI assigned taxID added after the : for easy lookup.

Fixed Bugs:

  • Updated tower.yml file to reflect file name changes in v2.0.2. This will enable nf-tower reports to properly show up. commit e1b2b91
  • GRiPHin_Summary.xlsx was highlighting coverage outside 40-100x despite --coverage setting, changes made to respect --coverage flag.
  • Added a fix to handle cases where auto select by the mlst script chooses the wrong taxonomy. PHoeNIx will force a rerun in cases where the taxonomy is known but the initial mlst is run against an incorrect scheme. Known instances found so far include: E. coli (Pasteur) being incorrectly identified as Aeromonas and E. coli (Pasteur) being identified as Klebsiella. The scoring in the MLST program was updated and can now cause lower count perfect hits (e.g. 6 of 6 Aeromonas genes at 100%) to be scored higher than novel correct hits (e.g. 7 of 8 at 100%, 1 novel gene).
  • Corrected an instance where, when an mlst scheme could not be determined, a proper output file was not created.
  • Fixed issue with MLST where certain characters in the filename would cause an array index out of bounds error.
  • Fixed issue where samples that failed SPAdes did not have --coverage parameter respected when generating synopsis file.
  • Fixed -entry CDC_SCAFFOLDS providing incorrect headers (missing BUSCO and BUSCO_DB).
  • Updated FAIry (file integrity check) to catch additional file integrity errors.
    • FAIry detects and reports when:
      • Corrupt fastq files prevent gzip and zcat from completing; a synopsis file is generated when needed.
      • The R1/R2 fastqs do not have an equal number of reads in the files.
      • There are no reads or scaffolds left after filtering and read trimming steps, respectively.
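
The first two kinds of check listed above (files that cannot be decompressed, and R1/R2 files with unequal read counts) can be illustrated in Python. FAIry itself is described in terms of gzip and zcat, so this is a stand-in sketch using the standard `gzip` module, not FAIry's implementation; `count_fastq_reads`, `check_pair` and `write_fastq` are hypothetical names.

```python
import gzip
import os
import tempfile
import zlib


def count_fastq_reads(path):
    """Return the read count of a gzipped FASTQ, or None if it cannot be decompressed."""
    try:
        with gzip.open(path, "rb") as handle:
            n_lines = sum(1 for _ in handle)
    except (OSError, EOFError, zlib.error):
        return None
    return n_lines // 4  # four lines per FASTQ record


def check_pair(r1_path, r2_path):
    """Classify a read pair as 'corrupt', 'unequal read counts' or 'ok'."""
    r1 = count_fastq_reads(r1_path)
    r2 = count_fastq_reads(r2_path)
    if r1 is None or r2 is None:
        return "corrupt"
    if r1 != r2:
        return "unequal read counts"
    return "ok"


def write_fastq(path, n_reads):
    """Write a tiny gzipped FASTQ with n_reads identical records (demo helper)."""
    with gzip.open(path, "wb") as handle:
        handle.write(b"@read\nACGT\n+\nIIII\n" * n_reads)


with tempfile.TemporaryDirectory() as tmp:
    r1 = os.path.join(tmp, "s_R1.fastq.gz")
    r2 = os.path.join(tmp, "s_R2.fastq.gz")
    write_fastq(r1, 3)
    write_fastq(r2, 2)
    print(check_pair(r1, r2))  # unequal read counts
```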

Container Updates:

Database Updates:

v3.1.0 (04/08/2024)

Implemented Enhancements

  • refactors filtering failed samples for fairy
  • refactors ICA handling, terra handling
  • adds param flags in nextflow.config
    • execution-based
      • run_busco
      • ncbi_excel_creation
      • extended_qc
      • run_srst2_mlst
      • run_griphin
    • feature-based
      • save_trimmed_fail
      • save_merged
      • save_output_fastqs
      • save_reads_assignment
  • moves parameter checks upstream to main.nf
    • ICA
    • TERRA