"Genomic Stability and Genetic Defense Systems in Dolosigranulum pigrum a Candidate Beneficial Bacterium from the Human Microbiome" Supplemental Methods
PPanGGOLiN v1.1.141
(Gautreau et al. 2020; Bazin et al.
2020) was
installed in a
Python environment called PPanGGOLiN.
In order to import the Anvi’o clustering into PPanGGOLiN we needed:
- A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files.
- The annotated genomes with gene IDs that match the ones listed in the previous .tsv file.
Using the anvi-summarize
output
analysis_Anvio7/Pangenomic_Results_Dpig/PAN_DPIG_prokka_gene_clusters_summary.txt.gz
described before we can create the .tsv file:
Dpig_Anvio7 <- read_delim("analysis_Anvio7/Pangenomic_Results_Dpig/PAN_DPIG_prokka_gene_clusters_summary.txt.gz", "\t")
Dpig_Anvio7 <- Dpig_Anvio7 %>%
unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE)
Clusters_Dpig_Anvio7 <- select(Dpig_Anvio7, c(gene_cluster_id, new_id))
write_delim(Clusters_Dpig_Anvio7, "analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv", col_names=FALSE)
IMPORTANT: The gene_callers_id provided by Anvi’o don’t match the
original ones in the Prokka annotation. The Prokka annotated genomes
were parsed into two text files, one for gene calls and one for
annotations, with the script gff_parser.py
. By default Prokka
annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py
will only utilize open reading frames reported by Prodigal in the Prokka
output in order to be compatible with the pangenomic Anvi’o pipeline.
While parsing new gene_callers_id are generated only for the ORFs that
will be imported into Anvi’o. Fortunately
anvi-get-sequences-for-gene-calls
can be used to export new .gff files
with only the ORFs and these can be used for PPanGGOLiN, instead of the
original Prokka .gff files:
#conda activate anvio-7
mkdir -p "analysis_Anvio7/Exported_gffs"
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_o="analysis_Anvio7/Exported_gffs"
for file in $path_f/*.db; do
FILENAME=`basename ${file%.*}`
anvi-get-sequences-for-gene-calls -c $file --export-gff3 \
-o $path_o/Anvio7_$FILENAME.gff
done
We created lists with the file names for both the Anvi’o exported .gff
(analysis_PPanGGOLiN_Anvio7/Anvio7GenomesExported.gff.txt
) and the
renamed .fasta files
(analysis_PPanGGOLiN_Anvio7/Anvio7GenomesReformatted.fa.txt
) and use
them to run the annotate
PPanGGOLiN subcommand. Then, the cluster
PPanGGOLiN subcommand was run using the .tsv file created before from
the anvi-summarize
output.
#conda activate PPanGGOLiN
ppanggolin annotate --anno analysis_PPanGGOLiN_Anvio7/Anvio7GenomesExported.gff.txt --fasta analysis_PPanGGOLiN_Anvio7/Anvio7GenomesReformatted.fa.txt -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --basename FromAnvio7
ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv --infer_singletons
The PPanGGOLiNgraph
subcommand has only a single other option, which
is ‘-r’ or ‘–remove_high_copy_number.’ If used, it will remove the
gene families that are too duplicated in your genomes. This is useful if
you want to visualize your pangenome afterward and want to remove the
biggest hubs to have a clearer view. It can also be used to limit the
influence of very duplicated genes such as transposase or ABC
transporters in the partition step.
We ran the PPanGGOLiN graph
and partition
subcommands with default
parameters.
#conda activate PPanGGOLiN
ppanggolin graph -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5
ppanggolin partition -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5
For details on PPanGGOLiN outputs see: https://github.com/labgem/PPanGGOLiN/wiki/Outputs
#conda activate PPanGGOLiN
ppanggolin write -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --light_gexf --gexf --csv --Rtab --stats --partitions --projection --families_tsv -f
For details on RGPs see: https://github.com/labgem/PPanGGOLiN/wiki/Regions-of-Genome-Plasticity
#conda activate PPanGGOLiN
ppanggolin rgp -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5
ppanggolin spot -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --label_priority ID --draw_hotspots -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/spots_ID -f
ppanggolin write -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --regions --spots -f
The PPanGGOLiN info
subcommand indicates, for each steps of the
analysis, the PPanGGOLiN parameters that were used and the source of the
data if appropriate.
#conda activate PPanGGOLiN
ppanggolin info -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --parameters
The output for this command was:
-
annotation
read_annotations_from_file
: True
-
cluster
read_clustering_from_file
: Trueinfer_singletons
: True
-
graph
removed_high_copy_number_families
: False
-
partition
beta
: 2.5free_dispersion
: Falsemax_node_degree_for_smoothing
: 10computed_K
: TrueK
: 3
-
RGP
persistent_penalty
: 3variable_gain
: 1min_length
: 3000min_score
: 4dup_margin
: 0.05
-
spots
set_size
: 3overlapping_match
: 2set_size
: 3overlapping_match
: 2exact_match
: 1
Bazin, Adelme, Guillaume Gautreau, Claudine Médigue, David Vallenet, and Alexandra Calteau. 2020. “panRGP: A Pangenome-Based Method to Predict Genomic Islands and Explore Their Diversity.” Bioinformatics 36 (Supplement_2): i651–58. https://doi.org/10.1093/bioinformatics/btaa792.
Gautreau, Guillaume, Adelme Bazin, Mathieu Gachet, Rémi Planel, Laura Burlot, Mathieu Dubois, Amandine Perrin, et al. 2020. “PPanGGOLiN: Depicting Microbial Diversity via a Partitioned Pangenome Graph.” Edited by Christos A. Ouzounis. PLOS Computational Biology 16 (3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732.