---
title: "Supplemental Methods: Prokka Annotations"
author:
- Tommy Tran, (KLemonLab) [email protected]
- Isabel Escapa, (KLemonLab) [email protected]
output:
  rmdformats::robobook:
    use_bookdown: true
    code_folding: show
bibliography: references.bib
csl: references_style.csl
---
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(c('r', 'bash'))
```
# Custom Prokka Annotations
We used [Prokka v1.14.6](https://github.com/tseemann/prokka) [@Seemann2014] to annotate the 107 *Corynebacterium* strain genomes described in Table S1-A and the 28 *Dolosigranulum pigrum* genomes from our previous manuscript [@floresramos2021]. All 135 genomes are listed in `CorPGA_AnnotationProkka_GenomeList_v01a.csv` and included in `data/genomes` as `.fasta` files.
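As a quick sanity check (not part of the original workflow), the number of `.fasta` files in `data/genomes` can be compared against the number of entries in the genome list; this sketch assumes the CSV lives under `data/genome_lists/` as in the annotation loop below and has a single header row.
```{bash, eval=FALSE}
# Optional check (assumes one header row in the CSV): both counts should be 135
ls data/genomes/*.f* | wc -l
tail -n +2 data/genome_lists/CorPGA_AnnotationProkka_GenomeList_v01a.csv | wc -l
```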
We annotated the genomes in two different ways to ensure compatibility and proper strain labeling with both GET_HOMOLOGUES and anvi'o.
## Prokka annotations for GET_HOMOLOGUES
This step annotates all the `.fasta` files in the selected input folder (`path_i`) and places the annotated output files in the output folder (`path_o`). Output file headers are updated with `--genus 'Corynebacterium'`, `--species 'sp'`, and `--strain` set from the file name. We used default parameters, including gene recognition and translation initiation site identification with Prodigal [@Hyatt2010].
```{bash, eval=FALSE}
#conda activate Prokka
path_i="data/genomes"
path_o="data/GET_HOMOLOGUES/Prokka_out"
mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
  FILENAME=$(basename "${file%.*}")
  prokka --prefix "$FILENAME" --outdir "$path_o" --genus 'Corynebacterium' --species 'sp' --strain "$FILENAME" --centre X --compliant --cpus 0 --force "$file"
done
```
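As an optional check (not part of the original workflow), one can confirm that Prokka produced one GFF3 file per input genome and skim the per-strain `.txt` summaries it writes:
```{bash, eval=FALSE}
# Optional: count annotated genomes and skim the per-strain summaries
ls data/GET_HOMOLOGUES/Prokka_out/*.gff | wc -l   # should match the number of input genomes
cat data/GET_HOMOLOGUES/Prokka_out/*.txt          # feature counts (CDS, tRNA, etc.) per strain
```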
## Prokka annotation for anvi'o
More information about importing Prokka annotations into anvi'o can be found here: <https://merenlab.org/2017/05/18/working-with-prokka/#note-for-the-pangenomics-workflow>
### Fasta reformatting
Before the annotation step with Prokka we need to reformat the `.fasta` files using `anvi-script-reformat-fasta`. This script creates `.fasta` files with simplified deflines; using `--seq-type NT` also prevents downstream errors with "characters that are not any of A, C, T, G, N, a, c, t, g, n."
```{bash, eval=FALSE}
#conda activate anvio-dev
path_i="data/genomes"
path_o="data/Anvio8/Reformatted"
mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
  FILENAME=$(basename "${file%.*}")
  anvi-script-reformat-fasta -o "$path_o/$FILENAME.fa" --min-len 0 --simplify-names "$file" --seq-type NT
done
```
### Prokka annotation
This step repeats the Prokka annotation using the anvi'o-reformatted `.fasta` files.
Output file headers are updated with `--genus`, `--species`, and `--strain` based on the information in the genome list `.csv` file.
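The annotation loop below assumes a simple three-column layout, `name,genus,species`, with a header row; the row shown here is only a hypothetical illustration, not an actual entry from the list.
```{bash, eval=FALSE}
# Quick look at the expected CSV shape
head -n 2 data/genome_lists/CorPGA_AnnotationProkka_GenomeList_v01a.csv
# Illustrative layout only (not a real entry):
# name,genus,species
# StrainXYZ,Corynebacterium,sp
```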
```{bash, eval=FALSE}
#conda activate Prokka
csv_file="data/genome_lists/CorPGA_AnnotationProkka_GenomeList_v01a.csv"
path_i="data/Anvio8/Reformatted"
path_o="data/Anvio8/Prokka_out"
mkdir -p "$path_o"

while IFS=',' read -r name genus species; do
  if [[ "$name" != "name" ]]; then # Skip the header row
    prokka --prefix "$name" --outdir "$path_o" --genus "$genus" --species "$species" --strain "$name" --cpus 0 --force "$path_i/$name.fa"
  fi
done < "$csv_file"
```
### Parsing .gff files
This step parses the Prokka-annotated genomes so that the external Prodigal gene calls and the functional annotations can be imported into anvi'o independently. The input (`path_i`) is the annotation in GFF3 format and the outputs (`path_o`) are two tab-delimited text files per genome, one for gene calls (`calls_*.txt`) and one for annotations (`annot_*.txt`).
This is done with the script `gff_parser.py` described in this [tutorial](https://merenlab.org/2017/05/18/working-with-prokka/).
```{bash, eval = FALSE}
#conda activate gffutils
path_i="data/Anvio8/Prokka_out"
path_o="data/Anvio8/Parsed_prokka"
mkdir -p "$path_o"

for file in "$path_i"/*.gff; do
  FILENAME=$(basename "${file%.*}")
  python scripts/gff_parser.py "$file" \
    --gene-calls "$path_o/calls_$FILENAME.txt" \
    --annotation "$path_o/annot_$FILENAME.txt"
done
```
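As an optional check (not part of the original workflow), each `.gff` input should now have a matching pair of parsed files:
```{bash, eval=FALSE}
# Optional: gene-call and annotation file counts should match the number of .gff inputs
ls data/Anvio8/Parsed_prokka/calls_*.txt | wc -l
ls data/Anvio8/Parsed_prokka/annot_*.txt | wc -l
ls data/Anvio8/Prokka_out/*.gff | wc -l
```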
### Generating contigs databases
In this step the reformatted `.fa` files (`path_i`) and the external gene calls (`calls_*.txt`) from Prokka (`path_e`) are used to generate anvi'o contigs databases (`path_o`). Initially we got many errors about internal stop codons, so we added the `--ignore-internal-stop-codons` flag.
```{bash, eval = FALSE}
#conda activate anvio-dev
path_i="data/Anvio8/Reformatted"
path_e="data/Anvio8/Parsed_prokka"
path_o="data/Anvio8/Contigs_db"
mkdir -p "$path_o"

for file in "$path_i"/*.fa; do
  FILENAME=$(basename "${file%.*}")
  anvi-gen-contigs-database -f "$file" \
    -o "$path_o/$FILENAME.db" \
    --external-gene-calls "$path_e/calls_$FILENAME.txt" \
    --ignore-internal-stop-codons \
    -n "$FILENAME"
done
```
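To spot-check a single database (an optional step, not part of the original workflow), `anvi-db-info` prints a contigs database's self information, which should reflect the imported external gene calls; the strain name here is a placeholder.
```{bash, eval=FALSE}
#conda activate anvio-dev
# Optional spot check: <STRAIN> is a placeholder for any strain name in the folder
anvi-db-info data/Anvio8/Contigs_db/<STRAIN>.db
```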
### Importing Prokka functional annotation
Finally, the external functional annotations (`annot_*.txt`) from Prokka (`path_e`) are imported into the anvi'o contigs databases (`path_i`).
```{bash, eval = FALSE}
#conda activate anvio-dev
path_i="data/Anvio8/Contigs_db"
path_e="data/Anvio8/Parsed_prokka"

for file in "$path_i"/*.db; do
  FILENAME=$(basename "${file%.*}")
  anvi-import-functions -c "$file" \
    -i "$path_e/annot_$FILENAME.txt"
done
```
# NCBI annotations
The 9 genome assemblies listed in **Table S3D** (`CorPGA_AnnotationNCBI_GenomeList_v01a.csv`) and included in `data/genomes_NCBIAnnotation` were used as references only for comparisons in the KEGG metabolic analysis. These genomes were not part of the phylogenetic/pangenomic analysis, and we wanted to keep their original NCBI annotations in the anvi'o contigs databases.
They were downloaded from NCBI in `.gbff` format and processed as follows:
## Parsing .gbff files downloaded from NCBI
First we used `anvi-script-process-genbank` to parse the `.gbff` files (`path_i`) and output (`path_o`) the corresponding properly formatted `.fa` files, plus two tab-delimited text files per genome, one for gene calls (`calls_*.txt`) and one for annotations (`annot_*.txt`).
```{bash, eval = FALSE}
#conda activate anvio-dev
path_i="data/genomes_NCBIAnnotation"
path_o="data/Anvio8/Parsed_NCBI"
mkdir -p "$path_o"

for file in "$path_i"/*.gbff; do
  FILENAME=$(basename "${file%.*}")
  anvi-script-process-genbank -i "$file" \
    --output-gene-calls "$path_o/calls_$FILENAME.txt" \
    --output-functions "$path_o/annot_$FILENAME.txt" \
    --output-fasta "$path_o/$FILENAME.fa" \
    --annotation-source prodigal
done
```
## Fixing annotation files
The annotation files parsed from the original NCBI annotations list the annotation source as "prodigal", whereas the files from the genomes we annotated ourselves list the source as "Prodigal". Anvi'o cannot later cluster genomes with different annotation sources together, so the name must be consistent. Here we iterate through all `*.txt` files in the folder and replace "prodigal" with "Prodigal".
```{bash, eval = FALSE}
path_i="data/Anvio8/Parsed_NCBI"

for file in "$path_i"/*.txt; do
  if [ -f "$file" ]; then
    # Create a temporary file for the updated content
    tmp_file=$(mktemp)
    # Perform the text replacement and save it to the temporary file
    sed 's/prodigal/Prodigal/g' "$file" > "$tmp_file"
    # Replace the original file with the temporary file
    mv "$tmp_file" "$file"
    echo "Text replacement completed for $file"
  fi
done
```
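An optional follow-up check (not part of the original workflow): since `grep` is case-sensitive by default, a search for the lowercase spelling should now return nothing.
```{bash, eval=FALSE}
# Optional: no file names should be printed if the replacement succeeded
grep -l 'prodigal' data/Anvio8/Parsed_NCBI/*.txt || echo "No lowercase 'prodigal' left"
```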
## Generating contigs databases
In this step the `.fa` files and the gene calls from NCBI (`path_i`) are used to generate anvi'o contigs databases (`path_o`).
```{bash, eval = FALSE}
#conda activate anvio-dev
path_i="data/Anvio8/Parsed_NCBI"
path_o="data/Anvio8/Contigs_db"
mkdir -p "$path_o"

for file in "$path_i"/*.fa; do
  FILENAME=$(basename "${file%.*}")
  anvi-gen-contigs-database -f "$file" \
    -o "$path_o/$FILENAME.db" \
    --external-gene-calls "$path_i/calls_$FILENAME.txt" \
    -n "$FILENAME"
done
```
## Importing NCBI functional annotation
Then the external functional annotations (`annot_*.txt`) from NCBI (`path_e`) are imported into the anvi'o contigs databases (`path_i`).
```{bash, eval = FALSE}
#conda activate anvio-dev
path_i="data/Anvio8/Contigs_db"
path_e="data/Anvio8/Parsed_NCBI"

for file in "$path_i"/*.db; do
  FILENAME=$(basename "${file%.*}")
  anvi-import-functions -c "$file" \
    -i "$path_e/annot_$FILENAME.txt"
done
```
# References {.unnumbered}
::: {#refs}
:::
<br>
<img src="images/Department-of-Molecular-Virology-&-Microbiologyy-Horz-GRAY.png" align="left" width="240" height="110"/>
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------