This repository contains training and testing sets for the plASgraph2 tool.
Each set consists of:
- A `.csv` file listing all assemblies (e.g. `eskapee-test.csv`) with columns: path to the `gfa.gz` file, path to the `gfa.csv` file, and an identifier of the sample.
- A folder with files for each assembly (both files are described in more detail below):
  - The `gfa.gz` file is a GFA file with the assembly graph, compressed by gzip.
  - The `gfa.csv` file contains the correct classification of each contig.
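The layout above can be read with a few lines of Python. This is a minimal sketch, not part of plASgraph2: the function names `read_assembly_list` and `read_gfa_segments` are hypothetical, and the sketch assumes the list file has no header row and keeps the three columns in the order given above.

```python
import csv
import gzip
import io


def read_assembly_list(csv_text):
    """Parse rows of (gfa.gz path, gfa.csv path, sample id).

    Assumption: no header row, columns in the order listed in this README.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        gfa_path, labels_path, sample_id = (field.strip() for field in row[:3])
        rows.append({"gfa": gfa_path, "labels": labels_path, "sample": sample_id})
    return rows


def read_gfa_segments(gfa_gz_path):
    """Return {segment name: sequence} from the S lines of a gzipped GFA file."""
    segments = {}
    with gzip.open(gfa_gz_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S" and len(fields) >= 3:
                segments[fields[1]] = fields[2]
    return segments
```

If the list files turn out to have a header row, `csv.DictReader` would be the more natural choice.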
plASgraph2 was developed with GFA files produced by Unicycler and SKESA. In principle, plASgraph2 should be usable with other assemblers that use the GFA format. However, one of the features is the read coverage of a node, which is currently obtained from GFA files as follows:
- If nodes have the `dp` tag, containing the normalized read depth computed by Unicycler, it is used as the read depth of the node.
- If nodes have the `KC` tag, containing the k-mer count reported by SKESA, its value is divided by the length of the sequence corresponding to the node.
In both cases, the coverage is normalized by dividing it by the weighted mean of the coverages of all nodes. As a result, chromosome contigs are expected to have coverage close to 1. For Unicycler, however, this step does not change the values much because a similar normalization has already been applied.
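The coverage rule described above can be sketched as follows. This is an illustration under stated assumptions, not the plASgraph2 implementation: the function name `node_coverages` is hypothetical, and weighting the mean by node length is an assumption, since the text does not say which weights are used. The sketch also assumes that each S line stores its sequence inline rather than as `*` with an `LN` tag.

```python
def node_coverages(gfa_lines):
    """Return {node name: normalized coverage} from GFA segment (S) lines."""
    raw = {}
    lengths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        # Optional fields have the form TAG:TYPE:VALUE, e.g. dp:f:1.25 or KC:i:3000
        tags = {}
        for tag in fields[3:]:
            parts = tag.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        length = len(seq)
        if "dp" in tags:
            depth = float(tags["dp"])  # read depth as reported by Unicycler
        elif "KC" in tags and length > 0:
            depth = float(tags["KC"]) / length  # k-mer count per base (SKESA)
        else:
            continue  # node carries no usable coverage information
        raw[name] = depth
        lengths[name] = length

    total_length = sum(lengths.values())
    if total_length == 0:
        return {}
    # Assumption: the mean is weighted by node length.
    mean = sum(raw[n] * lengths[n] for n in raw) / total_length
    if mean == 0:
        return dict.fromkeys(raw, 0.0)
    return {n: raw[n] / mean for n in raw}
```

With length weighting, long chromosomal nodes dominate the mean, which is consistent with chromosome contigs ending up close to 1.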
For other assemblies, make sure that the GFA contains a `dp` or `KC` tag with a similar meaning. If the assembler does not provide this information, you can align the source reads to the assembly and label the nodes with read coverage obtained from the alignments.
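Once per-node coverage has been computed from alignments, it can be written back into the GFA as a `dp` tag. The sketch below is one possible way to do this: the function name `add_depth_tags` and the `depth_by_node` mapping are hypothetical, and the `dp:f:` form follows the general GFA `TAG:TYPE:VALUE` convention for optional fields. Computing the coverage values themselves (for example from an aligner's output) is outside the scope of this sketch.

```python
def add_depth_tags(gfa_lines, depth_by_node):
    """Return GFA lines with a dp:f: tag added to every S line found in depth_by_node.

    depth_by_node is a hypothetical mapping {node name: read depth}.
    """
    out = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3 and fields[1] in depth_by_node:
            # Drop any existing dp tag so the node ends up with exactly one.
            tags = [t for t in fields[3:] if not t.startswith("dp:")]
            tags.append(f"dp:f:{depth_by_node[fields[1]]:.4f}")
            fields = fields[:3] + tags
        out.append("\t".join(fields))
    return out
```

Lines other than the selected S lines are passed through unchanged, so links and paths in the graph are preserved.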
The CSV file with the correct classification should contain the following columns (plus any other optional columns):
- `contig`: contig id from the `gfa.gz` file
- `label`: one of the strings `chromosome`, `plasmid`, `ambiguous` or `unlabeled`. The label `ambiguous` means that the contig should be correctly classified as both a chromosome and a plasmid (e.g. a transposon present in both molecules of the given sample), and `unlabeled` means that the correct label is unknown, e.g. due to short contig size.
- `length`: the contig length (used for evaluation only, not needed for training)
- `chrom_score`: for the gold standard answer, this should be 1 for chromosome and ambiguous and 0 for plasmid and unlabeled (if omitted, the score will be filled in according to `label`). For predictions of actual tools, the score should be the confidence that the contig belongs to the chromosome class.
- `plasmid_score`: should be 1 for plasmid and ambiguous and 0 for chromosome and unlabeled; otherwise analogous to `chrom_score`.
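The mapping from labels to scores described in the column list can be sketched as below. This is an illustration only: the function name `fill_scores` is hypothetical, and the sketch assumes the classification CSV has a header row with the column names listed above.

```python
import csv
import io

# (chrom_score, plasmid_score) for each gold standard label
SCORES = {
    "chromosome": (1, 0),
    "plasmid": (0, 1),
    "ambiguous": (1, 1),
    "unlabeled": (0, 0),
}


def fill_scores(csv_text):
    """Return rows as dicts, filling missing scores from the label column."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        chrom, plasmid = SCORES[row["label"]]
        # Only fill a score when the column is absent or the cell is empty.
        if not row.get("chrom_score"):
            row["chrom_score"] = str(chrom)
        if not row.get("plasmid_score"):
            row["plasmid_score"] = str(plasmid)
        rows.append(row)
    return rows
```

Scores that are already present are left untouched, so the same function can be applied to files that contain tool predictions.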
ESKAPEE species are Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp. and Escherichia coli.
- `eskapee-train`: The training set of samples from ESKAPEE species (internally split into training and validation). It contains 70 samples, 10 samples from each species. For each sample there are 2 assemblies (Unicycler and SKESA).
- `eskapee-test`: The testing set of samples from ESKAPEE species, 112 samples in total (E.fae. 2, S.aur. 31, K.pne. 46, A.bau. 5, P.aer. 5, E.spp 15, E.col 8)
- `cfre-test`: Citrobacter freundii, 50 samples
- `efer-test`: Escherichia fergusonii, 50 samples
- `koxy-test`: Klebsiella oxytoca, 31 samples
- `myco-test`: mycobacteria (genera Mycobacterium, Mycobacteroides, Mycolicibacterium, Mycolicibacter), 30 samples
- `sent-test`: Salmonella enterica, 29 samples
`reference_genomes.csv` contains the list of chromosome references (roughly one per species used in the study) used in our paper to aid gold standard annotation of contigs in hybrid assemblies. The CSV file has two columns: accession number of the sequence and description of the sequence.

`all_samples.csv` contains the list of all samples in all our sets. Columns:
- `dataset`: which dataset uses this sample
- `our_id`: our sample id, which is used with the suffix `-s` or `-u` based on the assembler used
- `sample_id`: typically the NCBI/ENA/DDBJ accession of the bacterial sample, except for samples from Arredondo-Alonso et al. 2018, where their ID is used
- `short_reads`: accession of short reads in SRA. Missing in samples from Boostrom et al. 2022, where short reads were provided by the authors
- `long_reads`: accession of long reads in SRA. Missing in samples from Arredondo-Alonso et al. 2018, where long reads were provided by the authors at figshare
- `long_reads_type`: Nanopore or PacBio
- `reference`: source of the sample
- `species`: bacterial species
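As a quick sanity check, the number of samples per dataset in `all_samples.csv` can be tallied and compared with the counts listed in this README. The sketch below assumes the file has a header row containing a `dataset` column; the function name `samples_per_dataset` is hypothetical.

```python
import csv
import io
from collections import Counter


def samples_per_dataset(csv_text):
    """Count rows per value of the dataset column (header row assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["dataset"] for row in reader)
```

For example, reading the file with `open("all_samples.csv").read()` and passing the text to `samples_per_dataset` returns a `Counter` keyed by dataset name.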