-
Notifications
You must be signed in to change notification settings - Fork 16
User Manual
This user manual is designed as a reference for users who have questions about mgefinder. It includes details about how to execute important commands, the parameters available, brief descriptions of how the command works, and descriptions of the output files.
We now describe each of the mgefinder commands.
This command strings together multiple mgefinder commands into a single snakemake workflow that can be parallelized. If you have multiple files that you want to analyze, this command is typically the best way to do so. If you want to learn more about snakemake, please read here.
The input to this command is a working directory that includes all of the necessary files in a specific directory structure. For a specific example, please see the tutorial. Here is the general directory structure:
test_workdir/
├── 00.assembly/
│ ├── <sample1>.fna
│ ├── <sample2>.fna
│ ├── <sample3>.fna
│ └── ...
├── 00.bam/
│ ├── <sample1>.<genome>.bam
│ ├── <sample1>.<genome>.bam.bai
│ ├── <sample2>.<genome>.bam
│ ├── <sample2>.<genome>.bam.bai
│ ├── <sample3>.<genome>.bam
│ ├── <sample3>.<genome>.bam.bai
│ └── ...
└── 00.genome/
├── <genome>.fna
└── ...
Be sure to keep the file naming consistent with what is shown above, including directory names and suffixes. refers to the name of the genome to which the sample was aligned. Make sure that the only periods in your file names are the ones shown above.
The command can be run as:
mgefinder workflow denovo WORKDIR
to run the de novo discovery workflow, or:
mgefinder workflow database WORKDIR
To run it in database-only mode, when assemblies of the isolates are not available or you want to focus on just a few elements of interest.
You can also run it with a custom snakefile and configfile using:
mgefinder workflow custom WORKDIR
Additional parameters include:
--snakefile, -s TEXT
--configfile, -c TEXT
--cores, -t INTEGER
--memory, -m INTEGER
--unlock / --no-unlock
--rerun-incomplete / --no-rerun-incomplete
--keep-going / --no-keep-going
--sensitive / --original
The parameter --snakefile
takes a text string specifying the path to the Snakefile that you want to use. This is only available when running mgefinder workflow custom
.
The parameter --configfile
takes a text string specifying the path to the config file that you want to use. This is only available when running mgefinder workflow custom
.
The parameter --cores
takes an integer and specifies the number of available cores for parallelizing those commands that can be parallelized. The default is 1. We recommend using as many cores as you have available.
The parameter --memory
takes an integer and species the memory limit in megabytes. The default is 16000, you can change this amount to fit your machine's available memory.
The parameter --unlock
works the same way as the --unlock
parameter provided to Snakemake. It unlocks a working directory that was locked due to some error. This is off by default.
The parameter --rerun-incomplete
works the same as the Snakemake parameter. When this is selected, the rules used to generate files that are found to be incomplete will be rerun to produce complete files. This is off by default.
The parameter --keep-going
works the same as the Snakemake parameter. When this is selected, parallel jobs will continue as far as they can even if one of them fails. This is off by default.
The parameter --sensitive / --original
works is used to indicate the config file that is used when running the workflow. In --sensitive
mode, parameters are adjusted to make it easier to detect insertions that cause direct repeats that reach up to 50 bp in length. This will most likely result in false positives, and it could actually slightly reduce sensitivity for elements with small direct repeats. In --original
mode, the workflow runs with the original settings as they were specified and validated in the original study.
This command works by executing a snakemake workflow using the provided workflow directory. This includes calling several other mgefinder commands in succession to produce a complete analysis. To understand individual steps in the workflow, read their respective descriptions in this manual.
This workflow generates three more directories in the provided working directory:
01.mgefinder
02.database
03.results
The 01.mgefinder
directory contains intermediate MGEfinder files that may be of interest to individuals with
more advanced knowledge of how the tool works. This includes the output of the find
, pair
, and inferseq
commands.
The 02.database
directory contains a dynamically-constructed database of elements that were identified when running MGEfinder. These are necessary for the inferseq-database
commands. This file
may be of value to the user, depending on the use case.
Most of the files of value can be found in the 03.results
directory. They will be named according to the following format:
03.results/
└── <genome>/
├── 01.clusterseq.<genome>.tsv
├── 02.genotype.<genome>.tsv
├── 03.summarize.<genome>.clusters.tsv
├── 03.summarize.<genome>.groups.tsv
├── 04.makefasta.<genome>.all_seqs.fna
└── 04.makefasta.<genome>.repr_seqs.fna
The file 01.clusterseq.<genome>.tsv
a complete list of all inferred sequences
for all ten files. This file includes the sample
, pair_id
(refers to results of the pair
output files),
method
(sequence inference method, such as inferred_assembly_with_full_context
), loc
(the location
of where the sequence was found, either in the reference genome, assembly, or dynamically-constructed database),
the inferred_seq_length
, the seqid
(an identifier for each unique sequence), cluster
(an identifier for the
sequence cluster), group
(an identifier for the cluster group), and the inferred_seq
(nucleotide sequence of
the element).
The file 02.genotype.<genome>.tsv
includes insertion genotypes for all of the isolates. This
includes the sample
, the position of the insertion within the reference genome, the seqid
, cluster
, and
group
that indicates the identity of the inserted sequence. The final column indicates the conf
(confidence)
level of the insertion. This refers primarily to how the identity of the sequence was inferred. The highest confidence inferred sequence is IAwFC
, which indicates that the insertion was found in the expected location in the assembly. The degree of confidence that you should accept depends largely on the use case. If, for example,
you are performing a re-sequencing experiment, it may be sufficient to use the IDB
column, which includes
sequences inferred from the reference genome and/or the dynamically-constructed database.
The files 03.summarize.<genome>.clusters.tsv
and 03.summarize.<genome>.groups.tsv
provide summary statistics for each of the sequence clusters and groups, respectively. This can be used to perform further QC on the identified elements, and to stratify elements by their transposability. Some important columns include
num_unique_sites_all
, which refers to the number of unique sites in the reference genome where sequence cluster was found at least once in at least one isolate. and num_unique_sites_unambig
, which refers to the number of unique sites where the cluster is the unambiguously assigned element (often two clusters will be assigned to a single site when the exact element cluster cannot be identified). See the manual for more details on these output files.
The files 04.makefasta.<genome>.all_seqs.fna
and 04.makefasta.<genome>.repr_seqs.fna
include FASTA
files of all identified elements found in the genotype
and summarize
files. The 04.makefasta.<genome>.all_seqs.fna
includes all unique sequences identified (even if they differ by a single base pair), and 04.makefasta.<genome>.repr_seqs.fna
includes the representative sequence for each element cluster.
This is a very simple command provided for convenience. It just copies the default Snakefiles and config files for the workflow
commands to the current working directory. These files can then be edited as the user wishes, and then they can be used with the mgefinder workflow custom
command and the --snakefile
and --configfile
parameters. For example, if you want to run the mgefinder workflow custom
command but use different parameters for individual rules, you can edit the configfiles provided with the getworkflow
command.
To execute the command, simply type
mgefinder getworkflow
And the Snakefiles and config files will be copied to the current directory.
This command finds insertion sites and reconstructs the termini of inserted sequence.
The find
command takes a BAM file as input. YOU MUST USE
BWA MEM TO GENERATE THIS BAM FILE. We have found that other aligners, such as bowtie2, are not compatible with MGEfinder.
You can run the command as
mgefinder find BAMFILE
Additional parameters include:
--min_softclip_length, -minlen INTEGER
--min_softclip_count, -mincount INTEGER
--min_alignment_quality, -minq INTEGER
--min_alignment_inner_length, -minial INTEGER
--min_distance_to_mate, -mindist INTEGER
--min_softclip_ratio, -minratio FLOAT
--max_indel_ratio, -maxir FLOAT
--large_insertion_cutoff, -lins INTEGER
--min_count_consensus, -mcc INTEGER
--sample_id, -id TEXT
--output_file, -o TEXT
The parameter min_softclip_length
takes an integer. For a softclipped site to be considered, there must be at least one softclipped read of this length. The default used is 8.
The parameter min_softclip_count
takes an integer. It requires that for a clipped site to be considered it must be supported by at least this number of reads. The default used is 2.
The parameter min_alignment_quality
takes an integer. This determines the minimum mapping quality of each alignment in order for it to be considered. The default used is 20.
The parameter min_alignment_inner_length
takes an integer. It filters out reads that are softclipped on both ends, and where the inner aligned portion is shorter than this length. Filtering out such reads removes a lot of potential false positives. The default is 21, if this default is changed to include insertions that create larger direct repeats, you should also increase the min_distance_to_mate
parameter to be min_alignment_inner_length
+ 1, the pair
command parameter min_alignment_inner_length
must match the find min_alignment_inner_length
parameter, and pair
command parameter max_direct_repeat_length
should equal min_alignment_inner_length
- 1. If none of your reads are able to span this distance when aligning to your reference genome, then all reads will be excluded.
The parameter min_distance_to_mate
takes an integer. It filters out clipped-sites that have no other candidate terminus pair nearby. In the pair
step, we only pair termini with each other if they are within 20 bp of each other. Filtering out sites without a candidate pair can dramatically speed up the analysis. However, if you want to keep all candidate insertion sites, you can increase this number to some very high integer. The default is 22.
The parameter min_softclip_ratio
takes a float between 0 and 1.0. It sets a minimum for the ratio of clipped reads to all reads at a given site. This filters out very small indels that could disrupt the terminus-pairing step. The default used is 0.15.
The parameter max_indel_ratio
takes a float between 0 and 1.0. For a softclipped site to be considered, the of small insertions/deletions (as defined by large_insertion_cutoff
) at this site must below this value. The default used is 0.03.
The parameter large_insertion_cutoff
takes an integer. Some reads are large enough to span both sides of an indel. For example, a 300 bp read may contain a 20 bp deletion relative to the reference. This parameter determines what should be considered an indel, and thus ignored, and what should be considered a large insertion and thus kept. The default is 30, meaning that is larger than 30 base pairs will be considered a large insertion.
The parameter min_count_consensus
takes an integer. When generating a consensus sequence for the insertion terminus, it requires that each site must have at least this many reads supporting it. The default used is 2.
The parameter sample_id
takes a text string. This is used to specify an ID to use for this sample. the default is the absolute path to the BAM file.
The parameter output_file
takes a text string. This is the final file where the find
results will be saved. The default is mgefinder.find.tsv
.
The find
algorithm works by identifying candidate insertion sites by searching for clipped-end sites in locally
aligned reads. To generate a consensus sequence of the candidate terminus, we use a trie-based approach intended to filter out spurious reads when building a consensus sequence. This also allows our algorithm to distinguish between multiple insertions at a single site, which may be observed in a metagenomic sample, for example (analysis of metagenomic samples is in development).
Briefly, we build a sequence Trie of all of the reads found clipped at a given site. We then generate all of the unique, complete sequences by traversing the trie. We then cluster these sequences together by sequence similarity, and process the different clusters separately.
We then generate a consensus sequence for a given insertion termines by traversing down the trie, choosing the base with the most support at each step. Evidence for a given base is calculated as the sum of all base-pair quality scores originally reported in the FASTQ file.
The consensus sequence ends when the number of reads supporting a position drops below --min_count_consensus
.
By default, find
will write to a file named mgefinder.find.tsv
(The name can be changed with the
--output_file
parameter). The columns of this file are described as follows:
-
sample
- A user-specified field used to identify the sample. -
terminus_id
- An arbitrary identifier used internally by mgefinder. -
contig
- The name of the contig where the terminus was identified. -
pos
- The exact base pair position of where the clipped-ends begin, the presumed start of the insertion. This location is 0-based with respect to the reference genome used. -
orient
- The orientation of the terminus.5p
refers to the 5' end of an insertion, which suggests this terminus runs from the 5' to the 3' direction with respect to the reference genome.3p
refers to the 3' end of an insertion, which suggests the terminus runs from the 3' to the 5' direction with respect to the reference genome. -
softclip_count_5p
- The total number of reads clipped from the 5' to 3' direction at this site. -
softclip_count_3p
- The total number of reads clipped from the 3' to 5' direction at this site. -
runthrough_count
- The total number of reads that run through the insertion site without being clipped. -
small_insertion_count_5p
- The total number of reads that include a small insertion from the 5' to 3' direction at this site. If this number is high, it indicates that the detected insertion is most likely a small indel.find
uses this to filter out false positives. -
small_insertion_count_3p
- Same as above, but for small insertions running from the 3' to the 5' direction. -
deletion_count
- The number of reads with a deletion at the insertion site, also used to filter out false positives. -
upstream_deletion_count
- The number of reads that contain a deletion one base pair upstream of this insertion site. Also useful for removing false positives -
downstream_deletion_count
- Same as above, but for reads with deletions one base pair downstream of this insertion site. -
total_count
- The sum of thesoftclip_count_5p
,softclip_count_3p
,runthrough_count
,small_insertion_count_5p
,small_insertion_count_3p
, anddeletion_count
columns. -
consensus_softclip_count
- The number of clipped reads at this site that specifically support the consensus terminus sequence on this line. This will be used as thesoftclip_count_5p
andsoftclip_count_3p
columns in the file generated by running thepair
command. -
consensus_seq
- The consensus sequence for the insertion terminus identified at this site.
This intermediate file is not particularly valuable to users on its own, and it is primarily intended to be used internally by mgefinder in the next step.
This command pairs insertion termini with each other to represent the 5' and 3' termini of a candidate insertion.
You can run the pair
command as
mgefinder pair FINDFILE BAMFILE REFERENCE_GENOME
Where FINDFILE
is the output of the find
command, BAMFILE
is the binary sequence alignment file used as
input for find
, and REFERENCE_GENOME
is the reference genome that BAMFILE
was aligned to.
Additional parameters include:
--max_direct_repeat_length, -maxdr INTEGER
--min_alignment_quality, -minq INTEGER
--min_alignment_inner_length, -minial INTEGER
--max_junction_spanning_prop, -maxjsp FLOAT
--large_insertion_cutoff, -lins INTEGER
--output_file, -o TEXT
The parameter max_direct_repeat_length
takes an integer. This specifies the maximum distance that oppositely-oriented insertion termini can be from each other in order to consider pairing them together. Since insertions often cause direct repeats at the insertion site, the location of the insertion termini are often separated by several base pairs. The default for this parameter is 20. This means that our algorithm by default will NOT identify true insertions that create direct repeats that exceed 20 base pairs. If you want to increase this parameter, you must also change the find
parameter min_distance_to_mate
to be min_alignment_inner_length
+ 1, the pair
command parameter min_alignment_inner_length
must match the find min_alignment_inner_length
parameter, and pair
command parameter max_direct_repeat_length
should equal min_alignment_inner_length
- 1.
The parameter min_alignment_quality
takes an integer. This determines the minimum mapping quality of each alignment in order for it to be considered. The default used is 20. This should be the same as the min_alignment_quality
parameter in find
.
The parameter min_alignment_inner_length
takes an integer. It filters out reads that are softclipped on both ends, and where the inner aligned portion is shorter than this length. Filtering out such reads removes a potential false positives. The default is 21, this parameter should not be changed, and it should be the same as the value of --min_alignment_inner_length
used in find.
If you want to increase this parameter, you must also change the find
parameter min_distance_to_mate
to be min_alignment_inner_length
+ 1, the pair
command parameter min_alignment_inner_length
must match the find min_alignment_inner_length
parameter, and pair
command parameter max_direct_repeat_length
should equal min_alignment_inner_length
- 1.
The parameter max_junction_spanning_prop
takes a decimal between 0 and 1. This is used to filter out low-confidence insertion pairs, or insertions that are occurring in duplicated regions. We expect that very few reads will fully span the insertion junction, as that would indicate that the insertion site does not contain the insertion somewhere in the isolate. If the number of reads spanning the insertion junction without being clipped exceeds this proportion of the total reads at the site, then it will be ignored. By default, this parameter is 0.15.
The parameter large_insertion_cutoff
takes an integer. Some reads are large enough to span both sides of an indel. For example, a 300 bp read may contain a 20 bp deletion relative to the reference. This parameter determines what should be considered an indel, and thus ignored, and what should be considered a large insertion and thus kept. The default is 30, meaning that is larger than 30 base pairs will be considered a large insertion.
The parameter output_file
takes a text string. This is the final file where the find
results will be saved. The default is mgefinder.pair.tsv
.
The pair
command uses a variety of techniques to pair termini with each other. It first filters insertion termini by the --max_direct_repeat_length
parameter. It then does pairwise comparisons between all nearby, oppositely-oriented insertion termini to determine if they share inverted repeats at their termini, a common feature of many prokaryotic insertion sequences. It then prioritizes all pairs by the following parameters, in order:
- The length of the inverted repeat identified, if any. Pairs with longer shared inverted repeats receive higher priority.
- The difference in the number of softclipped reads that support the insertion termini. Pairs that have a similar number of reads supporting each terminus are given higher priority.
- The difference in the length of the recovered insertion termini. If the two insertion termini have very similar lengths, they are given higher priority than pairs with disparate terminus lengths.
It then filters sites with too many junction-spanning reads (see max_junction_spanning_prop
), and infers the identity of the direct repeat created by the insertion, if any. We realize that these are somewhat arbitrary filters, and we welcome suggestions from the users on how this might be improved.
By default, pair
will write to a file named mgefinder.pair.tsv
(The name can be changed with the
--output_file
parameter). The columns of this file are described as follows:
-
sample
- A user-specified field used to identify the sample, carried over from the FINDFILE. -
pair_id
- An arbitrary identifier used to identify the candidate insertion. This is an important identifier that is used to relate the output of thepair
command to the output of the variousinferseq
commands. -
contig
- The name of the contig where the pair was identified. -
pos_5p
- The exact base pair position of where the clipped-ends of the 5' insertion terminus begin, the presumed start of the insertion. This location is 0-based with respect to the reference genome used. -
pos_3p
- The exact base pair position of where the clipped-ends of the 3' insertion terminus begin, the presumed end of the insertion. This location is 0-based with respect to the reference genome used. -
softclip_count_5p
- The number of clipped reads at this site that support the 5' insertion terminus. This is taken directly from theconsensus_softclip_count
of thefind
output file. -
softclip_count_3p
- The number of clipped reads at this site that support the 3' insertion terminus. This is also taken directly from theconsensus_softclip_count
of thefind
output file. -
total_count_5p
- The total number of reads found at the insertion site for the 5' terminus. -
total_count_3p
- The total number of reads found at the inseriton site for the 3' terminus. -
spanning_count
- The total reads that are found to span the 5' and 3' insertion sites by 10 bp in both directions. -
has_IR
- A boolean True/False indicating whether or not the terminus pair contains inverted repeats at their end, a common feature of many inserted elements. -
IR_length
- The length of the detected inverted repeat. -
IR_5p
- The inverted repeat in the 5' terminus. -
IR_3p
- The inverted repeat in the 3' terminus. -
seq_5p
- The consensus sequence of the 5' terminus. -
seq_3p
- The consensus sequence of the 3' terminus. -
direct_repeat_reference
- The sequence of the direct repeat created, using the reference genome to infer this direct repeat.
This file will likely have intrinsic value to the user, as they indicate candidate insertions. However, these candidates must be further investigated before any final conclusions about their identity can be made. In some cases, for example, these may actually represent genomic inversions, not insertions. To infer the identity of these insertions, several inference approaches are implemented and described in the following sections.
This command infers the identity of insertions by aligning the termini of candidate pairs to a reference genome.
The inferseq-reference
command takes the output of the pair
command and a reference genome as input.
While mgefinder will automatically index the reference genome if it is not already indexed, we recommend that you index the reference genome beforehand with the command:
bowtie2-build -o 0 -q INFERSEQ_REFERENCE INFERSEQ_REFERENCE
Where INFERSEQ_REFERENCE is the reference genome of interest.
BWA MEM should be used to generate this BAM file.
You can then run the command as
mgefinder inferseq-reference PAIRSFILE INFERSEQ_REFERENCE
Additional parameters include:
--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_inferseq_size, -maxsize INTEGER
--min_inferseq_size, -minsize INTEGER
--keep-intermediate/--no-keep-intermediate
--output_file, -o TEXT
The parameter min_perc_identity
takes an float between 0 and 1. When aligning candidate insertion termini to the
reference genome, it will only consider alignments that exceed this percentage identity with the reference. The default for this parameter is 0.95.
The parameter max_internal_softclip_prop
takes an float between 0 and 1. This is an additional candidate insertion terminus alignment filter. If the aligned termini are internally clipped by a proportion of their total length that exceeds this number, then the alignment is excluded. For example, imagine the alignment:
--------> <----------
terminus1 terminus2
Imagine the vertical bar |
indicating that one of the alignments is clipped at a specific site, such as:
--------> <---|-------
terminus1 terminus2
If the proportion of the clipped end of the terminus2 alignment exceeds 0.05 (by default), then this alignment will be excluded.
The parameter max_inferseq_size
excludes inferred sequences that are larger than max_inferseq_size
. The default for this parameter is 500 kilobase pairs, but we urge users to be cautious when working with very large insertions, as these may be false positives.
The parameter min_inferseq_size
excludes inferred sequences that are smaller than min_inferseq_size
. The default for this parameter is 1 base pair, but in reality mgefinder is not well suited to identify insertions that are smaller than the read length of the sequencing library.
The --keep-intermediate/--no-keep-intermediate
will determine whether or not the intermediate alignment file will be deleted or kept, which can be useful for debugging purposes.
A schematic of how this step is implemented is shown above, but more details are given here. First, if the reference genome is not already bowtie2 indexed, mgefinder will index the genome. Next, it will align the individual termini to the reference genome. It will then filter the alignments according to the min_perc_identity
parameter, and it will also filter alignments that are clipped to on their outer edge. It will then iterate through the sorted alignments and pair termini with each other. This approach will actually take the minimum non-overlapping alignment pairs, and ignore all other combinations of alignments, as described in the schematic below:
Next, candidate pairs will be filtered according to the max_internal_softclip_prop
parameter, and the
inferred sequence size filters (max_inferseq_size
and min_inferseq_size
). Finally using the sum of the alignment scores of both termini in each pair, we filter out all pairs that do not have the maximum alignment score. This leaves us with a set of final inferred sequences of equal quality.
By default, inferseq-reference
will write to a file named mgefinder.inferseq_reference.tsv
(The name can be changed with the --output_file
parameter). The columns of this file are described as follows:
-
sample
- A user-specified field used to identify the sample, carried over from the PAIRSFILE. -
pair_id
- An identifier used to map the inferred sequence in this file to the candidate insertion pairs described in thepair
file. That is, all lines in themgefinder.inferseq_reference.tsv
with the value20
in thepair_id
column represent an inferred sequence identified by theinferseq_reference
command that corresponds to the candidate insertion pair identified in thepair
file that is labeled with20
in thepair_id
column. -
method
- The method used to infer the sequence, which will always beinferred_reference
when running theinferseq-reference
command. -
loc
- The location of the inferred sequence as it was found in the reference genome. This is in the formatCONTIG:START-END
whereCONTIG
is the contig where the inferred sequence was located,START
is the base pair start of the inferred sequence, andEND
is the base pair end of the inferred sequence. -
inferred_seq_length
- The length of the inferred sequence. -
inferred_seq
- The identity of the inferred sequence in full.
This file is of value to the user, as it infers the full identity of the candidate insertions by referring to a reference genome. In certain situations this may be sufficient, such as when the sequenced isolate differs only slightly from the reference genome (like in a resequencing experiment). But if the inserted element may not exist in an available reference genome, alternative inference approaches may be of interest (see below).
It should also be noted that the reference genome used here does not necessarily have to be the reference genome that the sequencing reads were aligned to.
This command infers the identity of insertions by aligning the termini of candidate pairs to a sequence assembly of the isolate of interest. We recommend using SPAdes to assemble the genome of the isolate of interest.
The inferseq-assembly
command takes as input the output of the pair
command, the BAM file of the sequencing
reads aligned to the reference genome, an assembly file in FASTA format of the sequencing reads derived from the isolate of interest, and the reference genome used when initially aligning the isolates reads. In this step, it is imperative that the sequence assembly comes from the isolate of interest.
While mgefinder will automatically index the sequence assembly if it is not already indexed, we recommend that you index the assembly beforehand with the command:
bowtie2-build -o 0 -q INFERSEQ_ASSEMBLY INFERSEQ_ASSEMBLY
You can then run the command as
mgefinder inferseq-assembly PAIRSFILE BAMFILE INFERSEQ_ASSEMBLY INFERSEQ_REFERENCE
Additional parameters include:
--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_inferseq_size, -maxsize INTEGER
--min_inferseq_size, -minsize INTEGER
--keep-intermediate/--no-keep-intermediate
--output_file, -o TEXT
These additional parameters are identical in function to the parameters described in the
"inferseq-reference
: Input and parameters" section above. Please refer to this section for more details.
Details about how this step is implemented are shown in Figures a and b above. The sequence of steps is very similar to those described in the "inferseq-reference
: Description of implementation" section above, with important differences. Two alignments to the assembly are performed. In the first step, 25 base pairs of reference genome are appended to the beginning of the termini before alignment (See Figure a above). This allows us to determine if the insertion has assembled within the expected sequence context, which gives us greater confidence that the inferred identity of the insertion is of high quality. Then the termini are aligned without the appended context sequence to infer the identity of the insertion. Both of these types of inferred sequences are then reported in the output file.
By default, inferseq-assembly
will write to a file named mgefinder.inferseq_assembly.tsv
(The name can be changed with the --output_file
parameter). The columns of this file are the same as those found in the
"inferseq-reference
: Output file format" section above, but with the following differences:
-
method
- The method used to infer the sequence. With theinferseq-assembly
command, this will be one of three different values: 1)inferred_assembly_with_full_context
, meaning that the inserted was inferred from the sequence assembly with the 25 base pairs of context sequence appended on both sides, the highest quality inferred sequence. 2)inferred_assembly_with_half_context
means that the sequence was inferred in context, but one side of the inserted element was found at the end of a contig, meaning that the sequence context of one side is still ambiguous. 3)inferred_assembly_without_context
indicates the the sequence was inferred from the assembly, but without the additional sequence context. This is common when the inserted sequence is a large repeated element and it cannot be properly assembled. -
loc
- The location of the inferred sequence as it was found in the sequence assembly. This is in the formatCONTIG:START-END
whereCONTIG
is the contig where the inferred sequence was located,START
is the base pair start of the inferred sequence, andEND
is the base pair end of the inferred sequence.
This file is of value to the user, as it infers the full identity of the candidate insertions by referring to sequence assembly. This is of value when the sequenced isolate and the reference genome used are assumed to be quite different from each other. It is limited, however, by the quality of the genome assembly. Large mobile genetic elements that cannot be properly assembled will be missed.
In certain situations, the identity of the inserted sequence can be inferred by attempting to overlap the termini. For example:
terminus1 ----------------->
||||
<-------------------terminus2
In the situation depicted above, terminus1
and terminus2
overlap, and they can be merged into a single insertion.
This command performs this merging operation, if possible.
The inferseq-overlap
command takes as input only the output of the pair
command.
You can run the command as
mgefinder inferseq-overlap PAIRSFILE
Additional parameters include:
--min_overlap_score, -minscore INTEGER
--min_overlap_perc_identity, -minopi FLOAT
--minsize, -min_inferseq_size FLOAT
--output_file, -o TEXT
The parameter min_overlap_score
determines the minimum alignment score necessary for two termini to be merged.
Matched positions increase the score by 1, and mismatch positions decrease the score by 1.
The parameter min_overlap_perc_identity
is the minimum sequence identity between the overlapping portion of the two termini that is required for them to merge.
The parameter min_inferseq_size
is the minimum overlapped consensus sequence size that will be kept in the final output, by default the parameter is 30, and anything smaller than this size will be excluded from the results.
The merging procedure taken here is quite simple. Termini are scanned across each other one base pair at a time.
The overlapping portions are checked for matching base pairs, with matches increasing the overlap score by one, and mismatches decreasing the score by one. The parameters min_overlap_score
and min_overlap_perc_identity
are used to determine if the termini are merged into a single sequence.
By default, inferseq-overlap
will write to a file named mgefinder.inferseq_overlap.tsv
(The name can be changed with the --output_file
parameter). The columns of this file are the same as those found in the
"inferseq-reference
: Output file format" section above, but with the following differences:
-
method
- The method used to infer the sequence which will always beinferrred_overlap
when using theinferseq-overlap
command. -
loc
- The portions of terminus1 and terminus2 that are kept to form the final merged sequence. This is in the formseq_5p:START_5p-END_5p;seq_3p:START_3p-END_3p
.
This is valuable to the user primarily for filtering out insertions that are smaller in size, as this step is fundamentally limited by the read length of the library. Mustache is not well-suited to identify smaller insertions, and this step can be used to filter out these small insertions so they do not interfere with downstream analyses.
This command infers the identity of insertions by aligning the termini of candidate pairs to a database of query sequences. Ideally, this database is filtered to remove redundant sequences, which may dramatically reduce the amount of time required to run this step.
The inferseq-database
command takes as input the output of the pair
command, and a FASTA file of the query
sequences of interest.
While mgefinder will automatically index the query sequence database if it is not already indexed, we recommend that you index the database beforehand with the command:
bowtie2-build -o 0 -q INFERSEQ_DATABASE INFERSEQ_DATABASE
You can then run the command as:
mgefinder inferseq-database PAIRSFILE INFERSEQ_DATABASE
Additional parameters include:
--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_edge_distance, -maxedgedist INTEGER
--keep-intermediate / --no-keep-intermediate
--output_file, -o TEXT
The parameter min_perc_identity
takes an float between 0 and 1. When aligning candidate insertion termini to the
query sequence database, it will only consider alignments that exceed this percentage identity with the reference.
The default for this parameter is 0.90, which is more lenient than the cutoff used in the inferseq-reference
and
inferseq-assembly
commands.
The parameter max_internal_softclip_prop
takes an float between 0 and 1. This is an additional candidate insertion terminus alignment filter. If the aligned termini are internally clipped by a proportion of their total length that exceeds this number, then the alignment is excluded. See "inferseq-reference
: Input and parameters" for more details on how this parameter works.
The parameter max_edge_distance
takes an integer. This determines how close the aligned termini must be to the edge of the sequence in order to keep the alignments (See Figure d above). The default for this parameter is 10.
The --keep-intermediate/--no-keep-intermediate
will determine whether or not the intermediate alignment file will be deleted or kept, which can be useful for debugging purposes.
This sequence inference approach assumes that the inserted sequences of interest exist within the query database itself. It aligns terminus pairs to this database, and determines whether or not they align to a given a sequence. Aligned termini that do not meet the min_perc_identity
cutoff are removed, and only inferred sequences where both terminus pairs align to the edge of the sequence of interest are kept, where the aligned termini are allowed to be within max_edge_distance
base pairs of the end of the query sequence.
By default, inferseq-database
will write to a file named mgefinder.inferseq_database.tsv
(The name can be changed with the --output_file
parameter). The columns of this file are the same as those found in the
"inferseq-reference
: Output file format" section above, but with the following differences:
-
method
- The method used to infer the sequence which will always beinferrred_daatabase
when using theinferseq-database
command.
All sequences in the database that are matched equal matches for a given terminus pair will be returned.
This is a valuable approach to taken when information about the inserted elements is already known beforehand. This database can include several insertion sequences that are relevant to the species under study, for example. It is quite sensitive, provided that read coverage is high enough (we recommend above 40x) and the inserted sequence of interest exists in the query database.
This command is intended to prepare SAM files for analysis by mgefinder as input. While this step is not strictly
necessary under all circumstances, some features of mgefinder require it (such as the experimental extendpairs
command). We recommend that all SAM/BAM files are processed using this command prior to further analysis if it is
computationally feasible.
This command requires only the alignment file SAM/BAM of interest. This is a file thatyou wish to analyze using the find
command in mgefinder. You can run this command as follows:
mgefinder formatbam IN_SAM OUT_BAM
Where IN_SAM
is the input SAM/BAM file of interest, and OUT_BAM
is the name of the output file.
Additional parameters include --single-end
, which is used when the file is a single-end and not paired-end, and
--keep-tmp-files
which will keep intermediate files without rather than deleting them.
This command first removes secondary alignments from the alignment file. It then formats the file for use by mgefinder, in particular it stores information about each reads mate so that it is immediately accessible, a step that is necessary when using the experimental extendpairs
command. It will then sort and index the BAM files.
The output of this command is just a BAM file that is ready to be analyzed by mgefinder.
The recall
command is used to collect information about potential insertion sites from a BAM file. For example,
if you have many insertion sites of interest, you may want to determine if these sites were actually deleted in the reference genome of some of the isolates. This can be useful when comparing isolates to each other. This command was used in our own GWAS of antibiotic resistance in E. coli to determine if some of the insertion sites were deleted in some of the isolates. This can allow you to then filter out these unreliable insertion sites.
The recall
command takes as input the output of the pair
command, and the original BAM
file of the reads
aligned to the reference genome. The pair
output file actually only requires three columns:
contig
, pos_5p
, and pos_3p
. The recall
command will analyze the sites specified in these columns.
This command can be executed as:
mgefinder recall PAIRSFILE BAMFILE
Additional parameters include:
--min_alignment_quality, -minq
--min_alignment_inner_length, -minial
Which should be the same as the values used when originally running the find
command.
By default min_alignment_quality
is 20 and min_alignment_inner_length
is 21.
Recall works very similarly to the find algorithm, but it does not attempt to reconstruct consensus termini.
It's only goal is to gather read information about the sites present in the PAIRSFILE
. This can then be used in other analyses.
The output format is quite similar to the output of the find
command. Each row is a specific site (note that
it splits each pair into its individual insertion sites). It includes columns about read counts, such as
softclip_count_5p
, softclip_count_3p
, runthrough_count
, small_insertion_count_5p
, small_insertion_count_3p
,
deletion_count
, upstream_deletion_count
, downstream_deletion_count
, and total_count
. See the
section "find
: Output file format" for more details on what each of these columns means.
- IMPORTANT UPDATES
- Special note on site-specific integrative mobile elements
- Publication & Presentation
- Questions / Comments
- Installation Method 1: Installing MGEfinder from conda environment file
- Installation Method 2: MGEfinder Singularity Container