Skip to content

User Manual

Matthew Durrant edited this page Dec 19, 2023 · 27 revisions

This user manual is designed as a reference for users who have questions about mgefinder. It includes details about how to execute important commands, the parameters available, brief descriptions of how the command works, and descriptions of the output files.

MGEfinder commands

We now describe each of the mgefinder commands.

workflow

This command strings together multiple mgefinder commands into a single snakemake workflow that can be parallelized. If you have multiple files that you want to analyze, this command is typically the best way to do so. If you want to learn more about snakemake, please read here.

workflow: Input and parameters

The input to this command is a working directory that includes all of the necessary files in a specific directory structure. For a specific example, please see the tutorial. Here is the general directory structure:

test_workdir/
├── 00.assembly/
│   ├── <sample1>.fna
│   ├── <sample2>.fna
│   ├── <sample3>.fna
│   └── ...
├── 00.bam/
│   ├── <sample1>.<genome>.bam
│   ├── <sample1>.<genome>.bam.bai
│   ├── <sample2>.<genome>.bam
│   ├── <sample2>.<genome>.bam.bai
│   ├── <sample3>.<genome>.bam
│   ├── <sample3>.<genome>.bam.bai
│   └── ...
└── 00.genome/
    ├── <genome>.fna
    └── ...

Be sure to keep the file naming consistent with what is shown above, including directory names and suffixes. refers to the name of the genome to which the sample was aligned. Make sure that the only periods in your file names are the ones shown above.

The command can be run as:

mgefinder workflow denovo WORKDIR

to run the de novo discovery workflow, or:

mgefinder workflow database WORKDIR

To run it in database-only mode, when assemblies of the isolates are not available or you want to focus on just a few elements of interest.

You can also run it with a custom snakefile and configfile using:

mgefinder workflow custom WORKDIR

Additional parameters include:

--snakefile, -s TEXT
--configfile, -c TEXT
--cores, -t INTEGER
--memory, -m INTEGER
--unlock / --no-unlock
--rerun-incomplete / --no-rerun-incomplete
--keep-going / --no-keep-going
--sensitive / --original

The parameter --snakefile takes a text string specifying the path to the Snakefile that you want to use. This is only available when running mgefinder workflow custom.

The parameter --configfile takes a text string specifying the path to the config file that you want to use. This is only available when running mgefinder workflow custom.

The parameter --cores takes an integer and specifies the number of available cores for parallelizing those commands that can be parallelized. The default is 1. We recommend using as many cores as you have available.

The parameter --memory takes an integer and species the memory limit in megabytes. The default is 16000, you can change this amount to fit your machine's available memory.

The parameter --unlock works the same way as the --unlock parameter provided to Snakemake. It unlocks a working directory that was locked due to some error. This is off by default.

The parameter --rerun-incomplete works the same as the Snakemake parameter. When this is selected, the rules used to generate files that are found to be incomplete will be rerun to produce complete files. This is off by default.

The parameter --keep-going works the same as the Snakemake parameter. When this is selected, parallel jobs will continue as far as they can even if one of them fails. This is off by default.

The parameter --sensitive / --original works is used to indicate the config file that is used when running the workflow. In --sensitive mode, parameters are adjusted to make it easier to detect insertions that cause direct repeats that reach up to 50 bp in length. This will most likely result in false positives, and it could actually slightly reduce sensitivity for elements with small direct repeats. In --original mode, the workflow runs with the original settings as they were specified and validated in the original study.

workflow: Description of implementation

This command works by executing a snakemake workflow using the provided workflow directory. This includes calling several other mgefinder commands in succession to produce a complete analysis. To understand individual steps in the workflow, read their respective descriptions in this manual.

workflow: Output file format

This workflow generates three more directories in the provided working directory:

01.mgefinder
02.database
03.results

The 01.mgefinder directory contains intermediate MGEfinder files that may be of interest to individuals with more advanced knowledge of how the tool works. This includes the output of the find, pair, and inferseq commands.

The 02.database directory contains a dynamically-constructed database of elements that were identified when running MGEfinder. These are necessary for the inferseq-database commands. This file may be of value to the user, depending on the use case.

Most of the files of value can be found in the 03.results directory. They will be named according to the following format:

03.results/
    └── <genome>/
        ├── 01.clusterseq.<genome>.tsv
        ├── 02.genotype.<genome>.tsv
        ├── 03.summarize.<genome>.clusters.tsv
        ├── 03.summarize.<genome>.groups.tsv
        ├── 04.makefasta.<genome>.all_seqs.fna
        └── 04.makefasta.<genome>.repr_seqs.fna

The file 01.clusterseq.<genome>.tsv a complete list of all inferred sequences for all ten files. This file includes the sample, pair_id (refers to results of the pair output files), method (sequence inference method, such as inferred_assembly_with_full_context), loc (the location of where the sequence was found, either in the reference genome, assembly, or dynamically-constructed database), the inferred_seq_length, the seqid (an identifier for each unique sequence), cluster (an identifier for the sequence cluster), group (an identifier for the cluster group), and the inferred_seq (nucleotide sequence of the element).

The file 02.genotype.<genome>.tsv includes insertion genotypes for all of the isolates. This includes the sample, the position of the insertion within the reference genome, the seqid, cluster, and group that indicates the identity of the inserted sequence. The final column indicates the conf (confidence) level of the insertion. This refers primarily to how the identity of the sequence was inferred. The highest confidence inferred sequence is IAwFC, which indicates that the insertion was found in the expected location in the assembly. The degree of confidence that you should accept depends largely on the use case. If, for example, you are performing a re-sequencing experiment, it may be sufficient to use the IDB column, which includes sequences inferred from the reference genome and/or the dynamically-constructed database.

The files 03.summarize.<genome>.clusters.tsv and 03.summarize.<genome>.groups.tsv provide summary statistics for each of the sequence clusters and groups, respectively. This can be used to perform further QC on the identified elements, and to stratify elements by their transposability. Some important columns include num_unique_sites_all, which refers to the number of unique sites in the reference genome where sequence cluster was found at least once in at least one isolate. and num_unique_sites_unambig, which refers to the number of unique sites where the cluster is the unambiguously assigned element (often two clusters will be assigned to a single site when the exact element cluster cannot be identified). See the manual for more details on these output files.

The files 04.makefasta.<genome>.all_seqs.fna and 04.makefasta.<genome>.repr_seqs.fna include FASTA files of all identified elements found in the genotype and summarize files. The 04.makefasta.<genome>.all_seqs.fna includes all unique sequences identified (even if they differ by a single base pair), and 04.makefasta.<genome>.repr_seqs.fna includes the representative sequence for each element cluster.

getworkflow

This is a very simple command provided for convenience. It just copies the default Snakefiles and config files for the workflow commands to the current working directory. These files can then be edited as the user wishes, and then they can be used with the mgefinder workflow custom command and the --snakefile and --configfile parameters. For example, if you want to run the mgefinder workflow custom command but use different parameters for individual rules, you can edit the configfiles provided with the getworkflow command.

To execute the command, simply type

mgefinder getworkflow

And the Snakefiles and config files will be copied to the current directory.

find

find image

This command finds insertion sites and reconstructs the termini of inserted sequence.

find: Input and parameters

The find command takes a BAM file as input. YOU MUST USE BWA MEM TO GENERATE THIS BAM FILE. We have found that other aligners, such as bowtie2, are not compatible with MGEfinder.

You can run the command as

mgefinder find BAMFILE

Additional parameters include:

--min_softclip_length, -minlen INTEGER
--min_softclip_count, -mincount INTEGER
--min_alignment_quality, -minq INTEGER
--min_alignment_inner_length, -minial INTEGER
--min_distance_to_mate, -mindist INTEGER
--min_softclip_ratio, -minratio FLOAT
--max_indel_ratio, -maxir FLOAT
--large_insertion_cutoff, -lins INTEGER
--min_count_consensus, -mcc INTEGER
--sample_id, -id TEXT
--output_file, -o TEXT

The parameter min_softclip_length takes an integer. For a softclipped site to be considered, there must be at least one softclipped read of this length. The default used is 8.

The parameter min_softclip_count takes an integer. It requires that for a clipped site to be considered it must be supported by at least this number of reads. The default used is 2.

The parameter min_alignment_quality takes an integer. This determines the minimum mapping quality of each alignment in order for it to be considered. The default used is 20.

The parameter min_alignment_inner_length takes an integer. It filters out reads that are softclipped on both ends, and where the inner aligned portion is shorter than this length. Filtering out such reads removes a lot of potential false positives. The default is 21, if this default is changed to include insertions that create larger direct repeats, you should also increase the min_distance_to_mate parameter to be min_alignment_inner_length + 1, the pair command parameter min_alignment_inner_length must match the find min_alignment_inner_length parameter, and pair command parameter max_direct_repeat_length should equal min_alignment_inner_length - 1. If none of your reads are able to span this distance when aligning to your reference genome, then all reads will be excluded.

The parameter min_distance_to_mate takes an integer. It filters out clipped-sites that have no other candidate terminus pair nearby. In the pair step, we only pair termini with each other if they are within 20 bp of each other. Filtering out sites without a candidate pair can dramatically speed up the analysis. However, if you want to keep all candidate insertion sites, you can increase this number to some very high integer. The default is 22.

The parameter min_softclip_ratio takes a float between 0 and 1.0. It sets a minimum for the ratio of clipped reads to all reads at a given site. This filters out very small indels that could disrupt the terminus-pairing step. The default used is 0.15.

The parameter max_indel_ratio takes a float between 0 and 1.0. For a softclipped site to be considered, the of small insertions/deletions (as defined by large_insertion_cutoff) at this site must below this value. The default used is 0.03.

The parameter large_insertion_cutoff takes an integer. Some reads are large enough to span both sides of an indel. For example, a 300 bp read may contain a 20 bp deletion relative to the reference. This parameter determines what should be considered an indel, and thus ignored, and what should be considered a large insertion and thus kept. The default is 30, meaning that is larger than 30 base pairs will be considered a large insertion.

The parameter min_count_consensus takes an integer. When generating a consensus sequence for the insertion terminus, it requires that each site must have at least this many reads supporting it. The default used is 2.

The parameter sample_id takes a text string. This is used to specify an ID to use for this sample. the default is the absolute path to the BAM file.

The parameter output_file takes a text string. This is the final file where the find results will be saved. The default is mgefinder.find.tsv.

find: Description of implementation

The find algorithm works by identifying candidate insertion sites by searching for clipped-end sites in locally aligned reads. To generate a consensus sequence of the candidate terminus, we use a trie-based approach intended to filter out spurious reads when building a consensus sequence. This also allows our algorithm to distinguish between multiple insertions at a single site, which may be observed in a metagenomic sample, for example (analysis of metagenomic samples is in development).

Briefly, we build a sequence Trie of all of the reads found clipped at a given site. We then generate all of the unique, complete sequences by traversing the trie. We then cluster these sequences together by sequence similarity, and process the different clusters separately.

We then generate a consensus sequence for a given insertion termines by traversing down the trie, choosing the base with the most support at each step. Evidence for a given base is calculated as the sum of all base-pair quality scores originally reported in the FASTQ file.

The consensus sequence ends when the number of reads supporting a position drops below --min_count_consensus.

find: Output file format

By default, find will write to a file named mgefinder.find.tsv (The name can be changed with the --output_file parameter). The columns of this file are described as follows:

  1. sample - A user-specified field used to identify the sample.
  2. terminus_id - An arbitrary identifier used internally by mgefinder.
  3. contig - The name of the contig where the terminus was identified.
  4. pos - The exact base pair position of where the clipped-ends begin, the presumed start of the insertion. This location is 0-based with respect to the reference genome used.
  5. orient - The orientation of the terminus. 5p refers to the 5' end of an insertion, which suggests this terminus runs from the 5' to the 3' direction with respect to the reference genome. 3p refers to the 3' end of an insertion, which suggests the terminus runs from the 3' to the 5' direction with respect to the reference genome.
  6. softclip_count_5p - The total number of reads clipped from the 5' to 3' direction at this site.
  7. softclip_count_3p - The total number of reads clipped from the 3' to 5' direction at this site.
  8. runthrough_count - The total number of reads that run through the insertion site without being clipped.
  9. small_insertion_count_5p - The total number of reads that include a small insertion from the 5' to 3' direction at this site. If this number is high, it indicates that the detected insertion is most likely a small indel. find uses this to filter out false positives.
  10. small_insertion_count_3p - Same as above, but for small insertions running from the 3' to the 5' direction.
  11. deletion_count - The number of reads with a deletion at the insertion site, also used to filter out false positives.
  12. upstream_deletion_count - The number of reads that contain a deletion one base pair upstream of this insertion site. Also useful for removing false positives
  13. downstream_deletion_count - Same as above, but for reads with deletions one base pair downstream of this insertion site.
  14. total_count - The sum of the softclip_count_5p, softclip_count_3p, runthrough_count, small_insertion_count_5p, small_insertion_count_3p, and deletion_count columns.
  15. consensus_softclip_count - The number of clipped reads at this site that specifically support the consensus terminus sequence on this line. This will be used as the softclip_count_5p and softclip_count_3p columns in the file generated by running the pair command.
  16. consensus_seq - The consensus sequence for the insertion terminus identified at this site.

This intermediate file is not particularly valuable to users on its own, and it is primarily intended to be used internally by mgefinder in the next step.

pair

This command pairs insertion termini with each other to represent the 5' and 3' termini of a candidate insertion.

pair: Input and parameters

You can run the pair command as

mgefinder pair FINDFILE BAMFILE REFERENCE_GENOME

Where FINDFILE is the output of the find command, BAMFILE is the binary sequence alignment file used as input for find, and REFERENCE_GENOME is the reference genome that BAMFILE was aligned to.

Additional parameters include:

--max_direct_repeat_length, -maxdr INTEGER
--min_alignment_quality, -minq INTEGER
--min_alignment_inner_length, -minial INTEGER
--max_junction_spanning_prop, -maxjsp FLOAT
--large_insertion_cutoff, -lins INTEGER
--output_file, -o TEXT

The parameter max_direct_repeat_length takes an integer. This specifies the maximum distance that oppositely-oriented insertion termini can be from each other in order to consider pairing them together. Since insertions often cause direct repeats at the insertion site, the location of the insertion termini are often separated by several base pairs. The default for this parameter is 20. This means that our algorithm by default will NOT identify true insertions that create direct repeats that exceed 20 base pairs. If you want to increase this parameter, you must also change the find parameter min_distance_to_mate to be min_alignment_inner_length + 1, the pair command parameter min_alignment_inner_length must match the find min_alignment_inner_length parameter, and pair command parameter max_direct_repeat_length should equal min_alignment_inner_length - 1.

The parameter min_alignment_quality takes an integer. This determines the minimum mapping quality of each alignment in order for it to be considered. The default used is 20. This should be the same as the min_alignment_quality parameter in find.

The parameter min_alignment_inner_length takes an integer. It filters out reads that are softclipped on both ends, and where the inner aligned portion is shorter than this length. Filtering out such reads removes a potential false positives. The default is 21, this parameter should not be changed, and it should be the same as the value of --min_alignment_inner_length used in find. If you want to increase this parameter, you must also change the find parameter min_distance_to_mate to be min_alignment_inner_length + 1, the pair command parameter min_alignment_inner_length must match the find min_alignment_inner_length parameter, and pair command parameter max_direct_repeat_length should equal min_alignment_inner_length - 1.

The parameter max_junction_spanning_prop takes a decimal between 0 and 1. This is used to filter out low-confidence insertion pairs, or insertions that are occurring in duplicated regions. We expect that very few reads will fully span the insertion junction, as that would indicate that the insertion site does not contain the insertion somewhere in the isolate. If the number of reads spanning the insertion junction without being clipped exceeds this proportion of the total reads at the site, then it will be ignored. By default, this parameter is 0.15.

The parameter large_insertion_cutoff takes an integer. Some reads are large enough to span both sides of an indel. For example, a 300 bp read may contain a 20 bp deletion relative to the reference. This parameter determines what should be considered an indel, and thus ignored, and what should be considered a large insertion and thus kept. The default is 30, meaning that is larger than 30 base pairs will be considered a large insertion.

The parameter output_file takes a text string. This is the final file where the find results will be saved. The default is mgefinder.pair.tsv.

pair: Description of implementation

The pair command uses a variety of techniques to pair termini with each other. It first filters insertion termini by the --max_direct_repeat_length parameter. It then does pairwise comparisons between all nearby, oppositely-oriented insertion termini to determine if they share inverted repeats at their termini, a common feature of many prokaryotic insertion sequences. It then prioritizes all pairs by the following parameters, in order:

  1. The length of the inverted repeat identified, if any. Pairs with longer shared inverted repeats receive higher priority.
  2. The difference in the number of softclipped reads that support the insertion termini. Pairs that have a similar number of reads supporting each terminus are given higher priority.
  3. The difference in the length of the recovered insertion termini. If the two insertion termini have very similar lengths, they are given higher priority than pairs with disparate terminus lengths.

It then filters sites with too many junction-spanning reads (see max_junction_spanning_prop), and infers the identity of the direct repeat created by the insertion, if any. We realize that these are somewhat arbitrary filters, and we welcome suggestions from the users on how this might be improved.

pair: Output file format

By default, pair will write to a file named mgefinder.pair.tsv (The name can be changed with the --output_file parameter). The columns of this file are described as follows:

  1. sample - A user-specified field used to identify the sample, carried over from the FINDFILE.
  2. pair_id - An arbitrary identifier used to identify the candidate insertion. This is an important identifier that is used to relate the output of the pair command to the output of the various inferseq commands.
  3. contig - The name of the contig where the pair was identified.
  4. pos_5p - The exact base pair position of where the clipped-ends of the 5' insertion terminus begin, the presumed start of the insertion. This location is 0-based with respect to the reference genome used.
  5. pos_3p - The exact base pair position of where the clipped-ends of the 3' insertion terminus begin, the presumed end of the insertion. This location is 0-based with respect to the reference genome used.
  6. softclip_count_5p - The number of clipped reads at this site that support the 5' insertion terminus. This is taken directly from the consensus_softclip_count of the find output file.
  7. softclip_count_3p - The number of clipped reads at this site that support the 3' insertion terminus. This is also taken directly from the consensus_softclip_count of the find output file.
  8. total_count_5p - The total number of reads found at the insertion site for the 5' terminus.
  9. total_count_3p - The total number of reads found at the inseriton site for the 3' terminus.
  10. spanning_count - The total reads that are found to span the 5' and 3' insertion sites by 10 bp in both directions.
  11. has_IR - A boolean True/False indicating whether or not the terminus pair contains inverted repeats at their end, a common feature of many inserted elements.
  12. IR_length - The length of the detected inverted repeat.
  13. IR_5p - The inverted repeat in the 5' terminus.
  14. IR_3p - The inverted repeat in the 3' terminus.
  15. seq_5p - The consensus sequence of the 5' terminus.
  16. seq_3p - The consensus sequence of the 3' terminus.
  17. direct_repeat_reference - The sequence of the direct repeat created, using the reference genome to infer this direct repeat.

This file will likely have intrinsic value to the user, as they indicate candidate insertions. However, these candidates must be further investigated before any final conclusions about their identity can be made. In some cases, for example, these may actually represent genomic inversions, not insertions. To infer the identity of these insertions, several inference approaches are implemented and described in the following sections.

inferseq-reference

inferseq reference image

This command infers the identity of insertions by aligning the termini of candidate pairs to a reference genome.

inferseq-reference: Input and parameters

The inferseq-reference command takes the output of the pair command and a reference genome as input.

While mgefinder will automatically index the reference genome if it is not already indexed, we recommend that you index the reference genome beforehand with the command:

bowtie2-build -o 0 -q INFERSEQ_REFERENCE INFERSEQ_REFERENCE

Where INFERSEQ_REFERENCE is the reference genome of interest.

BWA MEM should be used to generate this BAM file.

You can then run the command as

mgefinder inferseq-reference PAIRSFILE INFERSEQ_REFERENCE

Additional parameters include:

--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_inferseq_size, -maxsize INTEGER
--min_inferseq_size, -minsize INTEGER
--keep-intermediate/--no-keep-intermediate
--output_file, -o TEXT

The parameter min_perc_identity takes an float between 0 and 1. When aligning candidate insertion termini to the reference genome, it will only consider alignments that exceed this percentage identity with the reference. The default for this parameter is 0.95.

The parameter max_internal_softclip_prop takes an float between 0 and 1. This is an additional candidate insertion terminus alignment filter. If the aligned termini are internally clipped by a proportion of their total length that exceeds this number, then the alignment is excluded. For example, imagine the alignment:

-------->          <----------
terminus1            terminus2

Imagine the vertical bar | indicating that one of the alignments is clipped at a specific site, such as:

-------->         <---|-------
terminus1            terminus2

If the proportion of the clipped end of the terminus2 alignment exceeds 0.05 (by default), then this alignment will be excluded.

The parameter max_inferseq_size excludes inferred sequences that are larger than max_inferseq_size. The default for this parameter is 500 kilobase pairs, but we urge users to be cautious when working with very large insertions, as these may be false positives.

The parameter min_inferseq_size excludes inferred sequences that are smaller than min_inferseq_size. The default for this parameter is 1 base pair, but in reality mgefinder is not well suited to identify insertions that are smaller than the read length of the sequencing library.

The --keep-intermediate/--no-keep-intermediate will determine whether or not the intermediate alignment file will be deleted or kept, which can be useful for debugging purposes.

inferseq-reference: Description of implementation

A schematic of how this step is implemented is shown above, but more details are given here. First, if the reference genome is not already bowtie2 indexed, mgefinder will index the genome. Next, it will align the individual termini to the reference genome. It will then filter the alignments according to the min_perc_identity parameter, and it will also filter alignments that are clipped to on their outer edge. It will then iterate through the sorted alignments and pair termini with each other. This approach will actually take the minimum non-overlapping alignment pairs, and ignore all other combinations of alignments, as described in the schematic below:

min_overlap image

Next, candidate pairs will be filtered according to the max_internal_softclip_prop parameter, and the inferred sequence size filters (max_inferseq_size and min_inferseq_size). Finally using the sum of the alignment scores of both termini in each pair, we filter out all pairs that do not have the maximum alignment score. This leaves us with a set of final inferred sequences of equal quality.

inferseq-reference: Output file format

By default, inferseq-reference will write to a file named mgefinder.inferseq_reference.tsv (The name can be changed with the --output_file parameter). The columns of this file are described as follows:

  1. sample - A user-specified field used to identify the sample, carried over from the PAIRSFILE.
  2. pair_id - An identifier used to map the inferred sequence in this file to the candidate insertion pairs described in the pair file. That is, all lines in the mgefinder.inferseq_reference.tsv with the value 20 in the pair_id column represent an inferred sequence identified by the inferseq_reference command that corresponds to the candidate insertion pair identified in the pair file that is labeled with 20 in the pair_id column.
  3. method - The method used to infer the sequence, which will always be inferred_reference when running the inferseq-reference command.
  4. loc - The location of the inferred sequence as it was found in the reference genome. This is in the format CONTIG:START-END where CONTIG is the contig where the inferred sequence was located, START is the base pair start of the inferred sequence, and END is the base pair end of the inferred sequence.
  5. inferred_seq_length - The length of the inferred sequence.
  6. inferred_seq - The identity of the inferred sequence in full.

This file is of value to the user, as it infers the full identity of the candidate insertions by referring to a reference genome. In certain situations this may be sufficient, such as when the sequenced isolate differs only slightly from the reference genome (like in a resequencing experiment). But if the inserted element may not exist in an available reference genome, alternative inference approaches may be of interest (see below).

It should also be noted that the reference genome used here does not necessarily have to be the reference genome that the sequencing reads were aligned to.

inferseq-assembly

inferseq assembly image

This command infers the identity of insertions by aligning the termini of candidate pairs to a sequence assembly of the isolate of interest. We recommend using SPAdes to assemble the genome of the isolate of interest.

inferseq-assembly: Input and parameters

The inferseq-assembly command takes as input the output of the pair command, the BAM file of the sequencing reads aligned to the reference genome, an assembly file in FASTA format of the sequencing reads derived from the isolate of interest, and the reference genome used when initially aligning the isolates reads. In this step, it is imperative that the sequence assembly comes from the isolate of interest.

While mgefinder will automatically index the sequence assembly if it is not already indexed, we recommend that you index the assembly beforehand with the command:

bowtie2-build -o 0 -q INFERSEQ_ASSEMBLY INFERSEQ_ASSEMBLY

You can then run the command as

mgefinder inferseq-assembly PAIRSFILE BAMFILE INFERSEQ_ASSEMBLY INFERSEQ_REFERENCE

Additional parameters include:

--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_inferseq_size, -maxsize INTEGER
--min_inferseq_size, -minsize INTEGER
--keep-intermediate/--no-keep-intermediate
--output_file, -o TEXT

These additional parameters are identical in function to the parameters described in the "inferseq-reference: Input and parameters" section above. Please refer to this section for more details.

inferseq-assembly: Description of implementation

Details about how this step is implemented are shown in Figures a and b above. The sequence of steps is very similar to those described in the "inferseq-reference: Description of implementation" section above, with important differences. Two alignments to the assembly are performed. In the first step, 25 base pairs of reference genome are appended to the beginning of the termini before alignment (See Figure a above). This allows us to determine if the insertion has assembled within the expected sequence context, which gives us greater confidence that the inferred identity of the insertion is of high quality. Then the termini are aligned without the appended context sequence to infer the identity of the insertion. Both of these types of inferred sequences are then reported in the output file.

inferseq-assembly: Output file format

By default, inferseq-assembly will write to a file named mgefinder.inferseq_assembly.tsv (The name can be changed with the --output_file parameter). The columns of this file are the same as those found in the "inferseq-reference: Output file format" section above, but with the following differences:

  1. method - The method used to infer the sequence. With the inferseq-assembly command, this will be one of three different values: 1) inferred_assembly_with_full_context, meaning that the inserted was inferred from the sequence assembly with the 25 base pairs of context sequence appended on both sides, the highest quality inferred sequence. 2) inferred_assembly_with_half_context means that the sequence was inferred in context, but one side of the inserted element was found at the end of a contig, meaning that the sequence context of one side is still ambiguous. 3) inferred_assembly_without_context indicates the the sequence was inferred from the assembly, but without the additional sequence context. This is common when the inserted sequence is a large repeated element and it cannot be properly assembled.
  2. loc - The location of the inferred sequence as it was found in the sequence assembly. This is in the format CONTIG:START-END where CONTIG is the contig where the inferred sequence was located, START is the base pair start of the inferred sequence, and END is the base pair end of the inferred sequence.

This file is of value to the user, as it infers the full identity of the candidate insertions by referring to sequence assembly. This is of value when the sequenced isolate and the reference genome used are assumed to be quite different from each other. It is limited, however, by the quality of the genome assembly. Large mobile genetic elements that cannot be properly assembled will be missed.

inferseq-overlap

In certain situations, the identity of the inserted sequence can be inferred by attempting to overlap the termini. For example:

terminus1 ----------------->
                       ||||
                      <-------------------terminus2

In the situation depicted above, terminus1 and terminus2 overlap, and they can be merged into a single insertion.

This command performs this merging operation, if possible.

inferseq-overlap: Input and parameters

The inferseq-overlap command takes as input only the output of the pair command.

You can run the command as

mgefinder inferseq-overlap PAIRSFILE

Additional parameters include:

--min_overlap_score, -minscore INTEGER
--min_overlap_perc_identity, -minopi FLOAT
--minsize, -min_inferseq_size FLOAT
--output_file, -o TEXT

The parameter min_overlap_score determines the minimum alignment score necessary for two termini to be merged. Matched positions increase the score by 1, and mismatch positions decrease the score by 1.

The parameter min_overlap_perc_identity is the minimum sequence identity between the overlapping portion of the two termini that is required for them to merge.

The parameter min_inferseq_size is the minimum overlapped consensus sequence size that will be kept in the final output, by default the parameter is 30, and anything smaller than this size will be excluded from the results.

inferseq-overlap: Description of implementation

The merging procedure taken here is quite simple. Termini are scanned across each other one base pair at a time. The overlapping portions are checked for matching base pairs, with matches increasing the overlap score by one, and mismatches decreasing the score by one. The parameters min_overlap_score and min_overlap_perc_identity are used to determine if the termini are merged into a single sequence.

inferseq-overlap: Output file format

By default, inferseq-overlap will write to a file named mgefinder.inferseq_overlap.tsv (The name can be changed with the --output_file parameter). The columns of this file are the same as those found in the "inferseq-reference: Output file format" section above, but with the following differences:

  1. method - The method used to infer the sequence which will always be inferrred_overlap when using the inferseq-overlap command.
  2. loc - The portions of terminus1 and terminus2 that are kept to form the final merged sequence. This is in the form seq_5p:START_5p-END_5p;seq_3p:START_3p-END_3p.

This is valuable to the user primarily for filtering out insertions that are smaller in size, as this step is fundamentally limited by the read length of the library. Mustache is not well-suited to identify smaller insertions, and this step can be used to filter out these small insertions so they do not interfere with downstream analyses.

inferseq-database

inferseq database image

This command infers the identity of insertions by aligning the termini of candidate pairs to a database of query sequences. Ideally, this database is filtered to remove redundant sequences, which may dramatically reduce the amount of time required to run this step.

inferseq-database: Input and parameters

The inferseq-database command takes as input the output of the pair command, and a FASTA file of the query sequences of interest.

While mgefinder will automatically index the query sequence database if it is not already indexed, we recommend that you index the database beforehand with the command:

bowtie2-build -o 0 -q INFERSEQ_DATABASE INFERSEQ_DATABASE

You can then run the command as:

mgefinder inferseq-database PAIRSFILE INFERSEQ_DATABASE

Additional parameters include:

--min_perc_identity, -minident FLOAT
--max_internal_softclip_prop, -maxclip FLOAT
--max_edge_distance, -maxedgedist INTEGER
--keep-intermediate / --no-keep-intermediate
--output_file, -o TEXT

The parameter min_perc_identity takes an float between 0 and 1. When aligning candidate insertion termini to the query sequence database, it will only consider alignments that exceed this percentage identity with the reference. The default for this parameter is 0.90, which is more lenient than the cutoff used in the inferseq-reference and inferseq-assembly commands.

The parameter max_internal_softclip_prop takes an float between 0 and 1. This is an additional candidate insertion terminus alignment filter. If the aligned termini are internally clipped by a proportion of their total length that exceeds this number, then the alignment is excluded. See "inferseq-reference: Input and parameters" for more details on how this parameter works.

The parameter max_edge_distance takes an integer. This determines how close the aligned termini must be to the edge of the sequence in order to keep the alignments (See Figure d above). The default for this parameter is 10.

The --keep-intermediate/--no-keep-intermediate will determine whether or not the intermediate alignment file will be deleted or kept, which can be useful for debugging purposes.

inferseq-database: Description of implementation

This sequence inference approach assumes that the inserted sequences of interest exist within the query database itself. It aligns terminus pairs to this database, and determines whether or not they align to a given a sequence. Aligned termini that do not meet the min_perc_identity cutoff are removed, and only inferred sequences where both terminus pairs align to the edge of the sequence of interest are kept, where the aligned termini are allowed to be within max_edge_distance base pairs of the end of the query sequence.

inferseq-database: Output file format

By default, inferseq-database will write to a file named mgefinder.inferseq_database.tsv (The name can be changed with the --output_file parameter). The columns of this file are the same as those found in the "inferseq-reference: Output file format" section above, but with the following differences:

  1. method - The method used to infer the sequence which will always be inferrred_daatabase when using the inferseq-database command.

All sequences in the database that are matched equal matches for a given terminus pair will be returned.

This is a valuable approach to taken when information about the inserted elements is already known beforehand. This database can include several insertion sequences that are relevant to the species under study, for example. It is quite sensitive, provided that read coverage is high enough (we recommend above 40x) and the inserted sequence of interest exists in the query database.

Additional/ Experimental mgefinder Commands

formatbam

This command is intended to prepare SAM files for analysis by mgefinder as input. While this step is not strictly necessary under all circumstances, some features of mgefinder require it (such as the experimental extendpairs command). We recommend that all SAM/BAM files are processed using this command prior to further analysis if it is computationally feasible.

formatbam: Input and parameters

This command requires only the alignment file SAM/BAM of interest. This is a file thatyou wish to analyze using the find command in mgefinder. You can run this command as follows:

mgefinder formatbam IN_SAM OUT_BAM

Where IN_SAM is the input SAM/BAM file of interest, and OUT_BAM is the name of the output file.

Additional parameters include --single-end, which is used when the file is a single-end and not paired-end, and --keep-tmp-files which will keep intermediate files without rather than deleting them.

formatbam: Description of implementation

This command first removes secondary alignments from the alignment file. It then formats the file for use by mgefinder, in particular it stores information about each reads mate so that it is immediately accessible, a step that is necessary when using the experimental extendpairs command. It will then sort and index the BAM files.

formatbam: Output file format

The output of this command is just a BAM file that is ready to be analyzed by mgefinder.

recall

The recall command is used to collect information about potential insertion sites from a BAM file. For example, if you have many insertion sites of interest, you may want to determine if these sites were actually deleted in the reference genome of some of the isolates. This can be useful when comparing isolates to each other. This command was used in our own GWAS of antibiotic resistance in E. coli to determine if some of the insertion sites were deleted in some of the isolates. This can allow you to then filter out these unreliable insertion sites.

recall: Input and parameters

The recall command takes as input the output of the pair command, and the original BAM file of the reads aligned to the reference genome. The pair output file actually only requires three columns: contig, pos_5p, and pos_3p. The recall command will analyze the sites specified in these columns.

This command can be executed as:

mgefinder recall PAIRSFILE BAMFILE

Additional parameters include:

--min_alignment_quality, -minq
--min_alignment_inner_length, -minial

Which should be the same as the values used when originally running the find command. By default min_alignment_quality is 20 and min_alignment_inner_length is 21.

recall: Description of implementation

Recall works very similarly to the find algorithm, but it does not attempt to reconstruct consensus termini. It's only goal is to gather read information about the sites present in the PAIRSFILE. This can then be used in other analyses.

recall: Output file format

The output format is quite similar to the output of the find command. Each row is a specific site (note that it splits each pair into its individual insertion sites). It includes columns about read counts, such as softclip_count_5p, softclip_count_3p, runthrough_count, small_insertion_count_5p, small_insertion_count_3p, deletion_count, upstream_deletion_count, downstream_deletion_count, and total_count. See the section "find: Output file format" for more details on what each of these columns means.

Back to main page

Clone this wiki locally