title | subtitle | date |
---|---|---|
Manta User Guide | (Document from [Manta GitHub repository](https://github.com/Illumina/manta)) | 2021 12_Dec 02 |
- Introduction
- Installation
- Method Overview
- Capabilities
- Input requirements
- Outputs
- Runtime hardware requirements
- Run configuration and execution
Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow. The method is designed for rapid analysis on standard compute hardware: NA12878 at 50x genomic coverage is analyzed in less than 20 minutes on a 20 core server, and most WGS tumor/normal analyses can be completed within 2 hours. Manta combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise. It provides scoring models for germline variants in small sets of diploid samples and somatic variants in matched tumor/normal sample pairs. There is experimental support for analysis of unmatched tumor samples as well (see details below). Manta accepts input read mappings from BAM or CRAM files and reports all SV and indel inferences in VCF 4.1 format.
Methods and benchmarking details are described in:
Chen, X. et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32, 1220-1222. doi:10.1093/bioinformatics/btv710
...and the corresponding open-access pre-print.
Please see the Manta installation instructions.
Manta divides the SV and indel discovery process into two primary steps: (1) scanning the genome to find SV associated regions and (2) analysis, scoring and output of SVs found in such regions.
- Build breakend association graph: In this step the entire genome is scanned to discover evidence of possible SVs and large indels. This evidence is enumerated into a graph with edges connecting all regions of the genome which have a possible breakend association. Edges may connect two different regions of the genome to represent evidence of a long-range association, or an edge may connect a region to itself to capture a local indel/small SV association. Note that these associations are more general than a specific SV hypothesis, in that many breakend candidates may be found on one edge, although typically only one or two candidates are found per edge.
- Analyze graph edges to find SVs: The second step is to analyze individual graph edges or groups of highly connected edges to discover and score SVs associated with the edge(s). The substeps of this process include inference of SV candidates associated with the edge, attempted assembly of the SV breakends, scoring/genotyping and filtration of the SV under various biological models (currently diploid germline and somatic), and finally, output to VCF.
Manta is capable of detecting all structural variant types which are identifiable in the absence of copy number analysis and large-scale de-novo assembly. Detectable types are enumerated further below.
For each structural variant and indel, Manta attempts to assemble the
breakends to basepair resolution and report the left-shifted breakend
coordinate (per the VCF 4.1 SV reporting guidelines), together
with any breakend homology sequence and/or inserted sequence between
the breakends. It is often the case that the assembly will fail to
provide a confident explanation of the data -- in such cases the
variant will be reported as IMPRECISE, and scored according to the
paired-end read evidence only.
The sequencing reads provided as input to Manta are expected to be from a paired-end sequencing assay which results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.
Manta is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:
- Joint analysis of small sets of diploid individuals (where 'small' means family-scale -- roughly 10 or fewer samples)
- Subtractive analysis of a matched tumor/normal sample pair
- Analysis of an individual tumor sample
For the first use case above, note that there is no specific restriction against using Manta for the joint analysis of larger cohorts, but this has not been extensively tested so there may be stability or call quality issues.
Per the final use case above, tumor samples can be analyzed without a matched normal sample. In this case no scoring function is available, but the supporting evidence counts and many filters can still be usefully applied.
RNA-Seq analysis is still in development and not fully supported. It
can be configured with the --rna flag. This will adjust filtration
levels and take other RNA-specific filtration and intron handling steps
(more details are provided further below).
Manta is able to detect all variation classes which can be explained as novel DNA adjacencies in the genome. Simple insertion/deletion events can be detected down to a configurable minimum size cutoff (defaulting to 8). All DNA adjacencies are classified into the following categories based on the breakend pattern:
- Deletions
- Insertions
  - Fully-assembled insertions
  - Partially-assembled (i.e. inferred) insertions
- Inversions
- Tandem Duplications
- Interchromosomal Translocations
Manta is not expected to detect the following variant types:
- Dispersed duplications
- Most expansion/contraction variants of a reference tandem repeat
- Small inversions
  - The limiting size is not tested, but in theory detection falls off below ~200 bases. So-called micro-inversions might be detected indirectly as combined insertion/deletion variants.
- Fully-assembled large insertions
  - The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but note that power to fully assemble the insertion should fall off to impractical levels before this size.
  - Note that Manta does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
More general repeat-based limitations exist for all variant types:
- Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
- Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.
Note that while Manta classifies novel DNA-adjacencies, it does not infer the higher level constructs implied by the classification. For instance, a variant marked as a deletion by Manta indicates an intrachromosomal translocation with a deletion-like breakend pattern; however, there is no test of depth, b-allele frequency or intersecting adjacencies to directly infer the SV type.
The sequencing reads provided as input to Manta are expected to be from a paired-end sequencing assay with an "innie" orientation between the two reads of each DNA fragment, each presenting a read from the outer edge of the fragment insert inward.
Manta can tolerate non-paired reads in the input, so long as sufficient paired-end reads exist to estimate the paired fragment size distribution. Non-paired reads will still be used in discovery, assembly and split-read scoring if their alignments (or SA tag split alignments) support a large indel or SV, or mismatch/clipping suggests a possible breakend location.
Manta requires input sequencing reads to be mapped by an external tool
and provided as input in either BAM or CRAM format. Each input file must be
coordinate sorted and indexed to produce a samtools/htslib-style index in a
file named to match the input BAM or CRAM file with an additional '.bai', '.crai'
or '.csi' filename extension.
At configuration time, at least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file will be treated as a separate sample as part of a joint diploid sample analysis.
The following limitations exist on the input BAM or CRAM files provided to Manta:
- Alignments cannot have an unknown read sequence (SEQ="*")
- Alignments cannot contain the "=" character in the SEQ field.
- Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation
- RG (read group) tags in the alignment records are ignored -- each file will be treated as representing one sample.
- Alignments with basecall quality values greater than 70 are rejected (these are not supported on the assumption that this indicates an offset error)
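The limitations above can be screened for before configuring a run. The following is a minimal sketch, not part of Manta: it assumes the alignments are available as SAM text records (for example from `samtools view`), and the function name `manta_input_problems` is hypothetical. RG tags are not checked because they are ignored rather than rejected.

```python
import re

MAX_BASE_QUALITY = 70  # threshold quoted in the list of limitations above


def manta_input_problems(sam_line: str) -> list[str]:
    """Return which of the listed input limitations one SAM text record violates."""
    fields = sam_line.rstrip("\n").split("\t")
    # SAM mandatory columns: CIGAR is column 6, SEQ column 10, QUAL column 11.
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    elif "=" in seq:
        problems.append("'=' in SEQ")
    if re.search(r"[=X]", cigar):
        problems.append("'='/'X' CIGAR operation")
    # SAM text stores qualities as Phred+33 characters; '*' means no qualities stored.
    if qual != "*" and any(ord(c) - 33 > MAX_BASE_QUALITY for c in qual):
        problems.append("basecall quality above 70")
    return problems
```

An empty list means the record shows none of the problems checked here; it does not guarantee that Manta will accept the file.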
Manta also requires a reference sequence in fasta format. This must be
the same reference used for mapping the input alignment files. The reference
must include a samtools/htslib-style index in a file named to match the
input fasta with an additional '.fai' file extension.
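The index naming rules for alignment and reference files can be checked with a few lines of code. This is a sketch under the assumption that the index sits next to the data file with the extension simply appended, as described above; the helper names are hypothetical and not part of Manta.

```python
import os

# Extensions quoted in the text above.
ALIGNMENT_INDEX_EXTS = (".bai", ".crai", ".csi")
REFERENCE_INDEX_EXTS = (".fai",)


def index_candidates(path: str, extensions: tuple[str, ...]) -> list[str]:
    """List the index file names that would match the given data file."""
    return [path + ext for ext in extensions]


def has_index(path: str, extensions: tuple[str, ...]) -> bool:
    """True if at least one matching index file exists on disk."""
    return any(os.path.exists(p) for p in index_candidates(path, extensions))
```

For example, `has_index("hg19.fa", REFERENCE_INDEX_EXTS)` looks for `hg19.fa.fai`.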
The primary Manta outputs are a set of VCF 4.1 files, found in
${MANTA_ANALYSIS_PATH}/results/variants. Currently there are 3 VCF
files created for a germline analysis, and an additional somatic VCF
is produced for a tumor/normal subtraction. These files are:
- diploidSV.vcf.gz
  - SVs and indels scored and genotyped under a diploid model for the set of samples in a joint diploid sample analysis or for the normal sample in a tumor/normal subtraction analysis. In the case of a tumor/normal subtraction, the scores in this file do not reflect any information from the tumor sample.
- somaticSV.vcf.gz
  - SVs and indels scored under a somatic variant model. This file will only be produced if a tumor sample alignment file is supplied during configuration.
- candidateSV.vcf.gz
  - Unscored SV and indel candidates. Only a minimal amount of supporting evidence is required for an SV to be entered as a candidate in this file. An SV or indel must be a candidate to be considered for scoring, therefore an SV cannot appear in the other VCF outputs if it is not present in this file. Note that by default this file includes indels of size 8 and larger. The smallest indels in this set are intended to be passed on to a small variant caller without scoring by Manta itself (by default Manta scoring starts at size 50).
- candidateSmallIndels.vcf.gz
  - Subset of the candidateSV.vcf.gz file containing only simple insertion and deletion variants less than the minimum scored variant size (50 by default). Passing this file to a small variant caller will provide continuous coverage over all indel sizes when the small variant caller and Manta outputs are evaluated together. Alternate small indel candidate sets can be parsed out of the candidateSV.vcf.gz file if this candidate set is not appropriate.
For tumor-only analysis, Manta will produce an additional VCF:
- tumorSV.vcf.gz
  - Subset of the candidateSV.vcf.gz file after removing redundant candidates and small indels less than the minimum scored variant size (50 by default). The SVs are not scored, but include additional details: (1) paired and split read supporting evidence counts are reported for each allele, and (2) a subset of the filters from the scored tumor-normal model is applied to the single tumor case to improve precision.
Manta VCF output follows the VCF 4.1 spec for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The section below highlights some of the variant representation details and lists the primary VCF field values.
Sample names printed into the VCF output are extracted from each input alignment file from the first read group ('@RG') record found in the header. Any spaces found in the name will be replaced with underscores. If no sample name is found, a default label (SAMPLE1, SAMPLE2, etc.) will be used instead.
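The naming rule can be illustrated with a short sketch. It is not Manta's implementation: it assumes the sample name is the `SM` field of the `@RG` header line (the SAM convention) and that the default label is numbered by the file's position in the input list. The function name `sample_label` is hypothetical.

```python
def sample_label(header_text: str, file_index: int) -> str:
    """Pick a sample label from SAM header text, following the rule described above."""
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SM:") and len(field) > 3:
                # Spaces in the sample name are replaced with underscores.
                return field[3:].replace(" ", "_")
        break  # only the first @RG record is consulted
    # Assumed numbering: first input file -> SAMPLE1, second -> SAMPLE2, ...
    return f"SAMPLE{file_index + 1}"
```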
All variants are reported in the VCF using symbolic alleles unless
they are classified as a small indel, in which case full sequences are
provided for the VCF REF and ALT allele fields. A variant is
classified as a small indel if all of these criteria are met:
- The variant can be entirely expressed as a combination of inserted and deleted sequence.
- The deletion or insertion length is not 1000 or greater.
- The variant breakends and/or the inserted sequence are not imprecise.
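The three criteria can be expressed as a single predicate. This is a sketch only: the `Candidate` type and its field names are hypothetical, and the second criterion is read here as requiring both the deleted and the inserted length to be below 1000.

```python
from dataclasses import dataclass

MAX_SMALL_INDEL_LEN = 1000  # lengths of 1000 or greater use symbolic alleles


@dataclass
class Candidate:
    is_simple_indel: bool  # expressible purely as inserted plus deleted sequence
    deleted_len: int
    inserted_len: int
    is_imprecise: bool     # breakends and/or inserted sequence are imprecise


def uses_small_indel_format(c: Candidate) -> bool:
    """True when a record would carry full REF/ALT sequences instead of a symbolic allele."""
    return (
        c.is_simple_indel
        and c.deleted_len < MAX_SMALL_INDEL_LEN
        and c.inserted_len < MAX_SMALL_INDEL_LEN
        and not c.is_imprecise
    )
```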
When VCF records are printed in the small indel format, they will also
include the CIGAR INFO tag describing the combined insertion and
deletion event.
Large insertions are reported in some cases even when the insert
sequence cannot be fully assembled. In this case Manta reports the
insertion using the <INS> symbolic allele and includes the special
INFO fields LEFT_SVINSSEQ and RIGHT_SVINSSEQ to describe the
assembled left and right ends of the insert sequence. The following is
an example of such a record from the joint diploid analysis of
NA12878, NA12891 and NA12892 mapped to hg19:
chr1 11830208 MantaINS:1577:0:0:0:3:0 T <INS> 999 PASS END=11830208;SVTYPE=INS;CIPOS=0,12;CIEND=0,12;HOMLEN=12;HOMSEQ=TAAATTTTTCTT;LEFT_SVINSSEQ=TAAATTTTTCTTTTTTCTTTTTTTTTTAAATTTATTTTTTTATTGATAATTCTTGGGTGTTTCTCACAGAGGGGGATTTGGCAGGGTCACGGGACAACAGTGGAGGGAAGGTCAGCAGACAAACAAGTGAACAAAGGTCTCTGGTTTTCCCAGGCAGAGGACCCTGCGGCCTTCCGCAGTGTTCGTGTCCCTGATTACCTGAGATTAGGGATTTGTGATGACTCCCAACGAGCATGCTGCCTTCAAGCATCTGTTCAACAAAGCACATCTTGCACTGCCCTTAATTCATTTAACCCCGAGTGGACACAGCACATGTTTCAAAGAG;RIGHT_SVINSSEQ=GGGGCAGAGGCGCTCCCCACATCTCAGATGATGGGCGGCCAGGCAGAGACGCTCCTCACTTCCTAGATGTGATGGCGGCTGGGAAGAGGCGCTCCTCACTTCCTAGATGGGACGGCGGCCGGGCGGAGACGCTCCTCACTTTCCAGACTGGGCAGCCAGGCAGAGGGGCTCCTCACATCCCAGACGATGGGCGGCCAGGCAGAGACACTCCCCACTTCCCAGACGGGGTGGCGGCCGGGCAGAGGCTGCAATCTCGGCACTTTGGGAGGCCAAGGCAGGCGGCTGCTCCTTGCCCTCGGGCCCCGCGGGGCCCGTCCGCTCCTCCAGCCGCTGCCTCC GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:22,24:22,32 0/1:PASS:999:999,0,999:18,25:24,20 0/0:PASS:230:0,180,999:39,0:34,0
Inversions are reported as breakends by default. For a simple reciprocal inversion, four breakends will be reported, and they will share the same EVENT INFO tag. The following is an example of a simple reciprocal inversion:
chr1 17124941 MantaBND:1445:0:1:1:3:0:0 T [chr1:234919886[T 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:1:3:0:1;CIPOS=0,1;HOMLEN=1;HOMSEQ=T;INV5;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=107;MATE_BND_DEPTH=100 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51
chr1 17124948 MantaBND:1445:0:1:0:0:0:0 T T]chr1:234919824] 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:0:0:0:1;INV3;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=109;MATE_BND_DEPTH=83 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46
chr1 234919824 MantaBND:1445:0:1:0:0:0:1 G G]chr1:17124948] 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:0:0:0:0;INV3;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=83;MATE_BND_DEPTH=109 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46
chr1 234919885 MantaBND:1445:0:1:1:3:0:1 A [chr1:17124942[A 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:1:3:0:0;CIPOS=0,1;HOMLEN=1;HOMSEQ=A;INV5;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=100;MATE_BND_DEPTH=107 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51
A supplementary script, provided as $MANTA_INSTALL_FOLDER/libexec/convertInversion.py, can be applied to Manta's output vcf files to reformat inversions into single inverted sequence junctions, which was the format used in Manta versions <= 1.4.0. Two INFO tags are introduced for this format: the INV3 tag indicates inversion breakends open at the 3' of the reported location, whereas the INV5 tag indicates inversion breakends open at the 5' of the reported location. More specifically, in the inversion examples illustrated at https://software.broadinstitute.org/software/igv/interpreting_pair_orientations, the INV5 tag corresponds to the IGV "RR"/dark blue reads, and the INV3 tag corresponds to the IGV "LL"/light blue reads. This format was informative because single inverted junctions are often identified as part of a complex SV in real data, whereas simple reciprocal inversions are uncommon outside of simulated data. For a simple reciprocal inversion, both INV3 and INV5 junctions are expected to be reported, and they will share the same EVENT INFO tag. The following is the converted format of the above example of a simple reciprocal inversion:
chr1 17124940 MantaINV:1445:0:1:1:3:0 C <INV> 999 PASS END=234919885;SVTYPE=INV;SVLEN=217794945;CIPOS=0,1;CIEND=-1,0;HOMLEN=1;HOMSEQ=T;EVENT=MantaINV:1445:0:1:0:0:0;JUNCTION_QUAL=254;INV5 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51
chr1 17124948 MantaINV:1445:0:1:0:0:0 T <INV> 999 PASS END=234919824;SVTYPE=INV;SVLEN=217794876;EVENT=MantaINV:1445:0:1:0:0:0;JUNCTION_QUAL=999;INV3 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46

VCF INFO fields:

ID | Description |
---|---|
IMPRECISE | Flag indicating that the structural variation is imprecise, i.e. the exact breakpoint location is not found |
SVTYPE | Type of structural variant |
SVLEN | Difference in length between REF and ALT alleles |
END | End position of the variant described in this record |
CIPOS | Confidence interval around POS |
CIEND | Confidence interval around END |
CIGAR | CIGAR alignment for each alternate indel allele |
MATEID | ID of mate breakend |
EVENT | ID of event associated to breakend |
HOMLEN | Length of base pair identical homology at event breakpoints |
HOMSEQ | Sequence of base pair identical homology at event breakpoints |
SVINSLEN | Length of insertion |
SVINSSEQ | Sequence of insertion |
LEFT_SVINSSEQ | Known left side of insertion for an insertion of unknown length |
RIGHT_SVINSSEQ | Known right side of insertion for an insertion of unknown length |
BND_DEPTH | Read depth at local translocation breakend |
MATE_BND_DEPTH | Read depth at remote translocation mate breakend |
JUNCTION_QUAL | If the SV junction is part of an EVENT (i.e. a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only |
SOMATIC | Flag indicating a somatic variant |
SOMATICSCORE | Somatic variant quality score |
JUNCTION_SOMATICSCORE | If the SV junction is part of an EVENT (i.e. a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only |
CONTIG | Assembled contig sequence, if the variant is not imprecise (with --outputContig ) |

VCF FORMAT fields:

ID | Description |
---|---|
GT | Genotype |
FT | Sample filter, 'PASS' indicates that all filters have passed for this sample |
GQ | Genotype Quality |
PL | Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification |
PR | Number of spanning read pairs which strongly (Q30) support the REF or ALT alleles |
SR | Number of split-reads which strongly (Q30) support the REF or ALT alleles |
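
The PR and SR values can be pulled out of a sample column with a few lines of code. The sketch below assumes each value is a comma-separated "REF,ALT" pair, which matches the homozygous reference sample in the insertion example above (`39,0` and `34,0`); the function name `support_counts` is hypothetical. SR is simply absent from the result when the record does not carry it.

```python
def support_counts(format_keys: str, sample_value: str) -> dict[str, tuple[int, int]]:
    """Map 'PR'/'SR' to (ref_count, alt_count) for one VCF sample column."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    counts = {}
    for key in ("PR", "SR"):
        if key in fields:
            # Assumed order: REF support first, ALT support second.
            ref, alt = (int(x) for x in fields[key].split(","))
            counts[key] = (ref, alt)
    return counts
```

For the first sample of the insertion example, `support_counts("GT:FT:GQ:PL:PR:SR", "0/1:PASS:999:999,0,999:22,24:22,32")` gives 24 ALT-supporting read pairs and 32 ALT-supporting split reads.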

VCF FILTER fields:

ID | Level | Description |
---|---|---|
MinQUAL | Record | QUAL score is less than 20 |
MinGQ | Sample | GQ score is less than 15 |
MinSomaticScore | Record | SOMATICSCORE is less than 30 |
Ploidy | Record | For DEL & DUP variants, the genotypes of overlapping variants (with similar size) are inconsistent with diploid expectation |
MaxDepth | Record | Depth is greater than 3x the median chromosome depth near one or both variant breakends |
MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4 |
NoPairSupport | Record | For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample |
SampleFT | Record | No sample passes all the sample-level filters |
HomRef | Sample | Homozygous reference call |
As described above, there are two levels of filters: record level (FILTER) and sample level (FORMAT/FT). Record-level filters are generally independent of sample-level filters. However, if none of the samples passes all sample-level filters, the 'SampleFT' filter will be applied at the record level.
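The interaction between the two filter levels can be sketched as follows. This illustrates the rule stated above rather than Manta's own code; the function name `record_filters` is hypothetical, and each sample's FT value is taken as given.

```python
def record_filters(sample_fts: list[str], other_record_filters: list[str]) -> list[str]:
    """Combine record-level filters with the SampleFT rule described above."""
    filters = list(other_record_filters)
    # SampleFT applies only when no sample passes all of its sample-level filters.
    if not any(ft == "PASS" for ft in sample_fts):
        filters.append("SampleFT")
    return filters or ["PASS"]
```

A record therefore keeps a PASS status as long as at least one sample passes and no other record-level filter fires.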
Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the
sample. Manta uses the INFO/EVENT field to indicate that two or more such junctions are hypothesized to occur
together as part of a single variant event. All individual variant records belonging to the same event will share
the same INFO/EVENT string. Note that although such an inference could be applied after SV calling by analyzing
the relative distance and orientation of the called variant breakpoints,
Manta incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale
events. Given that at least one junction in the event has already passed standard variant candidacy thresholds,
sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern
consistent with a multi-junction event (such as a reciprocal translocation pair).
Note that although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to 2. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.
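Downstream tools can recover these events by grouping records on the INFO/EVENT string. The following is a minimal sketch that works on (ID, INFO) pairs taken from VCF text lines; the helper names are hypothetical, and records without an EVENT key are simply skipped.

```python
from collections import defaultdict


def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO column into key/value pairs; flags map to an empty string."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def group_by_event(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group record IDs by their shared INFO/EVENT value."""
    events = defaultdict(list)
    for vcf_id, info in records:
        event = parse_info(info).get("EVENT")
        if event is not None:
            events[event].append(vcf_id)
    return dict(events)
```

The EVENT value is used as an opaque key here, so the sketch does not depend on the internal structure of Manta's IDs.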
The VCF ID or 'identifier' field can be used for annotation, or in the case of BND ('breakend') records for translocations, the ID value is used to link breakend mates or partners.
An example Manta VCF ID is "MantaINS:1577:0:0:0:3:0". The value provided in this field reflects the SV association graph edge(s) from which the SV or indel was discovered. The ID value provided by Manta is primarily intended for internal use by Manta developers. The value is guaranteed to be unique within any VCF file produced by Manta, and these ID values are used to link associated breakend records using the standard VCF MATEID key. The structure of this ID may change in the future. It is safe to use the entire value as a unique key, but parsing this value may lead to incompatibilities with future updates.
The exact meaning of the ID field for the current Manta version is described in the following section of the Manta developer guide.
It can sometimes be convenient to express structural variants in BEDPE
format. For such applications we recommend the script vcfToBedpe, available from:
https://github.com/ctsa/svtools
This repository is forked from @hall-lab with edits to support VCF 4.1 SV format and match Manta's portability constraints.
Note that BEDPE format greatly reduces structural variant information compared to Manta's VCF output. In particular breakend orientation, breakend homology and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason we only recommend BEDPE as a temporary intermediate output for applications which require it.
Additional secondary output is provided in ${MANTA_ANALYSIS_PATH}/results/stats
- alignmentStatsSummary.txt
  - fragment length quantiles for each input alignment file
- svLocusGraphStats.tsv
  - statistics and runtime information pertaining to the SV locus graph
- svCandidateGenerationStats.tsv
  - statistics and runtime information pertaining to the SV candidate generation
- svCandidateGenerationStats.xml
  - xml data backing the svCandidateGenerationStats.tsv report
Manta workflows are parallelized at the process level using the pyFlow task manager. pyFlow can distribute Manta workflows to a specified number of cores on a single host or SGE-managed cluster.
As a useful runtime benchmark, Platinum Genomes sequencing reads for NA12878 at 50x coverage (whole genome) can be analyzed in less than 20 minutes on 20 physical cores using a dual Xeon E5-2680 v2 server with the BAM accessed from a conventional local drive. Peak total memory (RSS) for this run was 2.35 Gb. Additional hardware notes:
- Memory: Typical memory requirements are <1Gb/core for germline analysis and <2Gb/core for cancer/FFPE/highly rearranged samples. The exact requirement depends on many factors including sequencing depth, read length, fragment size and sample quality.
- CPU: Manta does not require or benefit from any specific modern CPU feature (e.g. NUMA, AVX), but in general a faster clock and larger caches will improve performance.
- I/O: I/O can be roughly approximated as 1.1 reads of the input alignment file per analysis, with no writes that are significant relative to the alignment file size.
Manta is run in a two-step procedure: (1) configuration and (2) workflow execution. The configuration step is used to specify the input data and any options pertaining to the variant calling methods themselves. The execution step is used to specify any parameters pertaining to how Manta is executed (such as the total number of cores or SGE nodes over which the jobs should be parallelized). The execution step can also be interrupted and restarted without changing the final result of the workflow.
The workflow is configured with the script
${MANTA_INSTALL_PATH}/bin/configManta.py. Running this script with
no arguments will display all standard configuration options for
specifying the input alignment files, the reference sequence and the output run
folder. Note that all input alignment (BAM or CRAM) files and the reference sequence must
contain the same chromosome names in the same order. In addition, all
input alignment files and reference sequences must be indexed with
samtools (or a utility which creates equivalent index
files). Manta's default settings assume a whole genome DNA-Seq
analysis, but there are configuration options for exome/targeted
sequencing analysis in addition to RNA-Seq.
On completion, the configuration script will create the workflow run
script ${MANTA_ANALYSIS_PATH}/runWorkflow.py. This can be used to
run the workflow in various parallel compute modes per the
instructions in the Execution section below.
Single Diploid Sample Analysis -- Example Configuration:
${MANTA_INSTALL_PATH}/bin/configManta.py \
--bam NA12878_S1.bam \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}
Joint Diploid Sample Analysis -- Example Configuration:
${MANTA_INSTALL_PATH}/bin/configManta.py \
--bam NA12878_S1.cram \
--bam NA12891_S1.cram \
--bam NA12892_S1.cram \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}
Tumor Normal Analysis -- Example Configuration:
${MANTA_INSTALL_PATH}/bin/configManta.py \
--normalBam HCC1187BL.cram \
--tumorBam HCC1187C.cram \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}
Tumor-Only Analysis -- Example Configuration:
${MANTA_INSTALL_PATH}/bin/configManta.py \
--tumorBam HCC1187C.cram \
--referenceFasta hg19.fa \
--runDir ${MANTA_ANALYSIS_PATH}
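When the configuration step is driven from a script, the four examples above differ only in which alignment flags are passed. The following sketch builds the argument list without running anything; it uses only the flags shown in the examples, and the function name `config_manta_argv` is hypothetical.

```python
from typing import Optional, Sequence


def config_manta_argv(
    install_path: str,
    run_dir: str,
    reference_fasta: str,
    bams: Sequence[str] = (),
    normal_bam: Optional[str] = None,
    tumor_bam: Optional[str] = None,
) -> list[str]:
    """Build a configManta.py command line mirroring the examples above."""
    argv = [f"{install_path}/bin/configManta.py"]
    for bam in bams:  # one --bam per sample in a (joint) diploid analysis
        argv += ["--bam", bam]
    if normal_bam:
        argv += ["--normalBam", normal_bam]
    if tumor_bam:
        argv += ["--tumorBam", tumor_bam]
    if len(argv) == 1:
        raise ValueError("at least one alignment file is required")
    argv += ["--referenceFasta", reference_fasta, "--runDir", run_dir]
    return argv
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems with unusual file names.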
Manta calls the entire genome by default; however, variant calling may be restricted to
an arbitrary subset of the genome by providing a region file in BED format with
the --callRegions configuration option. The BED file must be bgzip-compressed and tabix-indexed,
and only one such BED file may be specified. When specified, all VCF output is restricted to
the provided call regions only; however, statistics derived from the input data
(such as expected fragment size distribution) will not be restricted to the call regions.
It is not recommended to set up a large number of call regions because this may reduce Manta's efficiency in segmenting and processing the genome.
Note in particular that even when --callRegions is specified,
the --exome flag is still required for exome or targeted data
to get appropriate depth filtration behavior for non-WGS cases.
There are two sources of advanced configuration options:
- Options listed in the file: ${MANTA_INSTALL_PATH}/bin/configManta.py.ini
  - These parameters are not expected to change frequently. Changing the file
    listed above will re-configure all Manta runs for the installation. To change
    parameters for a single run, copy the configManta.py.ini file to another location,
    change the desired parameter values and supply the new file using the configuration
    script's --config FILE option.
- Advanced options listed in: ${MANTA_INSTALL_PATH}/bin/configManta.py --allHelp
  - These options are intended primarily for workflow development and debugging, but could be useful for runtime optimization in some specialized cases.
Using the --generateEvidenceBam option, Manta can be configured to generate bam files of evidence reads for SVs listed in the candidate vcf file.
It is recommended to use this option together with the --region option, so that the analysis is limited to relatively small genomic regions for debugging purposes.
The evidence bam files are provided in ${MANTA_ANALYSIS_PATH}/results/evidence, with a naming format evidence_*.*.bam.
There is one such file for each input bam of the analysis, containing evidence reads of the candidate SVs identified from that input bam.
Each read in an evidence bam keeps all information from the original bam, and it also contains a customized tag in the format: ZM:Z:${MANTA_SV_ID_1}|${EVIDENCE_TYPE},${MANTA_SV_ID_2}|${EVIDENCE_TYPE}. For example, ZM:Z:MantaINV:5:0:1:0:0:0|PR|SRM,MantaDEL:5:1:2:0:0:0|SR
- One read can have more than one of the three evidence types: PR for paired reads, SR for split reads, and SRM for split read mates.
- One read can be evidence for multiple SVs, which are separated by commas in the tag.
Notice that the number of evidence reads for a particular SV in the evidence bam files could be more than the evidence counts (PR and SR) in the final vcf files. This is because more stringent criteria are applied for generating evidence counts in the final vcf files.
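The ZM tag can be unpacked by splitting on commas first and on the pipe character second, since the SV IDs themselves contain colons. The following is a minimal sketch based only on the format shown above; the function name `parse_zm_tag` is hypothetical.

```python
def parse_zm_tag(tag: str) -> dict[str, list[str]]:
    """Map each SV ID in a ZM:Z tag to its evidence types (PR, SR, SRM)."""
    prefix = "ZM:Z:"
    if not tag.startswith(prefix):
        raise ValueError("not a ZM:Z tag")
    evidence = {}
    for entry in tag[len(prefix):].split(","):  # one entry per SV
        sv_id, *types = entry.split("|")        # SV ID, then its evidence types
        evidence[sv_id] = types
    return evidence
```

Applied to the example tag above, this gives `["PR", "SRM"]` for `MantaINV:5:0:1:0:0:0` and `["SR"]` for `MantaDEL:5:1:2:0:0:0`.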
Using the --outputContig option, Manta can be configured to output assembled contig sequences
in the final VCF files.
The contig sequence of each precise SV will be provided in the INFO field CONTIG.
The configuration step creates a new workflow run script in the requested run directory:
${MANTA_ANALYSIS_PATH}/runWorkflow.py
This script is used to control parallel execution of Manta via the pyFlow task engine on a single compute node.
A running workflow can be interrupted at any time and resumed where it left off.
For a full list of execution options, see:
${MANTA_ANALYSIS_PATH}/runWorkflow.py -h
Example execution on a single node:
${MANTA_ANALYSIS_PATH}/runWorkflow.py -j 8
These options are useful for Manta development and debugging:
- Stderr logging can be disabled with the --quiet argument. Note this log is replicated to ${MANTA_ANALYSIS_PATH}/workspace/pyflow.data/logs/pyflow_log.txt so there is no loss of log information.
- The --rescore option can be provided to force the workflow to re-execute candidate discovery and scoring, but not the initial graph generation steps.
- The --generateEvidenceBam option can be used to generate bam files of evidence reads for SVs listed in the candidate vcf file. (More details are given in the description of evidence bam generation above.)
For both germline and somatic analysis, Manta may have runtime issues while attempting to process the large number of small decoys and unplaced/unlocalized contigs found in GRCh38 and other reference genomes. Until those issues can be resolved, runtime can be improved for such cases by excluding smaller contigs from analysis. This can be done in Manta by creating a bed file of all the chromosomes that should be included in the analysis, and providing it as an argument to the call regions configuration option. For instance, the following bed file could be provided for GRCh38 to exclude all decoys and small contigs:
chr1 0 248956422
chr2 0 242193529
chr3 0 198295559
chr4 0 190214555
chr5 0 181538259
chr6 0 170805979
chr7 0 159345973
chr8 0 145138636
chr9 0 138394717
chr10 0 133797422
chr11 0 135086622
chr12 0 133275309
chr13 0 114364328
chr14 0 107043718
chr15 0 101991189
chr16 0 90338345
chr17 0 83257441
chr18 0 80373285
chr19 0 58617616
chr20 0 64444167
chr21 0 46709983
chr22 0 50818468
chrX 0 156040895
chrY 0 57227415
chrM 0 16569
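A bed file like the one above does not need to be typed by hand. The following Python sketch derives it from the lines of a samtools-style .fai index of the reference, assuming the first two tab-separated columns are contig name and length. The regular expression for "primary" chromosome names, the function name, and the two small contig lines in the example are assumptions for illustration; adapt them to the naming scheme of your reference.

```python
import re

# Assumed naming scheme for primary chromosomes: chr1..chr22, chrX, chrY, chrM.
PRIMARY = re.compile(r"chr([1-9]|1[0-9]|2[0-2]|X|Y|M)")


def primary_bed_lines(fai_lines):
    """Turn .fai index lines into BED lines covering primary chromosomes only."""
    bed = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if PRIMARY.fullmatch(name):
            bed.append(f"{name}\t0\t{length}")
    return bed


# Hand-written index excerpt. The chr1 and chrM lengths match the bed file
# above; the other two lines and all offset columns are placeholders.
example_fai = [
    "chr1\t248956422\t6\t60\t61",
    "chr1_example_random\t1000\t100\t60\t61",
    "chrUn_example\t2000\t200\t60\t61",
    "chrM\t16569\t300\t60\t61",
]
print("\n".join(primary_bed_lines(example_fai)))
```

In practice the input would be read from the reference's .fai file and the result written to the bed file passed to the call regions option. Check Manta's configuration help for any compression or indexing requirements on that file.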
Supplying the --exome flag at configuration time will provide appropriate settings for WES and other regional enrichment analyses. At present this flag disables all high depth filters, which are designed to exclude pericentromeric reference compressions in the WGS case but cannot be applied correctly to a targeted analysis.
For small targeted regions, it may also be helpful to consider the high sensitivity calling documentation below.
Manta supports SV calling from a tumor sample alone. The tumor-only mode is triggered by supplying a tumor sample alignment file but no alignments for the normal sample. The results are reported in tumorSV.vcf.gz. This file contains all SV candidates (similar to the candidateSV.vcf.gz file), but also includes paired and split read evidence for each allele and a subset of the filters used for the tumor-normal comparative analysis.
Note that Manta does not yet provide a quality scoring model for unpaired tumor sample analysis. Users interested in selecting higher-precision subsets of the unpaired tumor calls may consider selection based on the counts of paired and split reads supporting each allele (SAMPLE/PR and SAMPLE/SR respectively). Split-read counts are not always available because some calls are imprecise, so the paired-read count (SAMPLE/PR) can serve as a simple starting point for filtration. Considering the split and paired read support counts together gives a more accurate filter. The status of a call's IMPRECISE flag may also be a strong indicator of its reliability.
For example, in the unpaired tumor analysis output below, the records could be filtered to only include those with SAMPLE/PR[1] >= 15 || SAMPLE/SR[1] >= 15. This would remove the deletion record, because the paired-read count for the deletion allele is 13 and the split-read count is not known. The two translocation breakends would not be filtered because they have split-read counts of 15 and 19, respectively, supporting the breakend allele:
11 94975747 MantaBND:0:2:3:0:0:0:1 G G]8:107653520] . PASS SVTYPE=BND;MATEID=MantaBND:0:2:3:0:0:0:0;CIPOS=0,2;HOMLEN=2;HOMSEQ=TT;BND_DEPTH=216;MATE_BND_DEPTH=735 PR:SR 722,9:463,15
11 94975753 MantaDEL:0:1:2:0:0:0 T <DEL> . PASS END=94987865;SVTYPE=DEL;SVLEN=12112;IMPRECISE;CIPOS=-156,156;CIEND=-150,150 PR 161,13
11 94987872 MantaBND:0:0:1:0:0:0:0 T T[8:107653411[ . PASS SVTYPE=BND;MATEID=MantaBND:0:0:1:0:0:0:1;BND_DEPTH=171;MATE_BND_DEPTH=830 PR:SR 489,4:520,19
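The filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than a Manta tool: the function names are hypothetical, the records are the three shown above, and columns are split on whitespace because that is how they appear here (real VCF files are tab-separated).

```python
def alt_support(record):
    """Return (PR, SR) alternate-allele counts for a single-sample record.

    A count is None when the FORMAT column does not list that key.
    """
    columns = record.split()
    sample = dict(zip(columns[8].split(":"), columns[9].split(":")))

    def alt_count(key):
        if key not in sample:
            return None
        # Each value is "ref,alt"; index 1 is the alternate (SV) allele.
        return int(sample[key].split(",")[1])

    return alt_count("PR"), alt_count("SR")


def keep(record, minimum=15):
    """Apply SAMPLE/PR[1] >= minimum || SAMPLE/SR[1] >= minimum."""
    return any(c is not None and c >= minimum for c in alt_support(record))


records = [
    "11 94975747 MantaBND:0:2:3:0:0:0:1 G G]8:107653520] . PASS "
    "SVTYPE=BND;MATEID=MantaBND:0:2:3:0:0:0:0;CIPOS=0,2;HOMLEN=2;HOMSEQ=TT;"
    "BND_DEPTH=216;MATE_BND_DEPTH=735 PR:SR 722,9:463,15",
    "11 94975753 MantaDEL:0:1:2:0:0:0 T <DEL> . PASS "
    "END=94987865;SVTYPE=DEL;SVLEN=12112;IMPRECISE;CIPOS=-156,156;"
    "CIEND=-150,150 PR 161,13",
    "11 94987872 MantaBND:0:0:1:0:0:0:0 T T[8:107653411[ . PASS "
    "SVTYPE=BND;MATEID=MantaBND:0:0:1:0:0:0:1;BND_DEPTH=171;"
    "MATE_BND_DEPTH=830 PR:SR 489,4:520,19",
]

kept_ids = [r.split()[2] for r in records if keep(r)]
print(kept_ids)
```

Treating a missing SR value as "not supporting" mirrors the reasoning in the text: the deletion is dropped on its paired-read count alone.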
For low allele frequency variants, it may also be helpful to consider the high sensitivity calling documentation below.
Supplying the '--rna' flag at configuration time will provide experimental settings for RNA-Seq fusion calling. At present this flag disables all high depth filters, which are designed to exclude pericentromeric reference compressions in the WGS case but cannot be applied correctly to RNA-Seq analysis. In addition, many custom RNA read processing and alignment steps are invoked. This mode is designed to function as part of a larger workflow, with additional steps downstream of Manta's fusion calling step to reduce the overall false positive rate.
When RNA mode is turned on, exactly one sample must be specified as normal input only (using either the --bam or --normalBam option).
RNA Fusions are reported in rnaSV.vcf.gz in translocation format. Smaller variants are not reported.
It may also be helpful to consider the high sensitivity calling documentation below for this mode.
Manta is configured with a discovery sensitivity appropriate for general WGS applications. In targeted or other specialized contexts the candidate sensitivity can be increased. A recommended general high sensitivity mode can be obtained by changing the two values 'minEdgeObservations' and 'minCandidateSpanningCount' in the Manta configuration file (see 'Advanced configuration options' above) to 2 observations per candidate (the default is 3):
# Remove all edges from the graph unless they're supported by this many 'observations'.
# Note that one supporting read pair or split read usually equals one observation, but
# evidence is sometimes downweighted.
minEdgeObservations = 2
# Run discovery and candidate reporting for all SVs/indels with at least this
# many spanning support observations
minCandidateSpanningCount = 2
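If the change needs to be scripted, the two values can be rewritten in a copy of the configuration text. The Python sketch below is one way to do so under the assumption that each option appears on its own line as `name = value`; the function names are hypothetical and the example text contains only the two options discussed here, with the default of 3 stated above.

```python
import re


def set_option(config_text, name, value):
    """Replace the value on every `name = ...` line, leaving comments untouched."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", config_text)
    if count == 0:
        raise KeyError(f"option not found: {name}")
    return new_text


def high_sensitivity(config_text, observations=2):
    """Lower both discovery thresholds to the given number of observations."""
    for name in ("minEdgeObservations", "minCandidateSpanningCount"):
        config_text = set_option(config_text, name, observations)
    return config_text


# Excerpt with the default values, for illustration only.
default_text = (
    "# Remove all edges from the graph unless they're supported by this many 'observations'.\n"
    "minEdgeObservations = 3\n"
    "# Run discovery and candidate reporting for all SVs/indels with at least this\n"
    "# many spanning support observations\n"
    "minCandidateSpanningCount = 3\n"
)
updated = high_sensitivity(default_text)
print(updated)
```

Working on a copy keeps the installed defaults intact for other analyses.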
Manta can be used for de novo calling, following a two-step procedure:
- Manta can take multiple input bams for the normal sample, each bam file being treated as a separate sample as part of a joint diploid sample analysis, and then output a multi-sample vcf, where the sample order follows that of the input bams.
- A post-processing script, provided as $MANTA_INSTALL_FOLDER/libexec/denovo_scoring.py, can be applied to the multi-sample vcf and detects SVs that have inheritance conflicts among a trio sample set.
The script is run as denovo_scoring.py. It will ignore any samples in the vcf that are not specified on the command line.
In the same folder as the input vcf, the script outputs a new vcf file and a text file of stats for the de novo calls. Currently, all SVs with inheritance conflicts are labeled with "DQ=60" in the INFO field, while all SVs without any conflict are labeled with "DQ=0".
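To pull the conflicting calls out of the scored vcf, the DQ label can be read from the INFO column. The Python sketch below assumes tab-separated records with DQ stored as an INFO key, as described above; the function names and the two example records are hypothetical and do not come from a real run.

```python
def denovo_quality(vcf_line):
    """Return the integer DQ value from the INFO column, or None if missing."""
    info_column = vcf_line.rstrip("\n").split("\t")[7]
    for entry in info_column.split(";"):
        if entry.startswith("DQ="):
            return int(entry[3:])
    return None


def conflicting_records(vcf_lines, conflict_value=60):
    """Keep data lines whose DQ equals the value used for inheritance conflicts."""
    return [
        line for line in vcf_lines
        if not line.startswith("#") and denovo_quality(line) == conflict_value
    ]


# Hypothetical records, for illustration only.
example_lines = [
    "##fileformat=VCFv4.1",
    "\t".join(["1", "100", "sv_a", "A", "<DEL>", ".", "PASS", "END=500;SVTYPE=DEL;DQ=60"]),
    "\t".join(["1", "900", "sv_b", "T", "<DUP>", ".", "PASS", "END=1500;SVTYPE=DUP;DQ=0"]),
]
conflict_ids = [line.split("\t")[2] for line in conflicting_records(example_lines)]
print(conflict_ids)
```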