usage: MCMC haplotype assembly [-h] [--region REGION] [--region-id REGION_ID]
[--targets TARGETS] [--variants VARIANTS]
[--bam BAM [BAM ...]] [--ploidy PLOIDY]
[--inbreeding INBREEDING]
[--sample-pool SAMPLE_POOL]
[--reference REFERENCE]
[--base-error-rate BASE_ERROR_RATE]
[--use-base-phred-scores]
[--mapping-quality MAPPING_QUALITY]
[--keep-duplicate-reads] [--keep-qcfail-reads]
[--keep-supplementary-reads]
[--read-group-field READ_GROUP_FIELD]
[--report [REPORT ...]] [--cores CORES]
[--mcmc-chains MCMC_CHAINS]
[--mcmc-steps MCMC_STEPS]
[--mcmc-burn MCMC_BURN] [--mcmc-seed MCMC_SEED]
[--mcmc-chain-incongruence-threshold MCMC_CHAIN_INCONGRUENCE_THRESHOLD]
[--mcmc-fix-homozygous MCMC_FIX_HOMOZYGOUS]
[--mcmc-llk-cache-threshold MCMC_LLK_CACHE_THRESHOLD]
[--mcmc-recombination-step-probability MCMC_RECOMBINATION_STEP_PROBABILITY]
[--mcmc-dosage-step-probability MCMC_DOSAGE_STEP_PROBABILITY]
[--mcmc-partial-dosage-step-probability MCMC_PARTIAL_DOSAGE_STEP_PROBABILITY]
[--mcmc-temperatures [MCMC_TEMPERATURES ...]]
[--haplotype-posterior-threshold HAPLOTYPE_POSTERIOR_THRESHOLD]
options:
-h, --help show this help message and exit
--region REGION Specify a single target region with the format
contig:start-stop. This region will be a single
variant in the output VCF. This argument can not be
combined with the --targets argument.
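
As an illustration of the contig:start-stop format, here is a minimal Python sketch that splits such a string into its parts. The `Region` type and `parse_region` function are hypothetical helpers, not part of the tool, and the sketch makes no assumption about whether coordinates are 0-based or 1-based, since the help text does not say.

```python
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    start: int
    stop: int


def parse_region(text: str) -> Region:
    """Split a 'contig:start-stop' string into contig, start and stop."""
    # rpartition is used so that a contig name containing ':' still parses.
    contig, sep, interval = text.rpartition(":")
    if not sep or not contig:
        raise ValueError(f"expected contig:start-stop, got {text!r}")
    start_text, dash, stop_text = interval.partition("-")
    if not dash:
        raise ValueError(f"expected start-stop after ':', got {interval!r}")
    start, stop = int(start_text), int(stop_text)
    if start < 0 or stop < start:
        raise ValueError(f"invalid interval {start}-{stop}")
    return Region(contig, start, stop)
```

For example, `parse_region("chr1:100-200")` yields a `Region` with contig `chr1`, start 100 and stop 200.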
--region-id REGION_ID
Specify an identifier for the locus specified with the
--region argument. This id will be reported in the
output VCF.
--targets TARGETS Bed file containing multiple genomic intervals for
haplotype assembly. First three columns (contig,
start, stop) are mandatory. If present, the fourth
column (id) will be used as the variant id in the
output VCF. This argument can not be combined with the
--region argument.
--variants VARIANTS Tabix indexed VCF file containing SNP variants to be
used in assembly. Assembled haplotypes will only
contain the reference and alternate alleles specified
within this file.
--bam BAM [BAM ...] Bam file(s) to use in analysis. This may be (1) a list
of one or more bam filepaths, (2) a plain-text file
containing a single bam filepath on each line, (3) a
plain-text file containing a sample identifier and its
corresponding bam filepath on each line separated by a
tab. If options (1) or (2) are used then all samples
within each bam will be used within the analysis. If
option (3) is used then only the specified sample will
be extracted from each bam file and an error will be
raised if a sample is not found within its specified
bam file.
--ploidy PLOIDY Specify sample ploidy (default = 2). This may be (1) a
single integer used to specify the ploidy of all
samples or (2) a file containing a list of all samples
and their ploidy. If option (2) is used then each line
of the plaintext file must contain a single sample
identifier and the ploidy of that sample separated by
a tab.
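
The per-sample file described in option (2) can be sketched as follows. This is only an illustration of the layout (one sample identifier and one value per line, separated by a tab); the function name `write_sample_table` and the sample names are made up for the example.

```python
import csv
import io


def write_sample_table(values, handle):
    """Write one 'sample<TAB>value' line per sample to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample, value in values.items():
        writer.writerow([sample, value])


# Hypothetical samples: one diploid and one tetraploid.
buffer = io.StringIO()
write_sample_table({"sampleA": 2, "sampleB": 4}, buffer)
ploidy_text = buffer.getvalue()
```

In practice the handle would be a file opened for writing, and the resulting path would be passed to --ploidy in place of a single integer.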
--inbreeding INBREEDING
Specify expected sample inbreeding coefficient
(default = 0.0). This may be (1) a single floating
point value in the interval [0, 1] used to specify the
inbreeding coefficient of all samples or (2) a file
containing a list of all samples and their inbreeding
coefficient. If option (2) is used then each line of
the plaintext file must contain a single sample
identifier and the inbreeding coefficient of that
sample separated by a tab.
--sample-pool SAMPLE_POOL
WARNING: this is an experimental feature!!! Pool
samples together into a single genotype. This may be
(1) the name of a single pool for all samples or (2) a
file containing a list of all samples and their
assigned pool. If option (2) is used then each line of
the plaintext file must contain a single sample
identifier and the name of a pool separated by a
tab. Samples may be assigned to multiple pools by using
the same sample name on multiple lines. Each pool will
be treated as a single genotype by combining all reads
from its constituent samples. Note that the pool names
should be used in place of the sample names when
assigning other per-sample parameters such as ploidy
or inbreeding coefficients.
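
To make the pool file layout concrete, the following Python sketch groups samples by pool from lines of the form sample, tab, pool. It is an assumption-laden illustration, not the tool's own parser: `read_pools` and the sample and pool names are hypothetical.

```python
from collections import defaultdict


def read_pools(lines):
    """Map each pool name to the list of samples assigned to it."""
    pools = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        sample, pool = line.split("\t")
        # A sample listed on several lines ends up in several pools.
        pools[pool].append(sample)
    return dict(pools)


example = ["s1\tpoolA\n", "s2\tpoolA\n", "s1\tpoolB\n"]
pools = read_pools(example)
```

Here `s1` appears on two lines and is therefore a member of both `poolA` and `poolB`.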
--reference REFERENCE
Indexed fasta file containing the reference genome.
--base-error-rate BASE_ERROR_RATE
Expected base error rate of read sequences (default =
0.0024). The default value comes from Pfeiffer et al.
2018 and is a general estimate for Illumina short
reads.
--use-base-phred-scores
Flag: use base phred-scores as a source of base error
rate. This will use the phred-encoded per base scores
in addition to the general error rate specified by the
--base-error-rate argument. Using this option can slow
down assembly speed.
--mapping-quality MAPPING_QUALITY
Minimum mapping quality of reads used in assembly
(default = 20).
--keep-duplicate-reads
Flag: Use reads marked as duplicates in the assembly
(these are skipped by default).
--keep-qcfail-reads Flag: Use reads marked as qcfail in the assembly
(these are skipped by default).
--keep-supplementary-reads
Flag: Use reads marked as supplementary in the
assembly (these are skipped by default).
--read-group-field READ_GROUP_FIELD
Read group field to use as sample id (default = "SM").
The chosen field determines the sample ids required in
other input files e.g. the --sample-list argument.
--report [REPORT ...]
Extra fields to report within the output VCF. The
INFO/FORMAT prefix may be omitted to return both
variations of the named field. Options include:
INFO/AFPRIOR = Prior allele frequencies; INFO/ACP =
Posterior allele counts; INFO/AFP = Posterior mean
allele frequencies; INFO/AOP = Posterior probability
of allele occurring across all samples; INFO/AOPSUM =
Posterior estimate of the number of samples containing
an allele; INFO/SNVDP = Read depth at each SNV
position; FORMAT/ACP = Posterior allele counts;
FORMAT/AFP = Posterior mean allele frequencies;
FORMAT/AOP = Posterior probability of allele occurring;
FORMAT/GP = Genotype posterior probabilities;
FORMAT/GL = Genotype likelihoods; FORMAT/SNVDP = Read
depth at each SNV position.
--cores CORES Number of cpu cores to use (default = 1).
--mcmc-chains MCMC_CHAINS
Number of independent MCMC chains per assembly
(default = 2).
--mcmc-steps MCMC_STEPS
Number of steps to simulate in each MCMC chain
(default = 2000).
--mcmc-burn MCMC_BURN
Number of initial steps to discard from each MCMC
chain (default = 1000).
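
The interaction between chains, steps and burn-in can be illustrated with a small calculation. This sketch assumes that the burn-in steps are counted within the --mcmc-steps total and that the retained steps of all chains are pooled; neither point is stated explicitly in this help text, and `retained_steps` is a hypothetical helper.

```python
def retained_steps(chains=2, steps=2000, burn=1000):
    """Number of post-burn-in steps kept across all chains (assumed pooling)."""
    if not 0 <= burn < steps:
        raise ValueError("burn must be non-negative and smaller than steps")
    return chains * (steps - burn)


default_retained = retained_steps()
```

Under these assumptions the default settings keep 2 * (2000 - 1000) = 2000 steps in total.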
--mcmc-seed MCMC_SEED
Random seed for MCMC (default = 42).
--mcmc-chain-incongruence-threshold MCMC_CHAIN_INCONGRUENCE_THRESHOLD
Posterior probability threshold for identification of
incongruent posterior modes (default = 0.60).
--mcmc-fix-homozygous MCMC_FIX_HOMOZYGOUS
Fix alleles that are homozygous with a probability
greater than or equal to the specified value (default
= 0.999). The probability that a variant is
homozygous in a sample is assessed independently for
each variant prior to MCMC simulation. If an allele is
"fixed" it is not allowed to vary within the MCMC thereby
reducing computational complexity.
--mcmc-llk-cache-threshold MCMC_LLK_CACHE_THRESHOLD
Threshold for determining whether to cache log-
likelihoods during MCMC to improve performance. This
value is computed as ploidy * variants * unique-reads
(default = 100). If set to 0 then log-likelihoods will
be cached for all samples including those with few
observed reads which is inefficient and can slow the
MCMC. If set to -1 then log-likelihood caching will be
disabled for all samples.
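
The caching rule described above can be sketched in Python as follows. The product ploidy * variants * unique-reads and the special values 0 and -1 come from the help text; the use of a greater-than-or-equal comparison is an assumption, and `should_cache_llk` is a hypothetical function name.

```python
def should_cache_llk(ploidy, n_variants, n_unique_reads, threshold=100):
    """Decide whether log-likelihood caching would be enabled for a sample."""
    if threshold == -1:
        return False  # caching disabled for all samples
    # Assumed comparison: cache when the product reaches the threshold.
    return ploidy * n_variants * n_unique_reads >= threshold


small = should_cache_llk(ploidy=2, n_variants=5, n_unique_reads=3)
large = should_cache_llk(ploidy=4, n_variants=10, n_unique_reads=5)
```

With the default threshold, the first sample (product 30) would not be cached and the second (product 200) would be.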
--mcmc-recombination-step-probability MCMC_RECOMBINATION_STEP_PROBABILITY
Probability of performing a recombination sub-step
during each step of the MCMC (default = 0.5).
--mcmc-dosage-step-probability MCMC_DOSAGE_STEP_PROBABILITY
Probability of performing a dosage sub-step during
each step of the MCMC (default = 1.0).
--mcmc-partial-dosage-step-probability MCMC_PARTIAL_DOSAGE_STEP_PROBABILITY
Probability of performing a within-interval dosage
sub-step during each step of the MCMC (default =
0.5).
--mcmc-temperatures [MCMC_TEMPERATURES ...]
Specify inverse-temperatures to use for parallel
tempered chains (default = 1.0 i.e., no tempering).
This may be either (1) a list of floating point values
or (2) a file containing a list of samples with mcmc
inverse-temperatures. If option (2) is used then the
file must contain a single sample per line followed by
a list of tab separated inverse temperatures. The
number of inverse-temperatures may differ between
samples and any samples not included in the list will
default to not using tempering.
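
The per-sample temperatures file in option (2) can be read with a few lines of Python. This is a sketch of the described layout only (sample name followed by tab-separated inverse temperatures); `read_temperatures` and the sample names are hypothetical.

```python
def read_temperatures(lines):
    """Map each sample to its list of inverse temperatures."""
    temperatures = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # ignore blank lines
        temperatures[fields[0]] = [float(x) for x in fields[1:]]
    return temperatures


example = ["s1\t0.5\t1.0\n", "s2\t0.25\t0.5\t1.0\n"]
temps = read_temperatures(example)
# Samples absent from the file fall back to a single chain at 1.0 (no tempering).
s3_temps = temps.get("s3", [1.0])
```

Note that `s1` and `s2` have different numbers of inverse temperatures, which the help text allows.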
--haplotype-posterior-threshold HAPLOTYPE_POSTERIOR_THRESHOLD
Posterior probability required for a haplotype to be
included in the output VCF as an alternative allele.
The posterior probability of each haplotype is
assessed per individual and calculated as the
probability of that haplotype being present with one
or more copies in that individual. A haplotype is
included as an alternate allele if it meets this
posterior probability threshold in at least one
individual. This parameter is the main mechanism to
control the number of alternate alleles in each VCF
record and hence the number of genotypes assessed when
recalculating likelihoods and posterior distributions
(default = 0.20).
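
The selection rule for alternate alleles can be sketched as follows. The input is assumed to be, for each haplotype, a list with one posterior probability of presence per individual; this data structure, the function name `select_alternate_haplotypes` and the haplotype labels are illustrative assumptions, and handling of the reference allele is not covered.

```python
def select_alternate_haplotypes(posteriors, threshold=0.20):
    """Keep haplotypes that meet the threshold in at least one individual."""
    return [
        haplotype
        for haplotype, per_individual in posteriors.items()
        if any(p >= threshold for p in per_individual)
    ]


# Hypothetical posterior probabilities for three haplotypes in two individuals.
posteriors = {
    "hapA": [0.95, 0.10],
    "hapB": [0.05, 0.15],
    "hapC": [0.00, 0.20],
}
selected = select_alternate_haplotypes(posteriors)
```

In this example `hapB` is dropped because it stays below 0.20 in every individual, while `hapC` is kept because it meets the threshold in one individual.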