forked from xfengnefx/hifiasm-meta
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhifiasm_meta.1
270 lines (216 loc) · 7.9 KB
/
hifiasm_meta.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
.TH hifiasm_meta 1 "Jan 2023" "hifiasm_meta-0.3" "Bioinformatics tools"
.SH NAME
.PP
hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.
.SH SYNOPSIS
* Assemble HiFi reads:
.RS 4
.B hifiasm_meta
.RB [ -o
.IR prefix ]
.RB [ -t
.IR nThreads ]
.RB [ -z
.IR endTrimLen ]
.RB [ -S
.IR enableReadSelection
]
.RB [ -B
.IR useBinFilesFromOtherDir
]
.R [options]
.I input1.fq
.RI [ input2.fq
.R [...]]
.RE
.SH DESCRIPTION
.PP
Hifiasm is an ultrafast haplotype-resolved de novo assembler for PacBio
Hifi reads. Unlike most existing assemblers, hifiasm starts from uncollapsed
genome. Thus, it is able to keep the haplotype information as much as possible.
The input of hifiasm is the PacBio Hifi reads in fasta/fastq format, and its
outputs consist of multiple types of assembly graph in GFA format.
Hifiasm_meta is a fork of hifiasm. It comes with a read selection module, which
enables the assembly of dataset of high redundancy without compromising overall
assembly quality, and meta-centric graph cleaning modules.
In post-assembly stage, hifiasm\_meta traverses the primay assembly graph
and try to rescue some genome bins that would be overlooked by traditional binners.
Currently hifiasm_meta does not take bining info.
.SH SPECIAL NOTES
.PP
Based on the limited available test data, real datasets are unlikely to require
read selection; mock datasets, however, might need it.
Bin file is somewhat compatible with the stable hifiasm.
Stable hifiasm can use hifiasm_meta's bin file (tested for v0.13).
Hifiasm_meta can use hifiasm's bin files by specifying
--use-ha-bin PREFIX (ignores some information; added for testing purposes).
.SH OPTIONS
.SS General options
.TP 10
.BI -o \ FILE
Prefix of output files [hifiasm_meta.asm]. For detailed description of all assembly
graphs, please see the
.B OUTPUTS
section of this man-page.
.TP
.BI -B
Use bin files under a different prefix than the one specified by -o.
.TP 10
.BI -t \ INT
Number of CPU threads used by hifiasm_meta [1].
.TP
.BI -h
Show help information and exit. Returns 0.
.TP
.BI --version
Show version number and exit. Returns 0.
.SS Read selection options
.TP 10
.BI -S
Enable read selection (default is disabled).
If enabled, hifiasm_meta will estimate the total number of read overlaps.
If the estimation seems within acceptable, no read will be dropped; otherwise,
reads will be dropped from the most redundant ones until the criteria
are satisfied.
.TP
.BI --force-rs
Force read selection.
Hifiasm_meta will not estimate the total number of read overlaps,
and drops reads according to the runtime kmer
frequency thresholds described below.
(--force-preovec is an alias of this option.)
.TP
.BI --lowq-10 \ INT
Runtime 10% quantile kmer frequency threshold used for read selection [50].
Lower value means less reads kept (if read selection is triggered).
.TP
.BI --lowq-5 \ INT
Runtime 5% quantile kmer frequency threshold used for read selection [50].
Lower value means less reads kept (if read selection is triggered).
.TP
.BI --lowq-3 \ INT
Runtime 3% quantile kmer frequency threshold used for read selection [-1, disabled].
Lower value means less reads kept (if read selection is triggered).
.SS Error correction options
.TP 10
.BI -k \ INT
K-mer length [51]. This option must be less than 64.
.TP
.BI -w \ INT
Minimizer window size [51].
.TP
.BI -f \ INT
Number of bits for bloom filter; 0 to disable [37]. This bloom filter is used
to filter out singleton k-mers when counting all k-mers. It takes
.RI 2^( INT -3)
bytes of memory.
.TP
.BI -r \ INT
Rounds of haplotype-aware error corrections [3]. This option affects all outputs of hifiasm_meta.
.TP
.BI --min-hist-cnt \ INT
When analyzing the k-mer spectrum, ignore counts below
.IR INT .
.SS Assembly options
.TP
.BI -z \ INT
Length of adapters that should be removed [0]. This option remove
.I INT
bases from both ends of each read.
Some old Hifi reads may consist of
short adapters (e.g., 20bp adapter at one end). For such data, trimming short adapters would
significantly improve the assembly quality.
.TP
.BI -i
Ignore error corrected reads and overlaps saved in
.IR prefix .*.bin
files.
Apart from assembly graphs, hifiasm_meta also outputs three binary files
that save all overlap information during assembly step.
With these files, hifiasm_meta can avoid the time-consuming all-to-all overlap calculation step,
and do the assembly directly and quickly.
This might be helpful when users want to get an optimized assembly by multiple rounds of experiments
with different parameters.
.SS Debugging options
.TP 10
.B --dbg-gfa
Use extra bin files to speed up the debugging of graph cleaning.
If set and the extra bin files do not already exist,
assembly runs normally (i.e. from scratch or resume from bin files)
and writes the extra bin files.
If set and bin files as well as extra bin files are present,
assembly will resume from raw unitig graph stage.
.TP
.BI --dump-all-ovlp
Hifiasm(_meta) only keeps a limited number of overlap targets for a read (specified by -N).
Overlaps discarded here will not show up in the --write-paf dump.
They are irrelevant in most use cases, but if the user is interested,
this switch can write all overlaps ever calculated in the final round of overlapping calculation
to a paf file named
.IR prefix .all_ovlp_pairs
(paf format).
.TP
.BI --write-paf
Dump overlaps, produces 2 files, one contains the intra-haplotype or unphased overlaps,
the other contains inter-haplotype overlaps. If coverage is very high,
this might not be the full set of overlaps.
.TP
.BI --write-ec
Dump error-corrected reads.
.TP
.BI -e
Ban assembly, i.e. terminate before generating string graph.
.SH OUTPUTS
.PP
Hifiasm_meta generates the following assembly graphs in the GFA format
and topology-based genome bin rescuing in FASTA format:
.RS 2
.TP 2
*
.IR prefix .r_utg.gfa:
the raw unitig graph. This graph keeps all haplotype information.
.TP
*
.IR prefix .p_utg.gfa:
processed unitig graph without small bubbles and tips.
.TP
*
.IR prefix .p_ctg.gfa:
assembly graph of primary contigs. This graph collapses different haplotypes.
.TP
*
.IR prefix .a_ctg.gfa:
assembly graph of alternate contigs. This graph consists of contigs that
are discarded in primary contig graph.
.TP
*
.IR prefix .rescue.fa:
circular paths that might be valid genome bins.
They are deduplicated at 5% mash distance.
They are error-prone and needs to be checked for completeness and contamination.
Contig names are stored in comments and should refer to the primary assembly.
.IR prefix .rescue.all.fa:
circular paths that might be valid genome bins,
but without the deduplication.
.RE
.PP
For each graph, hifiasm(_meta) also outputs a simplified version without sequences for
the ease of visualization ("noseq" in the file names).
Hifiasm(_meta) keeps corrected reads and overlaps in three
binary files such as it can regenerate assembly graphs from the binary files
without redoing error correction.
Hifiasm_meta's contig graphs contain the following subgraph information:
1) Contig names are `^s[0-9]+\.[uc]tg[0-9]{6}[lc]` where the `s[0-9]+` is
a disconnected subgraph label of the contig.
It might be useful to be able to quickly checking whether two contigs are
in the same disconnected subgraph (i.e. haplotype that wasn't assembled
in to a single contig, tangled haplotypes).
2) A trail of subgraph labels is stored in a custom gfa tag
ts:B:I (uint32_t array, comma seperated). This can be used to quickly check
if two unconnected contigs or subgraphs were once connected during graph cleaning,
since for metagenome assemblies, the alterantive graph and unitigs are more ambiguous
the assembly of a single organism. The representation, for example,
say we have 3 graphs, A, B and C. A is splitted into two graphs, A1 and A2;
A2 is further splicted into A2a, A2b and A2c.
Then A1 may be labelled as 0.0, A2a as 0.1.0, A2b as 0.1.1, A2c as 0.1.2,
B as 1, and C as 2 (integers are arbitrary but unique).