Assembling the algae genome #572

Liukvr · 2017-08-17T08:53:58Z

Hello falcon team,

I'm going to assemble an algae genome that is expected to be about 150 Mb and highly repetitive.
Below are some statistics for the reads we collected.

Subreads statistics:

stats for subreads.fasta
sum = 21816661710, n = 2828818, ave = 7712.29, largest = 48520
N50 = 11507, n = 689722
N60 = 9924, n = 893581
N70 = 8269, n = 1133849
N80 = 6473, n = 1430581
N90 = 4309, n = 1837002
N100 = 35, n = 2828818
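For reference, the raw coverage implied by these numbers can be worked out directly. This is only a rough sketch: the 150 Mb genome size is the expected value quoted above, not a measurement, and the variable and function names are just for illustration.

```python
# Rough fold-coverage estimate from the subread statistics above.
total_bases = 21_816_661_710   # "sum" reported for subreads.fasta
genome_size = 150_000_000      # expected genome size (~150 Mb), an estimate


def fold_coverage(bases, genome):
    """Return the average number of times each genome position is sequenced."""
    return bases / genome


coverage = fold_coverage(total_bases, genome_size)
print(f"raw coverage ~ {coverage:.0f}x")  # prints "raw coverage ~ 145x"
```

If the true genome size differs from 150 Mb, the estimate scales inversely with it.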

DBstats raw_reads:

Statistics for all wells in the data set

  2,828,818 reads        out of       2,828,818  (100.0%)

21,816,661,710 base pairs out of 21,816,661,710 (100.0%)

      7,712 average read length
      5,687 standard deviation

Base composition: 0.220(A) 0.279(C) 0.284(G) 0.217(T)

Distribution of Read Lengths (Bin size = 1,000)

    Bin:      Count  % Reads  % Bases     Average
 48,000:          1      0.0      0.0       48520
 47,000:          1      0.0      0.0       48164
 46,000:          1      0.0      0.0       47702
 45,000:          7      0.0      0.0       46117
 44,000:          5      0.0      0.0       45572
 43,000:         10      0.0      0.0       44708
 42,000:         16      0.0      0.0       43826
 41,000:         23      0.0      0.0       42978
 40,000:         42      0.0      0.0       41985
 39,000:         53      0.0      0.0       41159
 38,000:         94      0.0      0.0       40160
 37,000:        137      0.0      0.1       39200
 36,000:        182      0.0      0.1       38327
 35,000:        260      0.0      0.1       37436
 34,000:        375      0.0      0.2       36510
 33,000:        506      0.1      0.3       35613
 32,000:        710      0.1      0.4       34689
 31,000:        931      0.1      0.5       33795
 30,000:      1,330      0.2      0.7       32855
 29,000:      1,803      0.2      0.9       31915
 28,000:      2,344      0.3      1.3       31001
 27,000:      3,144      0.4      1.7       30076
 26,000:      4,131      0.6      2.2       29150
 25,000:      5,464      0.8      2.8       28219
 24,000:      7,395      1.0      3.6       27263
 23,000:      9,614      1.4      4.7       26319
 22,000:     12,382      1.8      5.9       25385
 21,000:     16,150      2.4      7.5       24445
 20,000:     21,366      3.1      9.5       23487
 19,000:     28,058      4.1     12.0       22521
 18,000:     35,931      5.4     15.1       21568
 17,000:     46,240      7.0     18.8       20617
 16,000:     57,901      9.1     23.2       19683
 15,000:     72,576     11.6     28.3       18758
 14,000:     86,776     14.7     34.1       17866
 13,000:    101,242     18.3     40.3       17009
 12,000:    113,090     22.3     46.8       16198
 11,000:    122,460     26.6     53.3       15433
 10,000:    130,514     31.2     59.5       14703
  9,000:    140,848     36.2     65.7       13987
  8,000:    152,060     41.6     71.6       13276
  7,000:    163,508     47.4     77.2       12571
  6,000:    175,299     53.6     82.4       11867
  5,000:    186,911     60.2     87.1       11167
  4,000:    197,868     67.2     91.2       10472
  3,000:    211,463     74.6     94.6        9773
  2,000:    267,993     84.1     97.6        8947
  1,000:    284,422     94.2     99.6        8155
      0:    165,181    100.0    100.0        7712
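The cumulative "% Bases" column above can be turned into an estimate of how much coverage remains if only reads at or above a given length are kept. A minimal sketch, assuming the expected 150 Mb genome size and using four rows copied from the table; the names are illustrative only.

```python
# Coverage retained when keeping only reads at or above a minimum length,
# based on the cumulative "% Bases" column of the DBstats table.
TOTAL_BASES = 21_816_661_710   # total base pairs reported by DBstats
GENOME_SIZE = 150_000_000      # expected genome size (~150 Mb), an estimate

# bin lower bound (bp) -> cumulative % of bases in reads of that bin or longer
cumulative_pct_bases = {
    20_000: 9.5,
    15_000: 28.3,
    10_000: 59.5,
    5_000: 87.1,
}


def coverage_at(min_len):
    """Estimated fold coverage from reads with length >= min_len."""
    fraction = cumulative_pct_bases[min_len] / 100.0
    return fraction * TOTAL_BASES / GENOME_SIZE


for min_len in sorted(cumulative_pct_bases, reverse=True):
    print(f">= {min_len:>6} bp: ~{coverage_at(min_len):.1f}x")
```

The percentages in the table are rounded to one decimal place, so the resulting coverage figures are approximate.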

Read length distribution:

I would like to ask whether the following configuration file is fine, or if you have any suggestions. The server that I'm going to use for the assembly has 32 threads and 512 GB of RAM.

Configuration_file

All recommendations or further questions are welcome.
Thanks in advance

Luca


gconcepcion · 2017-08-24T21:05:17Z

Hi Luca,

The settings you have listed seem reasonable given the coverage and expected genome size you have. The only things I might recommend would be:

In pa_HPCdaligner_option and ovlp_HPCdaligner_option, you may want to raise the k-mer size to -k18; the default is -k14. -k14 should give you better results, but in highly repetitive genomes it can cause you to hit your memory limit, hence the suggestion to raise k from 14 to 18 or so.
In overlap_filter_settings, you may want to lower --min_cov to a smaller number, maybe 4-5. You have enough coverage to use such a high number, so it might not be a big deal, but if you see less contiguity than you expect at the end, it may be worth cutting that number in half.
Consider removing the -a option from pa_DBsplit_option. The -a option includes all subreads from a particular ZMW, whereas omitting '-a' will only include the best subread. If you're not coverage limited (and you clearly are not) then omitting -a should only improve your assembly.
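As a rough illustration, the three changes above could be applied to the option strings like this. The starting values are placeholders, since the actual configuration file is not reproduced in this thread; only the option names, -k18, --min_cov, and -a come from the suggestions above, and the helper functions are hypothetical.

```python
import re

# Placeholder option strings (NOT the poster's real settings).
config = {
    "pa_DBsplit_option": "-a -x500 -s200",
    "pa_HPCdaligner_option": "-v -B4 -e.70 -l1000 -s1000",
    "ovlp_HPCdaligner_option": "-v -B4 -k14 -e.96 -l500 -s1000",
    "overlap_filter_settings": "--max_diff 100 --max_cov 100 --min_cov 10",
}


def set_kmer(options, k):
    """Replace an existing -kN flag, or append one if none is present."""
    pattern = r"(?<!\S)-k\d+"
    if re.search(pattern, options):
        return re.sub(pattern, f"-k{k}", options)
    return f"{options} -k{k}".strip()


def set_min_cov(options, value):
    """Rewrite the numeric argument of --min_cov."""
    return re.sub(r"--min_cov\s+\d+", f"--min_cov {value}", options)


def drop_flag(options, flag):
    """Remove a standalone flag such as -a from an option string."""
    return " ".join(tok for tok in options.split() if tok != flag)


def apply_suggestions(cfg):
    out = dict(cfg)
    for key in ("pa_HPCdaligner_option", "ovlp_HPCdaligner_option"):
        out[key] = set_kmer(out[key], 18)          # larger k to limit memory use
    out["overlap_filter_settings"] = set_min_cov(out["overlap_filter_settings"], 5)
    out["pa_DBsplit_option"] = drop_flag(out["pa_DBsplit_option"], "-a")
    return out


tuned = apply_suggestions(config)
for name, value in tuned.items():
    print(f"{name} = {value}")
```

The same edits can of course be made by hand in the configuration file; the sketch only makes explicit which flag in which option each suggestion touches.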

Hope this helps

Liukvr · 2017-08-28T07:13:49Z

Hi Greg,

Thanks a lot, these are very helpful and practical suggestions.
I will modify the configuration file according to your tips.
Best regards,

Luca

yuragal mentioned this issue Nov 8, 2017

Falcon assembly question #590

Closed
