This repository has been archived by the owner on Mar 16, 2022. It is now read-only.
I'm going to assemble an algae genome that is expected to be about 150 Mb and highly repetitive.
Below are some statistics for the reads we collected.
Subreads statistics:
stats for subreads.fasta
sum = 21816661710, n = 2828818, ave = 7712.29, largest = 48520
N50 = 11507, n = 689722
N60 = 9924, n = 893581
N70 = 8269, n = 1133849
N80 = 6473, n = 1430581
N90 = 4309, n = 1837002
N100 = 35, n = 2828818
DBstats raw_reads:
Statistics for all wells in the data set
2,828,818 reads out of 2,828,818 (100.0%)
21,816,661,710 base pairs out of 21,816,661,710 (100.0%)
7,712 average read length
5,687 standard deviation
Base composition: 0.220(A) 0.279(C) 0.284(G) 0.217(T)
I would like to ask whether the following configuration file is fine, or whether you have any suggestions. The server that I'm going to use for the assembly has 32 threads and 512 GB of RAM.
The settings you have listed seem reasonable given the coverage and expected genome size you have. The only things I might recommend would be:
In pa_HPCdaligner_option and ovlp_HPCdaligner_option, you may want to raise the k-mer size to -k18, as the default is -k14. -k14 should give you better results, but in highly repetitive genomes it can cause you to hit your memory limit, hence the suggestion to raise k from 14 to 18 or so.
In the overlap_filter_settings, you may want to drop --min_cov to a smaller number, maybe 4-5 or so. You have enough coverage to be using such a high number, so it might not be a big deal, but if you see less contiguity than you expect at the end, it may be worth dropping that number by half.
Consider removing the -a option from pa_DBsplit_option. The -a option includes all subreads from a particular ZMW, whereas omitting '-a' will only include the best subread. If you're not coverage limited (and you clearly are not) then omitting -a should only improve your assembly.
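To make the "not coverage limited" point concrete, here is a rough back-of-the-envelope sketch of the raw coverage implied by the numbers in the question. It assumes the expected genome size of 150 Mb is accurate and that 1 Mb = 10^6 bp; the function and variable names are just illustrative and are not part of FALCON.

```python
def estimated_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Return the raw fold-coverage: sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


# Values taken from the subread statistics in the question.
TOTAL_BASES = 21_816_661_710   # "sum" reported for subreads.fasta
NUM_READS = 2_828_818          # "n" reported for subreads.fasta

# Assumption: the expected genome size (~150 Mb) stated in the question.
GENOME_SIZE_BP = 150_000_000

coverage = estimated_coverage(TOTAL_BASES, GENOME_SIZE_BP)
mean_read_length = TOTAL_BASES / NUM_READS

print(f"Estimated raw coverage: {coverage:.0f}x")
print(f"Mean subread length: {mean_read_length:.0f} bp")
```

With roughly 145x of raw subreads, there is plenty of room to keep only the best subread per ZMW and still have ample coverage, assuming the genome size estimate holds.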
Hello falcon team,
Distribution of Read Lengths (Bin size = 1,000)
Reads length distribution:
Configuration_file
All recommendations or further questions are welcome.
Thanks in advance
Luca