Skip to content
This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

Falcon assembly question #590

Closed
yuragal opened this issue Nov 8, 2017 · 2 comments
Closed

Falcon assembly question #590

yuragal opened this issue Nov 8, 2017 · 2 comments

Comments

@yuragal
Copy link

yuragal commented Nov 8, 2017

Hello,

I am trying to assemble algal genome with falcon+faclon-unzip. The genome size is thought to be 100-150Mbp according to k-mer analysis of Illumina gDNA reads and assemblies generated by various DBG --> OLC and PacBio de novo pipelines. However, the direct estimate of gDNA content per cell is unknown, so I am not completely sure in this figure. Meanwhile, according to k-mer analysis, genome contains long stretches of duplicated fragments. Some people designate this phenomenon as partial diploidy. Under settings specified below, falcon generates assembly of ~23Mbp (a_ctg_base.fa), which is four times smaller than one obtained by feeding canu with the same input data.

My gut feeling is that the raw coverage might be too low. Distribution of preads' overlaps is long-tailed suggesting considerable amount of repetitive genome content that could confuse falcon on the 2-asm assembly stage. Though, I am quite uncomfortable with moving parameters back and forth to optimize the assembly yield.

Could you please have a look at falcon output to diagnose the problem and suggest some recipe?

raw_read stats
Statistics for all reads of length 500 bases or more

446,612 reads        out of         716,581  ( 62.3%)
4,410,970,471 base pairs   out of   5,919,435,728  ( 74.5%)

        9,876 average read length
        5,405 standard deviation

Base composition: 0.319(A) 0.193(C) 0.192(G) 0.296(T)

Distribution of Read Lengths (Bin size = 1,000)

        Bin:      Count  % Reads  % Bases     Average
    50,000:          1      0.0      0.0       50052
    ...
    26,000:      1,002      1.0      3.1       29658
    ....
    16,000:     11,720     12.1     24.5       20013
    15,000:     14,407     15.3     29.5       19057
    14,000:     18,367     19.4     35.5       18087
    13,000:     22,384     24.4     42.4       17142
    12,000:     27,307     30.5     50.1       16210
    11,000:     32,371     37.8     58.5       15304
    10,000:     36,670     46.0     67.3       14445
     9,000:     36,688     54.2     75.2       13697
     8,000:     31,174     61.2     81.2       13105
     7,000:     28,786     67.6     86.1       12571
     6,000:     28,683     74.1     90.3       12045
     5,000:     27,671     80.3     93.8       11539
     4,000:     26,304     86.1     96.5       11059
     3,000:     23,531     91.4     98.3       10624
     2,000:     19,166     95.7     99.4       10260
     1,000:     13,944     98.8     99.9        9984
         0:      5,246    100.0    100.0        9876
pre_assembly stats
"genome_length": 100000000,
"length_cutoff": 5000,
"preassembled_bases": 2490183741,
"preassembled_coverage": 24.902,
"preassembled_esize": 8608.818,
"preassembled_mean": 5739.84,
"preassembled_n50": 8175,
"preassembled_p95": 13235,
"preassembled_reads": 433842,
"preassembled_seed_fragmentation": 1.387,
"preassembled_seed_truncation": 3454.148,
"preassembled_yield": 0.602,
"raw_bases": 4410970471,
"raw_coverage": 44.11,
"raw_esize": 12835.258,
"raw_mean": 9876.516,
"raw_n50": 12015,
"raw_p95": 19563,
"raw_reads": 446612,
"seed_bases": 4136139020,
"seed_coverage": 41.361,
"seed_esize": 13448.962,
"seed_mean": 11539.89,
"seed_n50": 12406,
"seed_p95": 20442,
"seed_reads": 358421
5' overlaps between preads
# Whole data
# NumSamples = 216996; Min = 0.00; Max = 4570.00
# Mean = 395.276005; Variance = 623048.227716; SD = 789.334041; Median 27.000000
# each ∎ represents a count of 2328
   0.0000 -   457.0000 [174617]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 457.0000 -   914.0000 [  8241]: ∎∎∎
 914.0000 -  1371.0000 [  3817]: ∎
1371.0000 -  1828.0000 [  2776]: ∎
1828.0000 -  2285.0000 [  5952]: ∎∎
2285.0000 -  2742.0000 [ 21576]: ∎∎∎∎∎∎∎∎∎
2742.0000 -  3199.0000 [    15]: 
3199.0000 -  3656.0000 [     1]: 
3656.0000 -  4113.0000 [     0]: 
4113.0000 -  4570.0000 [     1]: 

# Range 0-100 overlaps
# NumSamples = 216996; Min = 0.00; Max = 100.00
# 56135 values outside of min/max
# Mean = 73.497681; Variance = 732788.207790; SD = 856.030495; Median 1320.500000
# each ∎ represents a count of 735
 0.0000 -    10.0000 [ 22532]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
10.0000 -    20.0000 [ 55171]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
20.0000 -    30.0000 [ 43817]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
30.0000 -    40.0000 [ 24043]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
40.0000 -    50.0000 [  6762]: ∎∎∎∎∎∎∎∎∎
50.0000 -    60.0000 [  2711]: ∎∎∎
60.0000 -    70.0000 [  1881]: ∎∎
70.0000 -    80.0000 [  1415]: ∎
80.0000 -    90.0000 [  1277]: ∎
90.0000 -   100.0000 [  1252]: ∎

# Range 100+ overlaps
# NumSamples = 216996; Min = 100.00; Max = 4570.00
# 160754 values outside of min/max
# Mean = 73.497681; Variance = 732788.207790; SD = 856.030495; Median 1320.500000
# each ∎ represents a count of 258
 100.0000 -   547.0000 [ 15438]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 547.0000 -   994.0000 [  7396]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 994.0000 -  1441.0000 [  3681]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1441.0000 -  1888.0000 [  2703]: ∎∎∎∎∎∎∎∎∎∎
1888.0000 -  2335.0000 [  7609]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2335.0000 -  2782.0000 [ 19400]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2782.0000 -  3229.0000 [    14]: 
3229.0000 -  3676.0000 [     0]: 
3676.0000 -  4123.0000 [     0]: 
4123.0000 -  4570.0000 [     1]: 

Distribution of the number of preads' 3'-overlaps is very similar.

fc_run.cfg
[General]
use_tmpdir = true
job_type = PBS
input_fofn = input.fofn
input_type = raw

length_cutoff = 5000
length_cutoff_pr = 5000
genome_size = 100000000
seed_coverage = 30

# Nodes are 2x18 Intel Xeon with 128Gb of RAM per node
job_queue = reserve
sge_option_da   = -l nodes=1:intel:ppn=8,walltime=1:00:00
sge_option_la   = -l nodes=1:intel:ppn=4,walltime=1:00:00
sge_option_cns  = -l nodes=1:intel:ppn=18,walltime=1:00:00
sge_option_pda  = -l nodes=1:intel:ppn=8,walltime=1:00:00
sge_option_pla  = -l nodes=1:intel:ppn=8,walltime=1:00:00
sge_option_fc   = -l nodes=1:intel:ppn=36,walltime=1:00:00

default_concurrent_jobs   = 50

# initial overlap of raw reads
pa_HPCdaligner_option   = -v -B4 -l1000 -s1000 -e.70 -M32 -T8
pa_DBsplit_option   = -x500 -s200

# overlap of preads
ovlp_HPCdaligner_option = -v -B4 -l500  -s500 -e.96 -M32 -T16 -h60
ovlp_DBsplit_option = -x500 -s100

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 18
overlap_filtering_setting = --max_diff 100 --max_cov 200 --min_cov 2 --bestn 10 --n_core 18
a_ctg_base stats
Total length of sequence:       22821554 bp
Total number of sequences:      742
N25 stats:                      25% of total sequence length is contained in the 58 sequences >= 67816 bp
N50 stats:                      50% of total sequence length is contained in the 170 sequences >= 40073 bp
N75 stats:                      75% of total sequence length is contained in the 360 sequences >= 23103 bp
Total GC count:                 8777693 bp
GC %:                           38.46 %

Best, Yuri.

P.S. My question might be similar with #581 and #572. I however decided to open a new issue so that not to interfere with particular discussion threads. Sorry for possible duplicate.

@skingan
Copy link

skingan commented Feb 26, 2018

Hi Yuri,

You are looking at the stats for the associate contigs. How does 'p_ctg.fa' look? This is the primary contig assembly.

Your preassembly stats look ok, no red flags from what I can see, but you are limited in data quality and quantity. Your preassembled yield (60%) is on the low side and your raw and pread length distribution isn't great. The raw coverage is also low, which may explain the low yield.

We usually recommend 50x for an inbred.
Sarah

@skingan skingan closed this as completed Feb 26, 2018
@foala
Copy link

foala commented Feb 15, 2019

Hello,

I am trying to assemble algal genome with falcon+faclon-unzip. The genome size is thought to be 100-150Mbp according to k-mer analysis of Illumina gDNA reads and assemblies generated by various DBG --> OLC and PacBio de novo pipelines. However, the direct estimate of gDNA content per cell is unknown, so I am not completely sure in this figure. Meanwhile, according to k-mer analysis, genome contains long stretches of duplicated fragments. Some people designate this phenomenon as partial diploidy. Under settings specified below, falcon generates assembly of ~23Mbp (a_ctg_base.fa), which is four times smaller than one obtained by feeding canu with the same input data.

My gut feeling is that the raw coverage might be too low. Distribution of preads' overlaps is long-tailed suggesting considerable amount of repetitive genome content that could confuse falcon on the 2-asm assembly stage. Though, I am quite uncomfortable with moving parameters back and forth to optimize the assembly yield.

Could you please have a look at falcon output to diagnose the problem and suggest some recipe?

raw_read stats
pre_assembly stats
5' overlaps between preads

# Whole data
# NumSamples = 216996; Min = 0.00; Max = 4570.00
# Mean = 395.276005; Variance = 623048.227716; SD = 789.334041; Median 27.000000
# each ∎ represents a count of 2328
   0.0000 -   457.0000 [174617]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 457.0000 -   914.0000 [  8241]: ∎∎∎
 914.0000 -  1371.0000 [  3817]: ∎
1371.0000 -  1828.0000 [  2776]: ∎
1828.0000 -  2285.0000 [  5952]: ∎∎
2285.0000 -  2742.0000 [ 21576]: ∎∎∎∎∎∎∎∎∎
2742.0000 -  3199.0000 [    15]: 
3199.0000 -  3656.0000 [     1]: 
3656.0000 -  4113.0000 [     0]: 
4113.0000 -  4570.0000 [     1]: 

# Range 0-100 overlaps
# NumSamples = 216996; Min = 0.00; Max = 100.00
# 56135 values outside of min/max
# Mean = 73.497681; Variance = 732788.207790; SD = 856.030495; Median 1320.500000
# each ∎ represents a count of 735
 0.0000 -    10.0000 [ 22532]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
10.0000 -    20.0000 [ 55171]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
20.0000 -    30.0000 [ 43817]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
30.0000 -    40.0000 [ 24043]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
40.0000 -    50.0000 [  6762]: ∎∎∎∎∎∎∎∎∎
50.0000 -    60.0000 [  2711]: ∎∎∎
60.0000 -    70.0000 [  1881]: ∎∎
70.0000 -    80.0000 [  1415]: ∎
80.0000 -    90.0000 [  1277]: ∎
90.0000 -   100.0000 [  1252]: ∎

# Range 100+ overlaps
# NumSamples = 216996; Min = 100.00; Max = 4570.00
# 160754 values outside of min/max
# Mean = 73.497681; Variance = 732788.207790; SD = 856.030495; Median 1320.500000
# each ∎ represents a count of 258
 100.0000 -   547.0000 [ 15438]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 547.0000 -   994.0000 [  7396]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 994.0000 -  1441.0000 [  3681]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1441.0000 -  1888.0000 [  2703]: ∎∎∎∎∎∎∎∎∎∎
1888.0000 -  2335.0000 [  7609]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2335.0000 -  2782.0000 [ 19400]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2782.0000 -  3229.0000 [    14]: 
3229.0000 -  3676.0000 [     0]: 
3676.0000 -  4123.0000 [     0]: 
4123.0000 -  4570.0000 [     1]: 

Distribution of the number of preads' 3'-overlaps is very similar.

fc_run.cfg
a_ctg_base stats
Best, Yuri.

P.S. My question might be similar with #581 and #572. I however decided to open a new issue so that not to interfere with particular discussion threads. Sorry for possible duplicate.

Hi
May I know how did you generate the 5' overlaps between preads report?
Thank you

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants