How to identify merged contigs in output #74

adadiehl · 2024-07-18T19:08:07Z

I am attempting to use quickmerge in an attempt to merge contigs from a long-read assembly into chromosome-length scaffolds using the hg38 reference genome. Please see the example command below:

merge_wrapper.py GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.fa -hco 5 -c 1.5 -l 500000 -v -t 24

The long-read assembly is from (here)[https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta] and hg38.fa is straight from UCSC.

Looking at the results, I am unable to tell which contigs in the output result from merging operations, so cannot compare to input contigs/chromosomes to ensure the output is correct. How do I identify merged contigs in the output fasta, and how do I tell what editing operations were done to generate the merged contigs? My goal here is to end up with the same number of contigs as hg38, excepting unmapped contigs, which appear to be written to the output unchanged. (Can you verify this is the case?)

Thank you in advance for your help!

mahulchak · 2024-07-19T03:18:12Z

The merged contigs will have names from the 'reference' assembly. Unchanged ones will have names of the 'query' assembly contigs.

…

On Thu, Jul 18, 2024, 2:08 PM Adam Diehl ***@***.***> wrote: I am attempting to use quickmerge in an attempt to merge contigs from a long-read assembly into chromosome-length scaffolds using the hg38 reference genome. Please see the example command below: merge_wrapper.py GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.fa -hco 5 -c 1.5 -l 500000 -v -t 24 The long-read assembly is from (here)[ https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta <https://urldefense.com/v3/__https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5WfOHsf5$>] and hg38.fa is straight from UCSC. Looking at the results, I am unable to tell which contigs in the output result from merging operations, so cannot compare to input contigs/chromosomes to ensure the output is correct. How do I identify merged contigs in the output fasta, and how do I tell what editing operations were done to generate the merged contigs? My goal here is to end up with the same number of contigs as hg38, excepting unmapped contigs, which appear to be written to the output unchanged. (Can you verify this is the case?) Thank you in advance for your help! — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/mahulchak/quickmerge/issues/74__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5U0oGwbS$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABZQH2CTYQJU4QQCUGTCFCDZNAHCXAVCNFSM6AAAAABLDIOUISVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQYTOMJYGQ4TAMY__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5YX2N8ag$> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

adadiehl · 2024-07-22T15:01:54Z

Thank you for the fast response!

I took another look at the output and am finding no evidence of merging. See the following experiment:

merge_wrapper.py ../../GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.chr1.fa -hco 1 -c 1 -l 0 -v -t 24
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' <(echo "chr1") merged_out.fasta > tmp.txt
md5sum tmp.txt
c7f5ee27a469297fa051ecb355aa2d40  tmp.txt
md5sum hg38.chr1.fa
c7f5ee27a469297fa051ecb355aa2d40  hg38.chr1.fa

(where hg38.chr1.fa is simply the chr1 record from the full hg38 assembly, extracted into its own file)

So it seems chromosome 1 (plus all the contigs from the query assembly) are present in the output, but there is no evidence that chr1 was used to merge anything. It seems like quickmerge is simply passing all the contigs from both files through unchanged. I have tried setting the -hco, -c, and -l params unrealistically low to see if this affects merging but there was no change to the results. Why would this happen?

To make sure I understand correctly, the query assembly is the first argument to merge_wrapper.py and the second argument is the reference. Is that right?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to identify merged contigs in output #74

How to identify merged contigs in output #74

adadiehl commented Jul 18, 2024

mahulchak commented Jul 19, 2024 via email

adadiehl commented Jul 22, 2024 •

edited

Loading

How to identify merged contigs in output #74

How to identify merged contigs in output #74

Comments

adadiehl commented Jul 18, 2024

mahulchak commented Jul 19, 2024 via email

adadiehl commented Jul 22, 2024 • edited Loading

adadiehl commented Jul 22, 2024 •

edited

Loading