Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to identify merged contigs in output #74

Open
adadiehl opened this issue Jul 18, 2024 · 2 comments
Open

How to identify merged contigs in output #74

adadiehl opened this issue Jul 18, 2024 · 2 comments

Comments

@adadiehl
Copy link

I am attempting to use quickmerge in an attempt to merge contigs from a long-read assembly into chromosome-length scaffolds using the hg38 reference genome. Please see the example command below:

merge_wrapper.py GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.fa -hco 5 -c 1.5 -l 500000 -v -t 24

The long-read assembly is from (here)[https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta] and hg38.fa is straight from UCSC.

Looking at the results, I am unable to tell which contigs in the output result from merging operations, so cannot compare to input contigs/chromosomes to ensure the output is correct. How do I identify merged contigs in the output fasta, and how do I tell what editing operations were done to generate the merged contigs? My goal here is to end up with the same number of contigs as hg38, excepting unmapped contigs, which appear to be written to the output unchanged. (Can you verify this is the case?)

Thank you in advance for your help!

@mahulchak
Copy link
Owner

mahulchak commented Jul 19, 2024 via email

@adadiehl
Copy link
Author

adadiehl commented Jul 22, 2024

Thank you for the fast response!

I took another look at the output and am finding no evidence of merging. See the following experiment:

merge_wrapper.py ../../GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.chr1.fa -hco 1 -c 1 -l 0 -v -t 24
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' <(echo "chr1") merged_out.fasta > tmp.txt
md5sum tmp.txt
c7f5ee27a469297fa051ecb355aa2d40  tmp.txt
md5sum hg38.chr1.fa
c7f5ee27a469297fa051ecb355aa2d40  hg38.chr1.fa

(where hg38.chr1.fa is simply the chr1 record from the full hg38 assembly, extracted into its own file)

So it seems chromosome 1 (plus all the contigs from the query assembly) are present in the output, but there is no evidence that chr1 was used to merge anything. It seems like quickmerge is simply passing all the contigs from both files through unchanged. I have tried setting the -hco, -c, and -l params unrealistically low to see if this affects merging but there was no change to the results. Why would this happen?

To make sure I understand correctly, the query assembly is the first argument to merge_wrapper.py and the second argument is the reference. Is that right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants