-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dedup ncbi #96
Dedup ncbi #96
Conversation
Resolves <#95> After appending the Andersen lab metadata to the NCBI metadata, dedup the records by sample id within the strain name. Then the sequences are filtered by the final strain names in the metadata TSV.
In investigating the duplicates dropped from the joined-ncbi metadata, I realized that these duplicates were not purely from the merge of the two data sources. This commit deduplicates by sample id in the upstream metadata as well. There's no need to change the processing of sequence FASTAs at this point because they are still matched by their respective accessions instead of strain name.
I still want to do some more testing, but locally I see a difference in records in the final metadata TSVs.
|
Trial run completed successfully. A total of 253 duplicate records were removed (see dropped-dups.tsv.txt). |
I plan to merge this Wednesday morning if there are no comments. |
Thanks so much Jover! A couple spot checks looked appropriate (removing duplicates on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time I see something like this, I kinda wonder about effort-versus-return of adding in a check to make sure that the sequences for the "identical" strains are also identical…
@genehack in some cases the sequences are not identical even if they are the same sample because of different sequencing/assembly methods. Even if the sequences are different, we'd still want to dedup the data so that we have unique samples. I think the extra effort would then be to cross-check metadata and sequence quality to use the "better" record. |
Just wondering, had you seen https://github.com/andersen-lab/avian-influenza/blob/master/metadata/genbank_mapping.tsv and if so is it not doing a good enough job? |
Thanks for the pointer @AngieHinrichs! I don't think I've seen the genbank_mapping.tsv, that is extremely helpful! We could update the pipeline to use the genbank_mapping.tsv to join SRA/Andersen lab records with GenBank records instead of our current method of joining by SRA accessions! I haven't fully explored the data, but looks like the genbank_mapping.tsv files maps the duplicate SRA records to the same GenBank records so it should resolve duplicate samples between the SRA and GenBank records. I do wonder if we would still need to dedup within the GenBank data that are not present in the SRA/Andersen lab records. |
Great! It just matches SRA accessions and ignores the -original (and some other suffixes) that can get in the way. If you find any problems with it, let me know and I should be able to fix it. (It's generated automatically by a script that I contributed) I've been using it for my UShER tree builds. |
Description of proposed changes
Dedup the records by sample id within the strain name of GenBank/Andersen lab records.
For example, the following strain names have duplicate sample id
24-005334-001
Only keeps the first record of duplicates. When there are dups within a data source, the earliest released record is kept. When there are dups between GenBank and Andersen lab records, the GenBank record is kept.
Related issue(s)
Resolves #95
Checklist