Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedup ncbi #96

Merged
merged 2 commits into from
Oct 16, 2024
Merged

Dedup ncbi #96

merged 2 commits into from
Oct 16, 2024

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Oct 11, 2024

Description of proposed changes

Dedup the records by sample id within the strain name of GenBank/Andersen lab records.

For example, the following strain names have duplicate sample id 24-005334-001

  • A/chicken/Ohio/24-005334-001/2024
  • A/Chicken/USA/24-005334-001-original/2024

Only keeps the first record of duplicates. When there are dups within a data source, the earliest released record is kept. When there are dups between GenBank and Andersen lab records, the GenBank record is kept.

Related issue(s)

Resolves #95

Checklist

Resolves <#95>

After appending the Andersen lab metadata to the NCBI metadata, dedup
the records by sample id within the strain name. Then the sequences are
filtered by the final strain names in the metadata TSV.
In investigating the duplicates dropped from the joined-ncbi metadata,
I realized that these duplicates were not purely from the merge of
the two data sources.

This commit deduplicates by sample id in the upstream metadata as well.
There's no need to change the processing of sequence FASTAs at this
point because they are still matched by their respective accessions
instead of strain name.
@joverlee521
Copy link
Contributor Author

I still want to do some more testing, but locally I see a difference in records in the final metadata TSVs.

$ for file in */results/metadata.tsv; do wc -l "$file"; done
    1577 andersen-lab-master/results/metadata.tsv
    1506 andersen-lab/results/metadata.tsv
    1892 joined-ncbi-master/results/metadata.tsv
    1639 joined-ncbi/results/metadata.tsv
    1373 ncbi-master/results/metadata.tsv
    1318 ncbi/results/metadata.tsv

@joverlee521
Copy link
Contributor Author

Trial run completed successfully.

A total of 253 duplicate records were removed (see dropped-dups.tsv.txt).

@joverlee521 joverlee521 marked this pull request as ready for review October 14, 2024 18:45
@joverlee521
Copy link
Contributor Author

I plan to merge this Wednesday morning if there are no comments.

@trvrb
Copy link
Member

trvrb commented Oct 14, 2024

Thanks so much Jover! A couple spot checks looked appropriate (removing duplicates on master and keeping just the Genbank version). I might spend a couple more minutes with it, but if I don't get back to it, please do merge on Wednesday.

Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time I see something like this, I kinda wonder about effort-versus-return of adding in a check to make sure that the sequences for the "identical" strains are also identical…

@joverlee521 joverlee521 merged commit 01fa60e into master Oct 16, 2024
14 checks passed
@joverlee521 joverlee521 deleted the dedup-ncbi branch October 16, 2024 16:36
@joverlee521
Copy link
Contributor Author

Any time I see something like this, I kinda wonder about effort-versus-return of adding in a check to make sure that the sequences for the "identical" strains are also identical…

@genehack in some cases the sequences are not identical even if they are the same sample because of different sequencing/assembly methods. Even if the sequences are different, we'd still want to dedup the data so that we have unique samples. I think the extra effort would then be to cross-check metadata and sequence quality to use the "better" record.

@AngieHinrichs
Copy link

Just wondering, had you seen https://github.com/andersen-lab/avian-influenza/blob/master/metadata/genbank_mapping.tsv and if so is it not doing a good enough job?

@joverlee521
Copy link
Contributor Author

Thanks for the pointer @AngieHinrichs! I don't think I've seen the genbank_mapping.tsv, that is extremely helpful!

We could update the pipeline to use the genbank_mapping.tsv to join SRA/Andersen lab records with GenBank records instead of our current method of joining by SRA accessions! I haven't fully explored the data, but looks like the genbank_mapping.tsv files maps the duplicate SRA records to the same GenBank records so it should resolve duplicate samples between the SRA and GenBank records.

I do wonder if we would still need to dedup within the GenBank data that are not present in the SRA/Andersen lab records.

@AngieHinrichs
Copy link

Great! It just matches SRA accessions and ignores the -original (and some other suffixes) that can get in the way. If you find any problems with it, let me know and I should be able to fix it. (It's generated automatically by a script that I contributed) I've been using it for my UShER tree builds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We have a large number of duplicated strains
4 participants