Dedup ncbi #96

joverlee521 · 2024-10-11T23:56:08Z

Description of proposed changes

Dedup the records by sample id within the strain name of GenBank/Andersen lab records.

For example, the following strain names share the sample id 24-005334-001:

A/chicken/Ohio/24-005334-001/2024
A/Chicken/USA/24-005334-001-original/2024

Only the first record in each set of duplicates is kept. When there are duplicates within a data source, the earliest released record is kept. When there are duplicates between GenBank and Andersen lab records, the GenBank record is kept.
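The keep-first rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the regular expression for the sample id, the record keys (`strain`, `source`, `date_released`), and the source labels are all assumptions made for the example.

```python
import re

# Assumed shape of a sample id such as "24-005334-001"; the real pipeline
# may extract the id from the strain name with a different rule.
SAMPLE_ID = re.compile(r"\d{2}-\d{6}-\d{3}")

# Hypothetical source labels: GenBank wins over Andersen lab records.
SOURCE_PRIORITY = {"genbank": 0, "andersen-lab": 1}


def sample_id(strain):
    """Return the sample id embedded in a strain name, or None if absent."""
    match = SAMPLE_ID.search(strain)
    return match.group(0) if match else None


def dedup_records(records):
    """Keep one record per sample id.

    Records are ordered so GenBank comes first and, within a source,
    the earliest release date (ISO 8601 string) comes first; the first
    record seen for each sample id is the one kept.
    """
    ordered = sorted(
        records,
        key=lambda r: (SOURCE_PRIORITY[r["source"]], r["date_released"]),
    )
    seen = set()
    kept = []
    for record in ordered:
        key = sample_id(record["strain"])
        if key is None:
            # No recognizable sample id: nothing to dedup against.
            kept.append(record)
        elif key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Sorting once and then keeping the first record per key covers both cases with a single pass: duplicates within a source and duplicates across sources.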

Related issue(s)

Resolves #95

Checklist

Checks pass
Trial run

Resolves #95. After appending the Andersen lab metadata to the NCBI metadata, dedup the records by sample id within the strain name. The sequences are then filtered by the final strain names in the metadata TSV.
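A minimal sketch of the sequence-filtering step, assuming the metadata TSV has a `strain` column and that each FASTA header is exactly the strain name. Both are assumptions for illustration; the workflow itself may use different tooling for this step.

```python
import csv
import io


def read_strains(metadata_tsv):
    """Collect the strain names listed in a metadata TSV (given as text)."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return {row["strain"] for row in reader}


def filter_fasta(fasta_text, keep):
    """Return only the FASTA records whose header is in `keep`."""
    out = []
    writing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header line decides whether the following lines are kept.
            writing = line[1:].strip() in keep
        if writing:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```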

In investigating the duplicates dropped from the joined-ncbi metadata, I realized that these duplicates were not purely from the merge of the two data sources. This commit deduplicates by sample id in the upstream metadata as well. There's no need to change the processing of sequence FASTAs at this point because they are still matched by their respective accessions instead of strain name.

joverlee521 · 2024-10-11T23:57:59Z

I still want to do some more testing, but locally I see a difference in records in the final metadata TSVs.

$ for file in */results/metadata.tsv; do wc -l "$file"; done
    1577 andersen-lab-master/results/metadata.tsv
    1506 andersen-lab/results/metadata.tsv
    1892 joined-ncbi-master/results/metadata.tsv
    1639 joined-ncbi/results/metadata.tsv
    1373 ncbi-master/results/metadata.tsv
    1318 ncbi/results/metadata.tsv
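For reference, the per-dataset differences implied by these line counts can be worked out as below. The pairing of each `*-master` directory with its branch counterpart is an assumption based on the directory names. The header line is counted on both sides, so it cancels out.

```python
# (lines on master, lines on this branch), copied from the wc -l output above.
counts = {
    "andersen-lab": (1577, 1506),
    "joined-ncbi": (1892, 1639),
    "ncbi": (1373, 1318),
}

# Number of records dropped per dataset.
dropped = {name: before - after for name, (before, after) in counts.items()}

for name, n in dropped.items():
    print(f"{name}: {n} fewer records")
```

The joined-ncbi difference works out to 253, which agrees with the number of duplicates reported after the trial run.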

joverlee521 · 2024-10-14T18:44:23Z

Trial run completed successfully.

A total of 253 duplicate records were removed (see dropped-dups.tsv.txt).

joverlee521 · 2024-10-14T18:46:56Z

I plan to merge this Wednesday morning if there are no comments.

trvrb · 2024-10-14T18:59:24Z

Thanks so much Jover! A couple spot checks looked appropriate (removing duplicates on master and keeping just the Genbank version). I might spend a couple more minutes with it, but if I don't get back to it, please do merge on Wednesday.

genehack

Any time I see something like this, I kinda wonder about effort-versus-return of adding in a check to make sure that the sequences for the "identical" strains are also identical…

joverlee521 · 2024-10-16T16:43:14Z

> Any time I see something like this, I kinda wonder about effort-versus-return of adding in a check to make sure that the sequences for the "identical" strains are also identical…

@genehack in some cases the sequences are not identical even if they are the same sample because of different sequencing/assembly methods. Even if the sequences are different, we'd still want to dedup the data so that we have unique samples. I think the extra effort would then be to cross-check metadata and sequence quality to use the "better" record.
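A rough sketch of the kind of check discussed here, assuming the records are already available as `(sample_id, sequence)` pairs. That input shape is hypothetical, and nothing like this is part of the PR. It only flags samples whose duplicate records carry different sequences; deciding which record is "better" would still need the metadata and quality cross-check mentioned above.

```python
from collections import defaultdict


def samples_with_differing_sequences(records):
    """Return the sample ids whose duplicate records have different sequences.

    `records` is an iterable of (sample_id, sequence) pairs; sequences are
    compared case-insensitively.
    """
    by_sample = defaultdict(set)
    for sid, seq in records:
        by_sample[sid].add(seq.upper())
    return sorted(sid for sid, seqs in by_sample.items() if len(seqs) > 1)
```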

AngieHinrichs · 2024-10-16T16:46:55Z

Just wondering, had you seen https://github.com/andersen-lab/avian-influenza/blob/master/metadata/genbank_mapping.tsv and if so is it not doing a good enough job?

joverlee521 · 2024-10-16T20:27:17Z

Thanks for the pointer @AngieHinrichs! I don't think I've seen the genbank_mapping.tsv, that is extremely helpful!

We could update the pipeline to use the genbank_mapping.tsv to join SRA/Andersen lab records with GenBank records instead of our current method of joining by SRA accessions! I haven't fully explored the data, but it looks like the genbank_mapping.tsv file maps the duplicate SRA records to the same GenBank records, so it should resolve duplicate samples between the SRA and GenBank records.

I do wonder if we would still need to dedup within the GenBank data that are not present in the SRA/Andersen lab records.

AngieHinrichs · 2024-10-16T21:43:18Z

Great! It just matches SRA accessions and ignores the -original (and some other suffixes) that can get in the way. If you find any problems with it, let me know and I should be able to fix it. (It's generated automatically by a script that I contributed.) I've been using it for my UShER tree builds.

joverlee521 added 2 commits October 11, 2024 16:23

ingest/joined-ncbi: Dedup by sample id in strain

58c0508


joverlee521 marked this pull request as ready for review October 14, 2024 18:45

genehack approved these changes Oct 15, 2024


joverlee521 merged commit 01fa60e into master Oct 16, 2024
14 checks passed

joverlee521 deleted the dedup-ncbi branch October 16, 2024 16:36

joverlee521 mentioned this pull request Oct 16, 2024

Use genbank_mapping.tsv to join GenBank and SRA/Andersen lab records #97

Open
