Treat missing data as Ns not gaps #32

jameshadfield · 2024-05-10T02:21:59Z

Most sequences have missing 3' and 5' data (for each segment), which we were expressing as gaps. In a typical analysis (in combination with `augur ancestral --keep-ambiguous`, as used here) such terminal gaps would be swapped out for Ns¹, but because we are concatenating segments they are no longer terminal.

Replacing gaps with Ns solves this issue, with the side-effect that any true deletions will be represented as Ns.

This is combined with using the (GenBank) references as the root sequence. The previous, inferred root sequence also contained these regions of missing data, because the majority of tips have missing data there!

¹ https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html#gap-characters
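The gap-to-N replacement described above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names `gaps_to_ns` and `mask_alignment` are hypothetical, and it assumes a plain FASTA alignment in which `-` is the only gap character.

```python
def gaps_to_ns(sequence: str) -> str:
    """Replace every gap character with N, leaving all other characters untouched.

    Note the side-effect mentioned above: true deletions also become Ns.
    """
    return sequence.replace("-", "N")


def mask_alignment(fasta_text: str) -> str:
    """Apply gaps_to_ns to the sequence lines of FASTA text.

    Header lines (starting with '>') are kept as-is, so a '-' inside a
    strain name is not modified.
    """
    out = []
    for line in fasta_text.splitlines():
        out.append(line if line.startswith(">") else gaps_to_ns(line))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical two-record alignment, for illustration only.
    example = ">strain-A\n--ACGT--\n>strain-B\nA-ACGTTT\n"
    print(mask_alignment(example), end="")
```

Because the replacement is applied to every gap, not only those at segment boundaries, it cannot distinguish missing data from real deletions, which is the trade-off noted in the description.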

The changes to the dataset are relatively minor, but e.g. looking at nuc 2320 (PB2 3') before (LHS) and after (RHS):

Previously the (inlined) root-sequence was inferred from the sampled data, which meant we had stretches of gaps¹ within the root-sequence genome. Switching to the (GenBank) references as the root sequence moves the inference of these gaps to an internal (basal) node.

¹ These are gap characters due to missing data at the 3' and 5' ends of segments, which are represented as gaps by our usage of `augur align` and which then become internal gaps when the segments are joined.
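To make the point about joined segments concrete, here is a small sketch showing how gaps that are terminal within each segment become internal once the segments are concatenated. The helper names and the toy sequences are hypothetical and only illustrate the effect; they are not taken from the workflow.

```python
def terminal_gap_count(seq: str) -> int:
    """Count gap characters in the leading and trailing gap runs of seq."""
    if seq.strip("-") == "":
        # The whole sequence is gaps (or empty): everything is terminal.
        return len(seq)
    leading = len(seq) - len(seq.lstrip("-"))
    trailing = len(seq) - len(seq.rstrip("-"))
    return leading + trailing


def internal_gap_count(seq: str) -> int:
    """Count gap characters that are not part of a leading or trailing run."""
    return seq.count("-") - terminal_gap_count(seq)


if __name__ == "__main__":
    # Two toy aligned segments with missing data at both ends.
    segments = ["--ACGT--", "-GGCC---"]
    genome = "".join(segments)  # "--ACGT---GGCC---"

    for seg in segments:
        print(seg, "internal gaps:", internal_gap_count(seg))
    print(genome, "internal gaps:", internal_gap_count(genome))
```

Each segment on its own has only terminal gaps, but the trailing gaps of the first segment and the leading gap of the second end up in the middle of the joined genome, where a tool that only special-cases terminal gaps would treat them as deletions.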

rneher · 2024-05-10T06:34:37Z

For the purposes of this build, it might make sense to use a root sequence that is closer to the outbreak. Otherwise the results table will list all mutations relative to the root of the tree. So maybe we should just freeze the current inferred root and put it into the GitHub repo.

rneher · 2024-05-10T17:41:09Z

I made a little branch on top of yours that produces a root sequence with a GFF. I also hack the clade_membership into the tree -- otherwise Nextclade complains...

If this version of the tree is on nextstrain.org, we could run this in Nextclade with links to the reference.fasta and GFF.

rneher · 2024-05-10T17:41:13Z

https://github.com/nextstrain/avian-flu/tree/rn/use-root-of-tree-as-reference

rneher

I merged my PR into yours. I think this is working fine and once we have released the new nextclade version we can introduce a link on the page to the nextclade dataset.

jameshadfield · 2024-05-29T20:55:11Z

Thanks @rneher - tested again and all looks good :) excited to see this in nextclade shortly

jameshadfield added 2 commits May 10, 2024 13:43

jameshadfield requested a review from rneher May 10, 2024 02:22

rneher added 3 commits May 10, 2024 18:40

add and use inferred root as reference

bd435a0

add gff3

22c3c8f

hack clade annotation into the tree

8a56942

rneher added 5 commits May 19, 2024 11:32

remove misleading info from inferred ancestral seq of cattle outbreak

70e4d0b

add nextclade parameters to auspice config

f15bf37

remove clade annotation and files dict in metadata extension

6e48e73

remove unused files

273d55b

Merge pull request #33 from nextstrain/rn/use-root-of-tree-as-reference

b623aed

rneher approved these changes May 29, 2024

jameshadfield merged commit f02c0c1 into master May 29, 2024
6 checks passed

jameshadfield deleted the james/ref-gaps branch May 29, 2024 20:55
