Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat missing data as Ns not gaps #32

Merged
merged 10 commits into from
May 29, 2024
Merged

Treat missing data as Ns not gaps #32

merged 10 commits into from
May 29, 2024

Conversation

jameshadfield
Copy link
Member

Most sequences have missing 3' and 5' data (for each segment) which we were expressing as N's. In a typical analysis (in combination with augur ancestral --keep-ambiguous, as used here) such terminal gaps would be swapped out to Ns¹ but because we are concatenating segments they are no longer terminal.

Replacing gaps with Ns solves this issue, with the side-effect that any true deletions will be represented as Ns.

This is combined with using the (genbank) references as the root-sequence, which previously also had these regions of missing data in it due to the majority of tips having missing data!

¹ https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html#gap-characters


The changes to the dataset are relatively minor, but e.g. looking at nuc 2320 (PB2 3') before (LHS) and after (RHS):

Frame 11

Previously the (inlined) root-sequence was inferred from the sampled
data, which meant we had stretches of gaps¹ within the root-sequence
genome. Switching to the (genbank) references as the root sequence moves
the inference of these gaps to an internal (basal) node.

¹ These are gap characters due to missing data at the 3' and 5' ends of
segments which are represented as gaps by our usage of `augur align`,
which then become internal gaps when the segments are joined.
Most sequences have missing 3' and 5' data (for each segment) which we
were expressing as N's. In a typical analysis (in combination with
`augur ancestral --keep-ambiguous`, as used here) such terminal gaps
would be swapped out to Ns¹  but because we are concatenating segments
they are no longer terminal.

Replacing gaps with Ns solves this issue, with the side-effect that any
true deletions will be represented as Ns.

¹ <https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html#gap-characters>
@jameshadfield jameshadfield requested a review from rneher May 10, 2024 02:22
@rneher
Copy link
Member

rneher commented May 10, 2024

For the purposes of this build, it might make sense to keep use a root sequence that is closer to the outbreak. otherwise the results table will list all mutations relative to the root of the tree. So maybe we should just freeze the current inferred root and put it into the github repo.

@rneher
Copy link
Member

rneher commented May 10, 2024

I made a little branch on top of yours that produces a root sequence with gff. I also hack the clade_membership into the tree -- otherwise nextclade complains...

if this version of the tree is on nextstrain.org, we could run this in nextclade with links to the reference.fasta and gff.

@rneher
Copy link
Member

rneher commented May 10, 2024

Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged my PR into yours. I think this is working fine and once we have released the new nextclade version we can introduce a link on the page to the nextclade dataset.

@jameshadfield
Copy link
Member Author

Thanks @rneher - tested again and all looks good :) excited to see this in nextclade shortly

@jameshadfield jameshadfield merged commit f02c0c1 into master May 29, 2024
6 checks passed
@jameshadfield jameshadfield deleted the james/ref-gaps branch May 29, 2024 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants