-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Treat missing data as Ns not gaps #32
Conversation
Previously the (inlined) root-sequence was inferred from the sampled data, which meant we had stretches of gaps¹ within the root-sequence genome. Switching to the (genbank) references as the root sequence moves the inference of these gaps to an internal (basal) node. ¹ These are gap characters due to missing data at the 3' and 5' ends of segments which are represented as gaps by our usage of `augur align`, which then become internal gaps when the segments are joined.
Most sequences have missing 3' and 5' data (for each segment) which we were expressing as N's. In a typical analysis (in combination with `augur ancestral --keep-ambiguous`, as used here) such terminal gaps would be swapped out to Ns¹ but because we are concatenating segments they are no longer terminal. Replacing gaps with Ns solves this issue, with the side-effect that any true deletions will be represented as Ns. ¹ <https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html#gap-characters>
For the purposes of this build, it might make sense to keep use a root sequence that is closer to the outbreak. otherwise the results table will list all mutations relative to the root of the tree. So maybe we should just freeze the current inferred root and put it into the github repo. |
I made a little branch on top of yours that produces a root sequence with gff. I also hack the clade_membership into the tree -- otherwise nextclade complains... if this version of the tree is on nextstrain.org, we could run this in nextclade with links to the reference.fasta and gff. |
Rn/use root of tree as reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged my PR into yours. I think this is working fine and once we have released the new nextclade version we can introduce a link on the page to the nextclade dataset.
Thanks @rneher - tested again and all looks good :) excited to see this in nextclade shortly |
Most sequences have missing 3' and 5' data (for each segment) which we were expressing as N's. In a typical analysis (in combination with
augur ancestral --keep-ambiguous
, as used here) such terminal gaps would be swapped out to Ns¹ but because we are concatenating segments they are no longer terminal.Replacing gaps with Ns solves this issue, with the side-effect that any true deletions will be represented as Ns.
This is combined with using the (genbank) references as the root-sequence, which previously also had these regions of missing data in it due to the majority of tips having missing data!
¹ https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html#gap-characters
The changes to the dataset are relatively minor, but e.g. looking at nuc 2320 (PB2 3') before (LHS) and after (RHS):