Reverse complement of a sequence and possible supporting reverse complemented coordinates #15
Comments
This is traditionally handled by the strand column of a track file, e.g. BED, but it does create all kinds of issues for analysis (in the Genomic HyperBrowser, we have had to include strand-specific code in a bunch of statistics, and it is a common source of errors). However, the alternative would mean that any region-set (or track) would consist of two files referring to two different, but reverse-complemented, seqcols, which complicates everything even more. One could, however, include the reverse complements as an additional array in the same seqcol, for convenience. It would change the top-level digest of the seqcol. On the other hand, it would provide an elegant solution to #16, as the inclusion of a reverse complement or not could be used together with the |
Quite right, this is normally the case. A case in point is an annotation that has its strand marked up but whose start/end is expressed on the positive strand. I think Alex's question was more about how to request reverse complemented regions using refget. That could be broken down into two types of query. The following assumes we have the sequence
The first query is interesting. It could easily be added to the existing refget protocol without too much issue. It is also not a breaking change for the API, unless you're handing back cDNA, CDS, protein, single-strand DNA, mRNA, etc. That second query could have been supported by passing in the original I think the biggest question for me is whether we will want to refer to the opposite strand of a sequence as a coordinate system to base knowledge upon. Currently, variation in most databases I am aware of is plotted on the forward strand, so the default position is always forward. Ensembl maps splicing on the reverse strand, but its genomic coordinate space is always with reference to the forward strand i.e. @ahwagner I think it would be good to get more information on this.
Sorry, I seem to have overlooked this sentence and thought my idea for a solution was novel! :) But I guess it just strengthens this idea.
I have never seen the use of a reversed coordinate system for the opposite strand. I believe the natural API is that the user provides Writing out your thoughts a bit, I see two natural solutions, with some possible variants:
However, there is a complication for solution b. If the user only has access to the refget sequence digest, she/he would first need to get access to a relevant seqcol digest. There is then a need for an additional lookup step, where the user would need to select from a list of seqcols containing the sequence I still lean towards b here, as it feels more elegant. But it also depends on the bigger picture of what kind of information is/should be stored by a refget server, which are details I am not fully on top of (by accident I haven't taken part in the reverse lookup sessions recently). I see from v1.0 of the refget specification (https://samtools.github.io/hts-specs/refget.html) that the spec already allows the implementation of circular sequence coordinates (i.e. |
B certainly is far cleaner, as it requires no additional API changes to support it. A server just needs to be able to do it, and there's no additional "oh it's not the right type of molecule" logic to handle. It's either known or it is unknown.
@ahwagner raised in the VRS/VCF meeting the idea of reverse complementing sequences and making them available in refget. The idea being if you are going to refer to an event on the opposing strand, such as a transcript, then asking for regions in the relative coordinate system would make sense.
From a basic refget POV this would require knowing what the sequence's reverse complement checksum is, converting that into the forward orientation alongside the requested coordinates and then reverse complementing the response.
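To make those steps concrete, here is a minimal Python sketch of the conversion, not a proposal for how a server should implement it. It assumes 0-based, end-exclusive coordinates (as refget sub-sequence requests use), plain DNA letters only, and an in-memory string standing in for the forward sequence that a real server would fetch from its store. The function names are made up for illustration.

```python
# Complement table for plain DNA letters; IUPAC ambiguity codes are not handled here.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coords(start: int, end: int, length: int) -> tuple[int, int]:
    """Map a 0-based, end-exclusive interval on the reverse complement
    onto the equivalent interval on the forward sequence."""
    if not 0 <= start <= end <= length:
        raise ValueError("interval must lie within the sequence")
    return length - end, length - start


def fetch_reverse_strand(forward_seq: str, start: int, end: int) -> str:
    """Answer a request expressed in reverse-complement coordinates
    using only the forward sequence."""
    fwd_start, fwd_end = to_forward_coords(start, end, len(forward_seq))
    # Slice on the forward strand, then reverse complement the response.
    return reverse_complement(forward_seq[fwd_start:fwd_end])
```

The result should be the same as reverse complementing the whole sequence first and slicing afterwards, which is an easy property to test against.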
However this does make the refget /sequence/checksum endpoint somewhat more complicated and brings with it additional semantics that a server may or may not support. This could be communicated via service-info. This problem might not be a seqcol issue but recording the reverse complement checksum could be an additional array.
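As a rough illustration of the "additional array" idea, the Python sketch below computes a digest for each sequence and for its reverse complement and stores them side by side. It assumes the GA4GH sha512t24u scheme (SHA-512, truncated to 24 bytes, base64url-encoded) with an `SQ.` prefix and uppercase input. The array name `reverse_complement_sequences` is purely hypothetical and not part of any spec, and the seqcol top-level digest is deliberately not computed here.

```python
import base64
import hashlib

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (plain letters only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_digest(seq: str) -> str:
    """Digest of a sequence, computed on its uppercase form (assumed convention)."""
    return "SQ." + sha512t24u(seq.upper().encode("ascii"))


def build_arrays(names: list[str], seqs: list[str]) -> dict[str, list]:
    """Build seqcol-like arrays plus a hypothetical reverse complement array."""
    upper = [s.upper() for s in seqs]
    return {
        "names": list(names),
        "lengths": [len(s) for s in upper],
        "sequences": [sequence_digest(s) for s in upper],
        # Hypothetical extra array: one digest per reverse-complemented sequence.
        "reverse_complement_sequences": [
            sequence_digest(reverse_complement(s)) for s in upper
        ],
    }
```

With such an array in place, a server that receives a reverse complement digest could look up the matching forward digest by position. Note that a sequence which is its own reverse complement would get the same digest in both arrays.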