Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reverse complement of a sequence and possible supporting reverse complemented coordinates #15

Open
andrewyatz opened this issue Jun 30, 2021 · 4 comments

Comments

@andrewyatz
Copy link
Collaborator

@ahwagner raised in the VRS/VCF meeting the idea of reverse complementing sequences and making them available in refget. The idea being if you are going to refer to an event on the opposing strand, such as a transcript, then asking for regions in the relative coordinate system would make sense.

From a basic refget POV this would require knowing what the sequence's reverse complement checksum is, converting that into the forward orientation alongside the requested coordinates and then reverse complementing the response.

However this does make the refget /sequence/checksum endpoint somewhat more complicated and brings with it additional semantics that a server may or may not support. This could be communicated via service-info. This problem might not be a seqcol issue but recording the reverse complement checksum could be an additional array.

@sveinugu
Copy link
Collaborator

sveinugu commented Jul 6, 2021

This is traditionally handled by the strand column of a track file, e.g. BED, but it does create all kinds of issues for analysis (in the Genomic HyperBrowser, we have had to include strand-specific code in a bunch of statistics, and it is a common source of errors). However, the alternative would mean that any region-set (or track) would be comprised of two files referring to two different, but reverse complement seqcols, which complicates everything even more.

One could, however, include the reverse compliments as an additional array in the same seqcol, for convenience. It would change the top-level digest of the seqcol. On the other hand, it would provide an elegant solution to #16, as the inclusion of a reverse compliment or not could be used together with the alphabet array to determine whether the sequence collection is single-stranded or double-stranded DNA.

@andrewyatz
Copy link
Collaborator Author

Quite right this is normally a case in point is the annotation has its strand marked up but start/end is expressed on the positive strand.

I think Alex's question was more about how to request reverse complemented regions using refget. That could be broken down into two types of query. The following assumes we have the sequence ACGTACGTAC with a checksum of 45aff2fecf7615d56bc0567dffab9fa8 and are interested in the biological residues 7, 8 and 9 (GTA forward and TAC reverse).

  1. I want the reverse complement of the region expressed on the forward strand e.g. /sequence/id?start=6;end=9;revcomp=true
  2. I have coordinates for the reverse strand and want the DNA as it is represented on the reverse e.g. /sequence/reverse_id?start=1;end=4 (reverse complement of the checksum ID was 2823253800f73d0c16e1540419d75e63`)

The first query is interesting. It could easily be added to the existing refget protocol without too much issue. Also it's not a breaking change for the API. Well unless you're handing back cDNA, CDS, protein, single strand DNA or mRNA etc.

That second query could have been supported by passing in the original id and telling refget these are reversed coordinates so it would have had to have done new_start = length - end and new_end = length - start, then reverse complement the resulting sequence to return what was wanted. But query two would mean there's no need to change the semantics of start and end (start is always less than end)

I think the biggest question for me is will we want to refer to the opposite strand of a sequence as a coordinate system to base knowledge upon. Currently variation in most databases I am aware of is plotted on the forward strand so the default position is always forward. Ensembl maps splicing on the reverse strand but its genomic coordinate space is always with reference to the forward strand i.e. start < end.

@ahwagner I think it would be good to get more informaiton on this.

@sveinugu
Copy link
Collaborator

sveinugu commented Jul 10, 2021

This problem might not be a seqcol issue but recording the reverse complement checksum could be an additional array.

Sorry, I seem to have overlooked this sentence and thought my idea for a solution was novel! :) But I guess it just strengthens this idea.

I think the biggest question for me is will we want to refer to the opposite strand of a sequence as a coordinate system to base knowledge upon.

I have never seen the use of a reversed coordinate system for the opposite strand. I believe the natural API is that the user provides id, start, end, and revcomp, which is anyway directly stored in most track file formats (with strand instead of revcomp).

Writing out your thoughts a bit, I see two natural solutions, with some possible variants:

  1. Refseq handles the reverse complementation of the DNA. This would mean that either:
    1. Refseq does not really know whether the sequence is DNA or not. Given a revcomp parameter, it tries to calculate the reverse compliment on-the-fly by simply extracting the sequence in the provided region and then reverse complement the extracted sequence in a wrapper function. If the sequence is not DNA, it would not work and there would be an error message to the user. Other transformations (e.g. DNA -> protein) could potentially be supported in the same way. @andrewyatz I believe this is what you suggest in solution 1.
    2. Refseq knows whether the sequence is DNA or not. This is information which it is currently suggested we store a the seqcol alphabet array, or possibly a molecule_type array. The reverse complemented strand is still calculated on-the-fly, as in a1, but would only be carried out if the sequence is of the correct type. This is a bit of an implementation detail, but as a developer, I would prefer to add a check of sequence type before reverse complementing instead of just trying to reverse complement any type of sequence and then hope for the best (as in a1). Storing the molecule_type/alphabet in refseq would, however, complicate the relationship between refseq and seqcol, I believe. I am not fully on top of the refseq discussions though.
    3. Refseq knows whether the sequence is DNA, in which case a reverse-complemented sequence is already precalculated. A refseq server would then have to store the link between id and reverse_id and transform a revcomp query on id to a direct query on the reverse_id sequence (your solution 2, I believe). This would also require refseq to know about the length of the sequence (which it already does). As refseq now needs to know about both alphabet/molecule_type, length and the id<->reverse_id mapping, the overlap with seqcol starts to grow.
  2. Refseq does not handle the actual reverse complementation of the DNA sequence itself. It only operates on a given sequence and assumes this to be the forward sequence. Instead, reverse complement functionality is handled by seqcol. This assumes that seqcol stores both an `alphabet` (and/or possibly a molecule_type array) and also stores an reverse_sequence array in the case of double-stranded molecules. seqcol transforms a revcomp query to a regular refseq query on the reverse sequence (solution 2 above), using the length array to calculate the new_start = length - end and new_end = length - start. Adding the complement checksum to seqcol relieves refseq of the burden to manage this and I think it makes logical sense as double-stranded DNA in itself can be naturally thought of as a sequence collection of a positive and a negative strand.

However, there is a complication for solution b. If the user only has access to the refget sequence digest, she/he would first neet to get access to a relevant seqcol digest. There is then a need of an additional lookup-step, where the user would need to select from a list of seqcols containing the sequence id. Any of the seqcols should work, given that it contains the reverse_sequence array, but it would be a bit confusing for the user. One could provide a convenience function in order to hide this complexity. However, it illustrates the complication that the same sequence could be registered both with and without a reverse_sequence, independently of the actual logic of this. E.g. a sequence "ACGACGACG" could be both a DNA and a RNA sequence, even though they would both have the same refseq id. In case of DNA, there is a reverse strand that is naturally contained within the sequence collection. In case of RNA, even though there are obvious uses for calculating the reverse complement, it would not be natural for the reverse strand to be registered with a separate id, either in refseq or as a part of the sequence collection model.

I still lean towards b here, as it feels more elegant. But it also depends on the bigger picture of what kind of information is/should be stored by a refget server, which are details I am not fully on top of (by accident I haven't taken part in the reverse lookup sessions recently). I see from the v1.0 of the refseq specification (https://samtools.github.io/hts-specs/refget.html) that the spec already allows the implementation of circular sequence coordinates (i.e. end < start is allowed), but only if the sequences are circular. Thus, it is assumed that implementations in that case keep track of whether the sequence is circular or not. This is really another example of the same complication we meet here. The same digest might (in theory at least) refer to both a circular and a non-circular sequence, to both a DNA and an RNA sequence, or to a sequence both with and without a reverse complement, and so on. In such cases, should the distinguishing information be stored in the refget or the seqcol context? Given the recent advances, I believe the seqcol context would be the natural choice, and the requirement for the user to have a seqcol id to manage tasks depending on such info would be natural. I believe this would also have consequences for the management of circular coordinates in refseq.

@andrewyatz
Copy link
Collaborator Author

B certainly is far cleaner as it requires no additional API changes to support it. Just that a server needs to be able to do it plus there's no additional "oh it's not the right type of molecule" logic to handle. It's either known or it is unknown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants