Allow annotation of any UniProtKB identifier #471
Comments
I'd disagree: the last two seem easier. Number two would work within what we have now, and I think that H left Minerva to take external services. Number one would require widget-level coordination, contraction, and refactoring that would take quite a bit of effort. |
To clarify, 1 and 2 are standalone solutions. Solution 1 in isolation requires curators to memorize IDs, both for entry and for recognizing IDs in the display. Solution 3 depends on solution 1. The curator still has to inject an ID, but once injected it will be visible via labels in the normal way. |
Additional clarification: here, any UniProtKB ID means both:
Of the canonical IDs, we wish to prioritize GCRPs over non-GCRPs. This is certainly the case for any genome in the QfO set. For genomes outside this set, there may be no GCRP defined. Note that not every GCRP is a Swiss-Prot ID (although the overlap is highest for human). E.g. |
If we went route 3 and plugged in external APIs for either autocomplete (AC) or lookup, mygene is the cleanest, but may not be complete: biothings/mygene.info#19 |
Your characterization of 1 is correct from a curator point of view. 1 is conceptually standalone, but would require either a fair amount of client code coordination or the creation of an uber autocomplete (on the wishlist) which then needs to be uniformly rolled out. I believe that we want to get this solved before that would likely be completed. I'm unsure about the use of mygene out of the box for 3. For example, if I put in "fox2" (http://mygene.info/v3/query?q=fox2), I get a single "close" result for human, whereas amigo gives 3 exact and one partial results and neo gives an SGD exact and a few partials. |
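For anyone wanting to reproduce the comparison above, here is a minimal Python sketch of querying mygene and keeping only exact symbol matches. It assumes the v3 `query` endpoint from the link above and its `q`, `fields`, and `species` parameters; the helper names and the sample response are hypothetical and made up for illustration, not real mygene output.

```python
from urllib.parse import urlencode

# Same endpoint as the link in the comment above.
MYGENE_QUERY_URL = "http://mygene.info/v3/query"


def build_query_url(term, species=None, fields="symbol,name,taxid"):
    """Build a mygene query URL for a free-text term (hypothetical helper)."""
    params = {"q": term, "fields": fields}
    if species is not None:
        params["species"] = species
    return MYGENE_QUERY_URL + "?" + urlencode(params)


def exact_symbol_hits(response, term):
    """Keep only hits whose symbol equals the term, ignoring case."""
    wanted = term.lower()
    return [
        hit for hit in response.get("hits", [])
        if hit.get("symbol", "").lower() == wanted
    ]


# Made-up response in the general shape of a mygene result; not real data.
sample = {
    "hits": [
        {"_id": "0001", "symbol": "FOX2", "taxid": 559292},
        {"_id": "0002", "symbol": "RBFOX2", "taxid": 9606},
    ]
}

print(build_query_url("fox2"))
print(exact_symbol_hits(sample, "fox2"))
```

Filtering on the client side like this only separates exact from partial matches in whatever the service returns; it cannot recover hits the service never ranked, which is the concern raised above.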
good point, it would be better for lookup given an ID rather than for AC |
Note to self: behavior with unknown IDs can be tested with M3ExpressionParserTest |
#471 (comment) |
These issues seem to be the last ones we were looking at in this neighborhood: |
As it currently stands, assuming you have entered something in the GP (enabled_by) entry in the "Add annoton" section, free entry/copying and pasting is allowed for all entries. |
…d to server; extend tests to include tests for no-literal-id checking mode. Reword tests to use smaller ontology in test/resources. Also add a test to ensure that under no circumstances can non-CURIEs like "ABC" be passed through as class IRIs. See #53, #58. The overall context here is checking that we do not have issues when we start to encourage SIB curators to paste in UniProt IDs; see geneontology/noctua#471
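The check described in that commit message can be illustrated with a small Python sketch. This is not Minerva's actual code (which is Java); the regular expression, the function names, and the prefix map are all assumptions chosen for illustration. The only behaviour taken from the thread is that a bare string like "ABC" must never be turned into a class IRI.

```python
import re

# Hypothetical pattern: a prefix, a colon, then a non-empty local ID.
CURIE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_.]*:[A-Za-z0-9_.\-]+")

# Hypothetical prefix map; a real deployment would load its own.
PREFIX_MAP = {"UniProtKB": "http://purl.uniprot.org/uniprot/"}


def is_curie(text):
    """Return True if the text looks like prefix:local_id."""
    return CURIE_PATTERN.fullmatch(text) is not None


def to_class_iri(text, prefix_map=PREFIX_MAP):
    """Expand a CURIE to an IRI, refusing anything that is not a known CURIE."""
    if not is_curie(text):
        raise ValueError(f"not a CURIE: {text!r}")
    prefix, local_id = text.split(":", 1)
    if prefix not in prefix_map:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


print(is_curie("UniProtKB:Q8N4C6"))
print(is_curie("ABC"))
```

Raising an error instead of passing the raw string through is the design choice that matters here: a pasted identifier either resolves to an IRI or is rejected outright.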
To summarize: Number 1 is 'good enough' for a first pass for SIB curators. It turns out that this is possible with the add-annoton wizard. I personally don't use this much, and I've observed a variety of behaviors from different users. I filed #473 which I think will make add-annoton more generally usable. For the next pass, we should think about labels, but we have some breathing space now (would be happy to close this one and start a new ticket) |
For number 1, any further action will take some time or hinge on #261. |
I'm bringing @pmasson55 (Swiss-Prot curator!) into this discussion |
From the BioHackathon, @JervenBolleman also has an interest in this. |
After a few weeks of testing as a curator, the current solution of entering UniProt accessions works as a temporary solution, but it's really limiting for curation. For any model I create there are multiple viral proteins, and using only UniProtKB: XXXXX, the pathway becomes hard to follow and read after the addition of only a few proteins. For usability we need to have protein names. |
I completely agree that we need more than just the UniProtKB protein name, not just in the canvas but also in the Annotation Preview. It defeats the point of the Annotation Preview to have to export the GPAD to be certain that the correct thing has been annotated. Also, it would be nice if the autocomplete in the Add Annoton wizard worked the same way as in the Add Individual. Specifically Add Individual allowed me to use "Q8N4C6-7" as the text I entered to get autocompletion upon, while entering this text into the Add Annoton wizard only offered me ChEBI terms. In contrast, in the Add Annoton wizard, I had to type in "Nin Hs" in order to get the needed ID offered in the autocomplete list. This text worked equally well in the Add Individual option. |
I wanted to revisit this now, so that we can use the SAE (@tmushayahama ) to annotate any UniProt entry. I would suggest that we allow pasting of an identifier, as is done now. But then we can use the UniProt web services to programmatically retrieve the gene name, species etc. that we can then populate onto the gene product instance. So this information could then be viewed in Noctua. Using web services would be straightforward for the "canonical" UniProt entry, but @pmasson55, can you suggest how to get the relevant information for "chains" from the UniProt web services? What is an example identifier? |
Hi, |
Just to confirm, the input from UniProt can be an entry (primary) accession, an isoform accession, or a chain identifier? In that case, the following SPARQL query gives just the information required and nothing extra. There are also no cases where it can give a wrong answer. All it requires is replacing the '_INPUT_' placeholder with the identifier.
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
SELECT
  ?protein
  ?organism
  (GROUP_CONCAT(?geneLabel;separator=" , ") AS ?geneLabels)
WHERE
{
  BIND('_INPUT_' AS ?input)
  BIND(
    IF(strstarts(?input, 'PRO_'),
      IRI(CONCAT('http://purl.uniprot.org/annotation/', ?input)),
      IF (contains(?input, '-'),
        IRI(CONCAT('http://purl.uniprot.org/isoform/', ?input)),
        IRI(CONCAT('http://purl.uniprot.org/uniprot/', ?input))
      )
    ) AS ?start)
  {
    ?protein a up:Protein .
    FILTER(sameTerm(?protein, ?start))
  } UNION {
    ?protein up:annotation ?annotation .
    FILTER(sameTerm(?annotation, ?start))
  } UNION {
    ?protein up:sequence ?sequence .
    FILTER(sameTerm(?sequence, ?start))
  }
  ?protein up:organism ?organism ;
           up:encodedBy/skos:prefLabel ?geneLabel .
}
GROUP BY ?protein ?organism
The REST service at [www.uniprot.org] can provide the same answer (probably a bit faster), but there are some edge cases where it might produce the wrong answer for things that look like a UniProt AC but are not. The main unexpected issue in both cases is that you might have more than one gene per UniProt record. If you are happy with the risk of wrong answers then
Here you will need to break the values in the genes column on the first space to get the 'best' gene name. |
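The two steps described in this comment can be sketched in Python: choosing the IRI for a pasted identifier the same way the query's nested IF does, and taking the text before the first space in the genes column as the 'best' gene name. The function names and the example values are hypothetical; only the three IRI prefixes and the first-space rule come from the comment itself.

```python
def uniprot_iri(identifier):
    """Mirror the BIND/IF logic of the SPARQL query above."""
    if identifier.startswith("PRO_"):
        # Chain identifiers live under the annotation namespace.
        return "http://purl.uniprot.org/annotation/" + identifier
    if "-" in identifier:
        # A hyphen marks an isoform accession such as Q8N4C6-7.
        return "http://purl.uniprot.org/isoform/" + identifier
    # Everything else is treated as an entry (primary) accession.
    return "http://purl.uniprot.org/uniprot/" + identifier


def primary_gene_name(genes_column):
    """Return the text before the first space, or None if the column is empty."""
    parts = genes_column.strip().split(" ", 1)
    return parts[0] or None


# Illustrative inputs; the chain ID and the gene names are placeholders.
for example in ("Q8N4C6", "Q8N4C6-7", "PRO_0000000001"):
    print(example, "->", uniprot_iri(example))

print(primary_gene_name("GENE1 ALIAS1 ALIAS2"))
```

Like the REST route it accompanies, this does no validation, so a string that merely looks like an accession still gets an IRI. The first-space rule also picks a single name even when the entry has more than one gene.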
Yes - but hopefully a single gene name and species (ie a single entity). Pascale |
@pgaudet there is a 1:n relation of entry to gene name (in rare cases), and a 1:1 relation from entry to species. |
But there is a primary name? |
@pgaudet one UniProt entry can have multiple genes, e.g. http://www.uniprot.org/uniprot/Q9ULZ0 (the same is true for neXtProt, by the way). See the results of this SPARQL query for the human Swiss-Prot entries:
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
SELECT
?protein
(COUNT(?gene) AS ?genes)
FROM <http://sparql.uniprot.org/uniprot>
WHERE
{
?protein a up:Protein ;
up:organism taxon:9606 ;
up:reviewed true ;
up:encodedBy ?gene
}
GROUP BY ?protein
HAVING (COUNT(?gene) > 1) |
Is there a way to query for whether a protein is a GCRP? These should generally have one gene |
S. cerevisiae has several gene pairs (that are part of the GCRP) that share a UniProt identifier. |
Hi @pmasson55 ! Would you ever need to annotate to any unreviewed UniProt ID? |
Hello, It may happen but it's very rare. I usually try to gather annotations in the best representative genomes that are well annotated. So, I would say it represents less than 1% of the annotations in conventional GO. |
It is very likely that if a Swiss-Prot curator annotates an unreviewed UniProt accession, it will become a reviewed UniProt accession by the next release.
|
OK great thanks @pmasson55 and @JervenBolleman ! So that means as long as the reviewed IDs are kept up-to-date and loaded into NEO, most of the desired UniProt IDs will resolve to a label and be available in the autocomplete. |
I'd like to add this to the agenda for our GO-CAM call on Wednesday, April 25th, just to make sure we're all clear on the allowed and expected behaviors for annotating to UniProtKB accessions. Thx. |
@alexsign will look into providing the GPI for all the Swiss-Prot entries. |
@pgaudet Is this now functionally a dupe of geneontology/neo#82? |
Replaced by geneontology/neo#82 |
This could be one of a number of different ways
It would seem 1 is easiest; we should target this first. I believe we discussed this at the Geneva Noctua workshop and implemented it, but it appears to have gone.
To be discussed further today @kltm @vanaukenk @ukemi @pgaudet