Allow annotation of any UniProtKB identifier #471

cmungall · 2017-08-16T17:44:57Z

This could be done in one of a number of different ways:

1. Allow pasting in of a UniProtKB ID in all the various relevant forms (annoton wizards, add individual, TableMode, complex editor), regardless of whether it is in neo
2. Load all of UniProt into neo
3. Change the architecture such that external autocomplete services can be used and the client can inject TBox axioms (minimally rdfs:label)

It would seem 1 is easiest; we should target this first. I believe we discussed this at the Geneva Noctua workshop and implemented it, but it appears to have gone.

To be discussed further today @kltm @vanaukenk @ukemi @pgaudet

kltm · 2017-08-16T17:56:57Z

I'd disagree--the last two seem easier. Number two would work within what we have now, and I think that H left Minerva to take external services. Number one would require a widget-level coordination, contraction, and refactoring that would take quite a bit of effort.

cmungall · 2017-08-23T21:02:16Z

To clarify, 1 and 2 are standalone solutions. Solution 1 in isolation requires curators to memorize IDs, both for entry and for recognizing IDs in the display.

Solution 3 depends on solution 1. The curator still has to inject an ID, but once injected it will be visible via labels in the normal way.

cmungall · 2017-08-23T21:07:15Z

Additional clarification: here "any UniProtKB ID" means both:

- so-called canonical IDs, e.g. P12345
- isoform IDs, e.g. P12345-2

Of the canonical IDs, we wish to prioritize GCRPs over non-GCRPs. This is certainly the case for any genome in the QfO set. For genomes outside this set, there may be no GCRP defined.

Note that not every GCRP is a Swiss-Prot ID (although the overlap is highest for human).

E.g. A0A075B7H6 is in the TrEMBL subset but it is a GCRP.
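
A minimal Python sketch of telling the two identifier forms apart, assuming the accession pattern published in the UniProt help pages. The function name `classify_uniprot_id` and its return labels are hypothetical and are not part of Noctua or Minerva. The check is purely syntactic: it cannot tell whether an accession exists, is reviewed, or is a GCRP.

```python
import re

# Accession pattern as documented by UniProt (6 or 10 characters).
_ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)
_CANONICAL_RE = re.compile(_ACCESSION)
# Isoform IDs are an accession followed by a hyphen and a number.
_ISOFORM_RE = re.compile(_ACCESSION + r"-[0-9]+")


def classify_uniprot_id(identifier: str) -> str:
    """Return 'canonical', 'isoform', or 'unknown' (hypothetical labels)."""
    if _CANONICAL_RE.fullmatch(identifier):
        return "canonical"
    if _ISOFORM_RE.fullmatch(identifier):
        return "isoform"
    return "unknown"
```

With the examples from this comment, `classify_uniprot_id("P12345")` and `classify_uniprot_id("A0A075B7H6")` are classified as canonical and `classify_uniprot_id("P12345-2")` as an isoform.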

cmungall · 2017-08-23T21:32:08Z

If we went route 3 and plugged in external APIs for either AC or lookup, mygene is the cleanest, but may not be complete: biothings/mygene.info#19

kltm · 2017-08-23T21:43:56Z

Your characterization of 1 is correct from a curator point of view. 1 is conceptually standalone, but would require either a fair amount of client code coordination or the creation of an uber autocomplete (on the wishlist) which then needs to be uniformly rolled out. I believe that we want to get this solved before that would likely be completed.
3 still requires 1's "any id" policy, but users would still possibly have the labels after minerva did a lookup, so that is an improvement.
2 is the cleanest in that we keep everything in-house, which also makes things like label updates, a possibility in 3, pretty much a non-issue.

I'm unsure about the use of mygene out of the box for 3. For example, if I put in "fox2" (http://mygene.info/v3/query?q=fox2), I get a single "close" result for human, whereas amigo gives 3 exact and one partial results and neo gives an SGD exact and a few partials.

cmungall · 2017-08-23T22:40:32Z

I'm unsure about the use of mygene out of the box

good point, it would be better for lookup given an ID rather than for AC

cmungall · 2017-08-23T23:23:09Z

Note to self: behavior with unknown IDs can be tested with M3ExpressionParserTest

kltm · 2017-08-23T23:24:22Z

#471 (comment)
Does that include round-tripping, or just initial ingest?

kltm · 2017-08-23T23:31:03Z

These issues seem to be the last ones we were looking at in this neighborhood:
geneontology/minerva#53
geneontology/minerva#58
We currently run with run-minerva-no-validation, which is lookup: yes and validation: no.

kltm · 2017-08-23T23:44:15Z

As it currently stands, assuming you have entered something in the GP (enabled_by) entry in the "Add annoton" section, free entry/copying and pasting is allowed for all entries.
Everything that is not in the "Add annoton" section is restricted to a selection.

…d to server; extend tests to include tests for no-literal-id checking mode. Reword tests to use smaller ontology in test/resources. Also add a test to ensure that under no circumstances can non-CURIEs like "ABC" be passed through as class IRIs. See #53, #58. The overall context here is checking we do not have issues when we start to encourage SIB curators to paste in UniProt IDs; see geneontology/noctua#471

cmungall · 2017-08-24T00:07:18Z

To summarize:

Number 1 is 'good enough' for a first pass for SIB curators. It turns out that this is possible with the add-annoton wizard. I personally don't use this much, and I've observed a variety of behaviors from different users. I filed #473 which I think will make add-annoton more generally usable.

For the next pass, we should think about labels, but we have some breathing space now (would be happy to close this one and start a new ticket)

kltm · 2017-08-24T20:46:21Z

For number 1, any further action will take some time or hinge on #261.

thomaspd · 2017-09-11T10:30:53Z

I'm bringing @pmasson55 (Swiss-Prot curator!) into this discussion

kltm · 2017-09-11T11:08:24Z

From the BioHackathon, @JervenBolleman also has an interest in this.

pmasson55 · 2017-10-17T09:40:33Z

After a few weeks of testing as a curator, the current solution of entering UniProt accessions works as a temporary solution, but it’s really limiting for curation. For any model I create there are multiple viral proteins, and using only UniProtKB:XXXXX, the pathway becomes hard to follow and read after the addition of only a few proteins. For usability we need to have protein names.
Moreover, for viruses we need more than just the UniProtKB protein names. Since viruses encode proteins as polyproteins that get cleaved post-translationally, the entities we really need to annotate are the “chains”. For a single viral entry, there can be many different chains, each representing an individual functional unit. It would be good to be able to annotate these chains individually (by also loading them into neo from UniProtKB?).

krchristie · 2017-10-31T23:49:01Z

I completely agree that we need more than just the UniProtKB protein name, not just in the canvas but also in the Annotation Preview. It defeats the point of the Annotation Preview to have to export the GPAD to be certain that the correct thing has been annotated.

Also, it would be nice if the autocomplete in the Add Annoton wizard worked the same way as in the Add Individual. Specifically Add Individual allowed me to use "Q8N4C6-7" as the text I entered to get autocompletion upon, while entering this text into the Add Annoton wizard only offered me ChEBI terms. In contrast, in the Add Annoton wizard, I had to type in "Nin Hs" in order to get the needed ID offered in the autocomplete list. This text worked equally well in the Add Individual option.

thomaspd · 2018-03-27T23:01:38Z

I wanted to revisit this now, so that we can use the SAE (@tmushayahama ) to annotate any UniProt entry. I would suggest that we allow pasting of an identifier, as is done now. But then we can use the UniProt web services to programmatically retrieve the gene name, species etc. that we can then populate onto the gene product instance. So this information could then be viewed in Noctua. Using web services would be straightforward for the "canonical" UniProt entry, but @pmasson55, can you suggest how to get the relevant information for "chains" from the UniProt web services? What is an example identifier?

pmasson55 · 2018-03-28T06:50:43Z

Hi,
Concerning chains, Poliovirus is a good example: http://www.uniprot.org/uniprot/P03300
If you go to the PTM/processing part of the entry, there is a list of all the chains. For example, Capsid protein VP0 has the chain identifier PRO_0000424688. In ProteintoGO, we usually type the name of the entry followed by the name of the chain. So if we want to annotate this VP0 chain specifically, we would have this identifier: P03300:PRO_0000424688. For the technical retrieval issue, I can ask Nicole or Jerven.
Hope this helps and answers your question.
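
A minimal Python sketch of splitting the combined entry-plus-chain form described above (`P03300:PRO_0000424688`) into its two parts. The names `ChainRef` and `parse_chain_ref` are hypothetical, and the sketch assumes the input carries no database prefix such as `UniProtKB:` and that chain identifiers are `PRO_` followed by digits, as in the example.

```python
from typing import NamedTuple


class ChainRef(NamedTuple):
    accession: str  # UniProtKB entry accession, e.g. "P03300"
    chain_id: str   # chain identifier, e.g. "PRO_0000424688"


def parse_chain_ref(text: str) -> ChainRef:
    """Split 'ACCESSION:PRO_<digits>' into accession and chain identifier."""
    accession, sep, chain_id = text.partition(":")
    looks_like_chain = chain_id.startswith("PRO_") and chain_id[4:].isdigit()
    if not sep or not accession or not looks_like_chain:
        raise ValueError(f"not an accession:chain identifier: {text!r}")
    return ChainRef(accession, chain_id)
```

For the Poliovirus example, `parse_chain_ref("P03300:PRO_0000424688")` yields the accession `P03300` and the chain `PRO_0000424688`; a bare accession raises `ValueError`.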

JervenBolleman · 2018-03-28T07:08:42Z

Just to confirm, the input from UniProt can be an Entry (primary) accession, An isoform accession or a Chain identifier?
The desired output are the gene names and species (NCBI tax id?) associated with this entry?

In which case the following SPARQL query will give just the information required and not anything extra. There are also no cases where this can provide a wrong answer.

All it requires is replacing _INPUT_ with the actual value that you want to search with:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX up:<http://purl.uniprot.org/core/>
SELECT
  ?protein
  ?organism
  (GROUP_CONCAT(?geneLabel;separator=" , ") AS ?geneLabels)
WHERE
{
  BIND('_INPUT_' AS ?input)
  BIND(
    IF(strstarts(?input, 'PRO_'),
      IRI(CONCAT('http://purl.uniprot.org/annotation/', ?input)),
      IF(contains(?input, '-'),
        IRI(CONCAT('http://purl.uniprot.org/isoform/', ?input)),
        IRI(CONCAT('http://purl.uniprot.org/uniprot/', ?input))
      )
    ) AS ?start)
  {
    ?protein a up:Protein .
    FILTER(sameTerm(?protein, ?start))
  } UNION {
    ?protein up:annotation ?annotation .
    FILTER(sameTerm(?annotation, ?start))
  } UNION {
    ?protein up:sequence ?sequence .
    FILTER(sameTerm(?sequence, ?start))
  }
  ?protein up:organism ?organism ;
           up:encodedBy/skos:prefLabel ?geneLabel .
}
GROUP BY ?protein ?organism
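
A minimal Python sketch of the two client-side steps the query implies: mirroring the nested `IF` that picks the start IRI, and substituting `_INPUT_` into the query text. The function names `start_iri` and `fill_query` are hypothetical. The character whitelist is an added assumption, there only so that a pasted value cannot close the quoted string in the query.

```python
import re

# Conservative whitelist (an assumption): letters, digits, "_" and "-".
_SAFE_INPUT = re.compile(r"[A-Za-z0-9_-]+")


def start_iri(value: str) -> str:
    """Mirror the query's BIND: chain, then isoform, else entry."""
    if value.startswith("PRO_"):
        return "http://purl.uniprot.org/annotation/" + value
    if "-" in value:
        return "http://purl.uniprot.org/isoform/" + value
    return "http://purl.uniprot.org/uniprot/" + value


def fill_query(template: str, value: str) -> str:
    """Replace the _INPUT_ placeholder after validating the value."""
    if not _SAFE_INPUT.fullmatch(value):
        raise ValueError(f"unexpected characters in identifier: {value!r}")
    return template.replace("_INPUT_", value)
```

For example, `start_iri("P12345-2")` gives `http://purl.uniprot.org/isoform/P12345-2`, matching the isoform branch of the query.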

The REST service at www.uniprot.org can provide the same answer (probably a bit faster), but there are some edge cases where it might produce the wrong answer for things that look like a UniProt accession but are not.

The main unexpected issue in both cases is that you might have more than one gene per UniProt record.

If you are happy with the risk of wrong answers, then:

http://www.uniprot.org/uniprot/?query=id:_INPUT_%20or%20isoform:_INPUT_%20or%20annotation:(id:_INPUT_)&format=tab&columns=genes,organism

Here you will need to break the values in the genes column on the first space to get the 'best' gene name.
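
A minimal Python sketch of post-processing such a tab-separated response, splitting the genes column on the first space as described. The function name `best_gene_and_organism` is hypothetical, the sample text is made up for illustration, and the sketch assumes the first line of the response is a header row. Whether the URL form above is still served is not verified here, so no request is made.

```python
from typing import List, Tuple


def best_gene_and_organism(tab_text: str) -> List[Tuple[str, str]]:
    """Return (best gene name, organism) for each data row."""
    rows = []
    for line in tab_text.splitlines()[1:]:  # assumes a header row
        if not line.strip():
            continue
        genes, _, organism = line.partition("\t")
        # The 'best' gene name is the text before the first space.
        best = genes.split(" ", 1)[0]
        rows.append((best, organism))
    return rows


# Made-up sample in the assumed two-column layout (genes, organism).
sample = "Gene names\tOrganism\nGENE1 GENE2\tExample organism\n"
print(best_gene_and_organism(sample))
```

An entry with several genes still yields only the first name here, which is the display compromise mentioned above.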

pgaudet · 2018-03-28T12:36:16Z

Just to confirm, the input from UniProt can be an Entry (primary) accession, An isoform accession or a Chain identifier?
The desired output are the gene names and species (NCBI tax id?) associated with this entry?

Yes - but hopefully a single gene name and species (ie a single entity).

Pascale

JervenBolleman · 2018-03-28T15:15:53Z

@pgaudet there is a 1:n relation of entry to gene name (in rare cases), and 1:1 from entry to species.

pgaudet · 2018-03-28T17:34:01Z

But there is a primary name?
(In any case that would be only for display, although I am not sure how that would be managed in the back-end.)

JervenBolleman · 2018-03-29T14:04:45Z

@pgaudet one uniprot entry can have multiple genes e.g. http://www.uniprot.org/uniprot/Q9ULZ0 (the same is true for neXtProt by the way)

See the results of this SPARQL query for the human Swiss-Prot entries:

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
SELECT
  ?protein
  (COUNT(?gene) AS ?genes)
FROM <http://sparql.uniprot.org/uniprot>
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           up:reviewed true ;
           up:encodedBy ?gene
}
GROUP BY ?protein
HAVING (COUNT(?gene) > 1)
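
A minimal Python sketch of the same group-and-filter step applied client-side to (protein, gene) pairs, for anyone who already has such pairs in hand. The function name `proteins_with_multiple_genes` is hypothetical, and the accessions and gene names in the usage example are placeholders, not real UniProt data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def proteins_with_multiple_genes(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, int]:
    """Count gene rows per protein and keep those with more than one."""
    counts = Counter(protein for protein, _gene in pairs)  # GROUP BY + COUNT
    return {p: n for p, n in counts.items() if n > 1}      # HAVING COUNT > 1


# Placeholder pairs for illustration only.
pairs = [("ACC1", "geneA"), ("ACC1", "geneB"), ("ACC2", "geneC")]
print(proteins_with_multiple_genes(pairs))
```

Like `COUNT(?gene)` in the query, this counts rows, so the input should not contain duplicate pairs.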

cmungall · 2018-03-29T17:41:57Z

Is there a way to query for whether a protein is a GCRP? These should generally have one gene

srengel · 2018-03-29T19:09:19Z

S. cerevisiae has several gene pairs (that are part of the GCRP) that share a UniProt identifier.
ex. HHT1 HHT2
ex. TEF1 TEF2

dustine32 · 2018-04-17T23:20:38Z

Hi @pmasson55 ! Would you ever need to annotate to any unreviewed UniProt ID?

pmasson55 · 2018-04-18T11:55:59Z

Hello,

It may happen but it's very rare. I usually try to gather annotations in the best representative genomes that are well annotated. So, I would say it represents less than 1% of the annotations in conventional GO.

JervenBolleman · 2018-04-18T17:18:58Z

It is very likely that if a Swiss-Prot curator annotates an unreviewed UniProt accession, it will become a reviewed UniProt accession by the next release.

dustine32 · 2018-04-20T18:20:30Z

OK great thanks @pmasson55 and @JervenBolleman ! So that means as long as the reviewed IDs are kept up-to-date and loaded into NEO, most of the desired UniProt IDs will resolve to a label and be available in the autocomplete.

vanaukenk · 2018-04-23T17:10:56Z

I'd like to add this to the agenda for our GO-CAM call on Wednesday, April 25th, just to make sure we're all clear on the allowed and expected behaviors for annotating to UniProtKB accessions. Thx.

pgaudet · 2019-10-08T23:41:10Z

@alexsign will look into providing the GPI for all the Swiss-Prot entries.

kltm · 2022-04-09T02:29:30Z

@pgaudet Is this now functionally a dupe of geneontology/neo#82?

pgaudet · 2022-04-20T15:38:19Z

Replaced by geneontology/neo#82

kltm added this to the wishlist milestone Aug 16, 2017

kltm added the enhancement label Aug 16, 2017

cmungall mentioned this issue Aug 23, 2017

Make changes to "Add annoton" wizard behavior in graph editor for restrictions and input #473

Closed

cmungall mentioned this issue Aug 24, 2017

Additional tests for expansion of class IDs to class IRIs geneontology/minerva#133

Merged

kltm added the bug (B: affects usability) label Aug 24, 2017

kltm changed the title ~~Allow annotation of any uniprot identifier~~ Allow annotation of any UniProtKB identifier Aug 24, 2017

cmungall mentioned this issue Aug 24, 2017

Allow use of valid identifiers not in autocomplete in simple annoton mode geneontology/noctua-form-legacy#2

Open

kltm mentioned this issue Aug 31, 2017

Autocomplete selection not being recognized in CC box of "Add annoton" #488

Closed

kltm mentioned this issue Sep 12, 2018

Form "has input" field does not recognise UniProt accessions, cannot add multiple inputs #583

Closed

vanaukenk added the annotation entities label May 27, 2020

pgaudet closed this as completed Apr 20, 2022

moghelab mentioned this issue Oct 28, 2022

Adding UniProt IDs? #799

Open
