Allow annotation of any UniProtKB identifier #471

Closed
3 tasks
cmungall opened this issue Aug 16, 2017 · 33 comments

Comments

@cmungall
Member

This could be one of a number of different ways

  1. allow pasting in of a UniProtKB ID in all the various relevant forms (annoton wizards, add individual, TableMode, complex editor), regardless of whether it is in neo
  2. load all of uniprot into neo
  3. change architecture such that external autocomplete services can be used and the client can inject TBox axioms (minimally rdfs:label)

It would seem 1 is easiest, so we should target this first. I believe we discussed this at the Geneva Noctua workshop and implemented it, but it appears to have gone.

to be discussed further today @kltm @vanaukenk @ukemi @pgaudet

@kltm
Member

kltm commented Aug 16, 2017

I'd disagree--the last two seem easier. Number two would work within what we have now, and I think that H left Minerva to take external services. Number one would require a widget-level coordination, contraction, and refactoring that would take quite a bit of effort.

@kltm added this to the wishlist milestone Aug 16, 2017
@cmungall
Member Author

To clarify, 1 and 2 are standalone solutions. solution 1 in isolation requires curators to memorize IDs both for entry and for recognizing IDs in the display.

solution 3 depends on solution 1. The curator still has to inject an ID, but once injected it will be visible via labels in the normal way

@cmungall
Member Author

cmungall commented Aug 23, 2017

Additional clarification: here any uniprotkb ID means both:

  • so-called canonical IDs, e.g. P12345
  • isoform IDs, e.g. P12345-2

Of the canonical IDs, we wish to prioritize GCRPs over non-GCRPs. This is certainly the case for any genome in the QfO set. For genomes outside this set, there may be no GCRP defined.

Note that not every GCRP is a Swiss-Prot ID (although the overlap is highest for human).

E.g.
A0A075B7H6 is in the TrEMBL subset but it is a GCRP
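
A minimal sketch of how a client could tell the two ID forms above apart before sending them on. The accession pattern is the one published in the UniProt help pages; the function name `classify_uniprot_id` and the return labels are hypothetical and not part of Noctua or Minerva. It says nothing about GCRP status, which cannot be read off the ID.

```python
import re

# UniProtKB accession pattern as published in the UniProt help pages.
ACCESSION = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
CANONICAL_RE = re.compile(ACCESSION)
# Assumption: an isoform ID is a canonical accession plus "-<number>".
ISOFORM_RE = re.compile(ACCESSION + r"-[0-9]+")


def classify_uniprot_id(identifier: str) -> str:
    """Return 'canonical', 'isoform', or 'unknown' for a bare accession."""
    if CANONICAL_RE.fullmatch(identifier):
        return "canonical"
    if ISOFORM_RE.fullmatch(identifier):
        return "isoform"
    return "unknown"


if __name__ == "__main__":
    for example in ("P12345", "P12345-2", "A0A075B7H6", "ABC"):
        print(example, classify_uniprot_id(example))
```

The check is purely syntactic: it only confirms that a string has the shape of an accession, not that the entry exists in UniProtKB.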

@cmungall
Member Author

If we went route 3 and plugged in external APIs for either AC or lookup, mygene is the cleanest, but may not be complete: biothings/mygene.info#19

@kltm
Member

kltm commented Aug 23, 2017

Your characterization of 1 is correct from a curator point of view. 1 is conceptually standalone, but would require either a fair amount of client code coordination or the creation of an uber autocomplete (on the wishlist) which then needs to be uniformly rolled out. I believe that we want to get this solved before that would likely be completed.
3 still requires 1's "any id" policy, but users would still possibly have the labels after minerva did a lookup, so that is an improvement.
2 is the cleanest in that we keep everything in-house, which also makes things like label updates, a possibility in 3, pretty much a non-issue.

I'm unsure about the use of mygene out of the box for 3. For example, if I put in "fox2" (http://mygene.info/v3/query?q=fox2), I get a single "close" result for human, whereas amigo gives 3 exact and one partial results and neo gives an SGD exact and a few partials.

@cmungall
Member Author

cmungall commented Aug 23, 2017

I'm unsure about the use of mygene out of the box

good point, it would be better for lookup given an ID rather than for AC

@cmungall
Member Author

Note to self: behavior with unknown IDs can be tested with M3ExpressionParserTest

@kltm
Member

kltm commented Aug 23, 2017

#471 (comment)
Does that include round-tripping, or just initial ingest?

@kltm
Member

kltm commented Aug 23, 2017

These issues seem to be the last ones we were looking at in this neighborhood:
geneontology/minerva#53
geneontology/minerva#58
We currently run with run-minerva-no-validation, which is: lookup yes and validation: no.

@kltm
Member

kltm commented Aug 23, 2017

As it currently stands, assuming you have entered something in the GP (enabled_by) entry in the "Add annoton" section, free entry/copying and pasting is allowed for all entries.
Everything that is not in the "Add annoton" section is restricted to a selection.

cmungall added a commit to geneontology/minerva that referenced this issue Aug 23, 2017
…d to server;

extend tests to include tests for no-literal-id checking mode.
Reword tests to use smaller ontology in test/resources.

Also add a test to ensure that under no circumstances can
non-CURIES like "ABC" be passed through as class IRIs
See #53, #58

The overall context here is checking we do not have issues when we
start to encourage SIB curators to paste in UniProt IDs
see geneontology/noctua#471
@cmungall
Member Author

To summarize:

Number 1 is 'good enough' for a first pass for SIB curators. It turns out that this is possible with the add-annoton wizard. I personally don't use this much, and I've observed a variety of behaviors from different users. I filed #473 which I think will make add-annoton more generally usable.

For the next pass, we should think about labels, but we have some breathing space now (would be happy to close this one and start a new ticket)

cmungall added a commit to geneontology/minerva that referenced this issue Aug 24, 2017
…ed to server;

    extend tests to include tests for no-literal-id checking mode.
    Reword tests to use smaller ontology in test/resources.

    Also add a test to ensure that under no circumstances can
    non-CURIES like "ABC" be passed through as class IRIs
    See #53, #58

    The overall context here is checking we do not have issues when we
    start to encourage SIB curators to paste in UniProt IDs
    see geneontology/noctua#471
@kltm changed the title from "Allow annotation of any uniprot identifier" to "Allow annotation of any UniProtKB identifier" Aug 24, 2017
@kltm
Member

kltm commented Aug 24, 2017

For number 1, any further action will take some time or hinge on #261 .

@thomaspd

I'm bringing @pmasson55 (Swiss-Prot curator!) into this discussion

@kltm
Member

kltm commented Sep 11, 2017

From the BioHackathon, @JervenBolleman also has an interest in this.

@pmasson55

After a few weeks of testing as a curator, the current solution of entering UniProt accessions works as a temporary solution, but it is really limiting for curation. Every model I create contains multiple viral proteins, and with only UniProtKB:XXXXX shown, the pathway becomes hard to follow and read after adding only a few proteins. For usability we need to have protein names.
Moreover, for viruses we need more than just the UniProtKB protein names. Since viruses encode proteins as polyproteins that get cleaved post-translationally, the entities we really need to annotate are the “chains”. For a single viral entry, there can be many different chains, each representing an individual functional unit. It would be good to be able to annotate these chains individually (by also loading them into neo from UniProtKB?).

@krchristie

I completely agree that we need more than just the UniProtKB protein name, not just in the canvas but also in the Annotation Preview. It defeats the point of the Annotation Preview to have to export the GPAD to be certain that the correct thing has been annotated.

Also, it would be nice if the autocomplete in the Add Annoton wizard worked the same way as in the Add Individual. Specifically Add Individual allowed me to use "Q8N4C6-7" as the text I entered to get autocompletion upon, while entering this text into the Add Annoton wizard only offered me ChEBI terms. In contrast, in the Add Annoton wizard, I had to type in "Nin Hs" in order to get the needed ID offered in the autocomplete list. This text worked equally well in the Add Individual option.

@thomaspd

I wanted to revisit this now, so that we can use the SAE (@tmushayahama ) to annotate any UniProt entry. I would suggest that we allow pasting of an identifier, as is done now. But then we can use the UniProt web services to programmatically retrieve the gene name, species etc. that we can then populate onto the gene product instance. So this information could then be viewed in Noctua. Using web services would be straightforward for the "canonical" UniProt entry, but @pmasson55, can you suggest how to get the relevant information for "chains" from the UniProt web services? What is an example identifier?

@pmasson55

Hi,
Concerning chains, Poliovirus is a good example: http://www.uniprot.org/uniprot/P03300
If you go to the PTM/processing part of the entry, there is a list of all the chains. For example,
Capsid protein VP0 has the chain identifier PRO_0000424688. In ProteintoGO, we usually type the name of the entry followed by the name of the chain. So if we want to annotate this VP0 chain specifically, we would have this identifier: P03300:PRO_0000424688. For the technical retrieval issue, I can ask Nicole or Jerven.
Hope this helps and answers your question.
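
A small sketch of how the entry:chain form described above could be split into its two parts. The names `ChainRef` and `parse_chain_ref` are hypothetical, and the sketch assumes a bare accession without a `UniProtKB:` prefix and that chain identifiers always start with `PRO_`, as in the example given.

```python
from typing import NamedTuple, Optional


class ChainRef(NamedTuple):
    accession: str           # the UniProtKB entry, e.g. "P03300"
    chain_id: Optional[str]  # e.g. "PRO_0000424688", or None if absent


def parse_chain_ref(text: str) -> ChainRef:
    """Split 'ACCESSION:PRO_xxx' into entry and chain parts."""
    accession, sep, chain_id = text.strip().partition(":")
    if not sep:
        # No colon: a plain entry accession with no chain.
        return ChainRef(accession, None)
    if not chain_id.startswith("PRO_"):
        raise ValueError(f"not a chain identifier: {chain_id!r}")
    return ChainRef(accession, chain_id)


if __name__ == "__main__":
    print(parse_chain_ref("P03300:PRO_0000424688"))
    print(parse_chain_ref("P03300"))
```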

@JervenBolleman

JervenBolleman commented Mar 28, 2018

Just to confirm, the input from UniProt can be an entry (primary) accession, an isoform accession or a chain identifier?
The desired output are the gene names and species (NCBI tax id?) associated with this entry?

In which case the following SPARQL query will give just the information required and not anything extra. There are also no cases where this can provide a wrong answer.

All it requires is replacing _INPUT_ with the actual value that you want to search with.

PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX up:<http://purl.uniprot.org/core/> 
SELECT 
	?protein 
	?organism 
	(GROUP_CONCAT(?geneLabel;separator=" , ") AS ?geneLabels)
WHERE
{
	BIND('_INPUT_' AS ?input)
	BIND(
      IF(strstarts(?input, 'PRO_'),
      	IRI(CONCAT('http://purl.uniprot.org/annotation/', ?input)), 
        IF (contains(?input, '-'), 
            IRI(CONCAT('http://purl.uniprot.org/isoform/', ?input)),
            IRI(CONCAT('http://purl.uniprot.org/uniprot/', ?input))
        )
      ) AS ?start)
  {
    ?protein a up:Protein .
    FILTER(sameTerm(?protein, ?start))
  } UNION {
    ?protein up:annotation ?annotation .
    FILTER(sameTerm(?annotation, ?start))
  }UNION {
    ?protein up:sequence ?sequence .
    FILTER(sameTerm(?sequence, ?start))
  }
  ?protein up:organism ?organism ;
           up:encodedBy/skos:prefLabel ?geneLabel .
}
GROUP BY ?protein ?organism
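
For readers who want to see the substitution step and the IRI selection outside SPARQL, here is a sketch in Python that mirrors the nested IF in the query above, using the IRI prefixes exactly as written there. The function names `start_iri` and `fill_query` are hypothetical, and the character whitelist is an added assumption meant to keep a pasted value from breaking out of the quoted SPARQL literal.

```python
import re

# Assumption: real inputs only contain letters, digits, "_" and "-".
SAFE_INPUT = re.compile(r"[A-Za-z0-9_-]+")


def start_iri(identifier: str) -> str:
    """Mirror the BIND(IF(...)) logic of the query in plain Python."""
    if identifier.startswith("PRO_"):
        return "http://purl.uniprot.org/annotation/" + identifier
    if "-" in identifier:
        return "http://purl.uniprot.org/isoform/" + identifier
    return "http://purl.uniprot.org/uniprot/" + identifier


def fill_query(template: str, identifier: str) -> str:
    """Replace the _INPUT_ placeholder after a basic sanity check."""
    if not SAFE_INPUT.fullmatch(identifier):
        raise ValueError(f"unexpected characters in {identifier!r}")
    return template.replace("_INPUT_", identifier)


if __name__ == "__main__":
    print(start_iri("PRO_0000424688"))
    print(fill_query("BIND('_INPUT_' AS ?input)", "P03300"))
```

Note that the query tests for a leading `PRO_`, so a combined value such as `P03300:PRO_0000424688` would need to be reduced to the chain part first.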

The REST service at www.uniprot.org can provide the same answer (probably a bit faster), but there are some edge cases where it might produce the wrong answer for things that look like a UniProt accession but are not.

The main unexpected issue in both cases is that you might have more than one gene per UniProt record.

if you are happy with the risk of wrong answers then

http://www.uniprot.org/uniprot/?query=id:_INPUT_%20or%20isoform:_INPUT_%20or%20annotation:(id:_INPUT_)&format=tab&columns=genes,organism

Here you will need to break the values in the genes column on the first space to get the 'best' gene name.
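
A sketch of the two steps just described: building the request URL from an identifier and taking the text before the first space in the genes column. The function names are hypothetical, the URL pattern is copied from the one above and may have changed on the UniProt side since this was written, and no request is actually sent.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# URL pattern as quoted in this thread; treat it as an assumption today.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(identifier: str) -> str:
    """Fill the identifier into the id / isoform / annotation query."""
    query = (
        f"id:{identifier} or isoform:{identifier} "
        f"or annotation:(id:{identifier})"
    )
    params = {"query": query, "format": "tab", "columns": "genes,organism"}
    # Keep ":", "(", ")" and "," readable; spaces become %20.
    return BASE_URL + "?" + urlencode(params, safe=":(),", quote_via=quote)


def best_gene_name(genes_column: str) -> Optional[str]:
    """Return the text before the first space, or None if the cell is empty."""
    stripped = genes_column.strip()
    if not stripped:
        return None
    return stripped.split(" ", 1)[0]


if __name__ == "__main__":
    print(build_query_url("P03300"))
    print(best_gene_name("GENE1 SYN1 SYN2"))  # made-up cell value
```

Splitting on the first space keeps only one name, so entries with more than one gene lose the others; whether that is acceptable depends on how the label is used.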

@pgaudet

pgaudet commented Mar 28, 2018

Just to confirm, the input from UniProt can be an entry (primary) accession, an isoform accession or a chain identifier?
The desired output are the gene names and species (NCBI tax id?) associated with this entry?

Yes - but hopefully a single gene name and species (ie a single entity).

Pascale

@JervenBolleman

@pgaudet there is a 1:n relation of entry to gene name (in rare cases), and 1:1 from entry to species.

@pgaudet

pgaudet commented Mar 28, 2018

But there is a primary name ?
(in any case that would be only for display, although I am not sure how that would be managed in the back-end).

@JervenBolleman

@pgaudet one UniProt entry can have multiple genes, e.g. http://www.uniprot.org/uniprot/Q9ULZ0 (the same is true for neXtProt, by the way)

See the results of this SPARQL query for the human Swiss-Prot entries:

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
SELECT 
	?protein 
	(COUNT(?gene) AS ?genes)
FROM <http://sparql.uniprot.org/uniprot>
WHERE	
{
	?protein a up:Protein ;
               up:organism taxon:9606 ;
               up:reviewed true ;
               up:encodedBy ?gene 
} 
GROUP BY ?protein
HAVING (COUNT(?gene) > 1)

@cmungall
Member Author

Is there a way to query for whether a protein is a GCRP? These should generally have one gene

@srengel

srengel commented Mar 29, 2018

S.cerevisiae has several gene pairs (that are part of GCRP) that share a UniProt identifier.
ex. HHT1 HHT2
ex. TEF1 TEF2

@dustine32
Contributor

Hi @pmasson55 ! Would you ever need to annotate to any unreviewed UniProt ID?

@pmasson55

Hello,

It may happen but it's very rare. I usually try to gather annotations in the best representative genomes that are well annotated. So, I would say it represents less than 1% of the annotations in conventional GO.

@JervenBolleman

JervenBolleman commented Apr 18, 2018 via email

@dustine32
Contributor

OK great thanks @pmasson55 and @JervenBolleman ! So that means as long as the reviewed IDs are kept up-to-date and loaded into NEO, most of the desired UniProt IDs will resolve to a label and be available in the autocomplete.

@vanaukenk

I'd like to add this to the agenda for our GO-CAM call on Wednesday, April 25th, just to make sure we're all clear on the allowed and expected behaviors for annotating to UniProtKB accessions. Thx.

@pgaudet

pgaudet commented Oct 8, 2019

@alexsign will look into providing the GPI for all the Swiss-Prot entries.

@kltm
Member

kltm commented Apr 9, 2022

@pgaudet Is this now functionally a dupe of geneontology/neo#82?

@pgaudet

pgaudet commented Apr 20, 2022

Replaced by geneontology/neo#82
