Include viruses and bacteria in NEO #77

Closed
pgaudet opened this issue Jan 27, 2022 · 26 comments

@pgaudet

pgaudet commented Jan 27, 2022

The file is here:

http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi

@kltm please let me know if you need more information.

Thanks, Pascale

@kltm
Member

kltm commented Jan 27, 2022

Noting that this is a ~350k-line uncompressed GPI 1.2 file. Including it is roughly a 25% increase in size.
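
To sanity-check figures like these, one could count the entity rows in a GPI file and compare them against the current entity count. This is a minimal sketch, assuming header and comment lines start with `!` as in GPI 1.2. The function names are hypothetical, and the ~1.4M baseline is not stated anywhere in this thread; it is simply back-calculated from "350k is about 25%".

```python
from typing import Iterable


def count_gpi_entities(lines: Iterable[str]) -> int:
    """Count data rows, skipping blank lines and '!' header/comment lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("!"))


def percent_increase(added: int, baseline: int) -> float:
    """Relative growth, in percent, from adding `added` entities to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * added / baseline


# Assumed numbers: ~350k new rows against a back-calculated ~1.4M baseline.
print(f"{percent_increase(350_000, 1_400_000):.0f}% increase")
```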

@kltm
Member

kltm commented Jan 27, 2022

I believe this could be hacked in the same way as the existing sars-cov-2 rules:

neo/Makefile

Lines 41 to 45 in 10210c1

# BUG: temporary hardcode until https://github.com/geneontology/go-site/issues/1431 is resolved and stable GPI URL is established
mirror/goa_sars-cov-2.gpi.gz:
	wget --no-check-certificate https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi -O mirror/goa_sars-cov-2.gpi && gzip mirror/goa_sars-cov-2.gpi
target/neo-goa_sars-cov-2.obo: mirror/goa_sars-cov-2.gpi.gz
	gzip -dc $< | ./gpi2obo.pl -s Scov2 -n sars-cov-2 > $@.tmp && mv $@.tmp $@

Probably best to time the addition after the next update cycle.
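
For readers less familiar with the Makefile idiom, the `gzip -dc $< | ./gpi2obo.pl ...` step streams the decompressed GPI straight into the converter. A rough Python equivalent of the streaming half might look like the sketch below. It assumes tab-separated rows and `!` comment lines; `iter_gpi_rows` is a hypothetical helper, not something in the NEO repo.

```python
import gzip
from typing import Iterator, List


def iter_gpi_rows(path: str) -> Iterator[List[str]]:
    """Stream rows from a gzipped GPI file without decompressing it to disk."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and '!' header/comment lines.
            if not line or line.startswith("!"):
                continue
            yield line.split("\t")
```

Because this is a generator, a ~350k-line file is never held in memory all at once, which is the same property the shell pipe gives the Perl script.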

@kltm
Member

kltm commented Jan 27, 2022

@pgaudet Would it be possible to get this as a compressed file from upstream like the others, for consistency and size?

@pgaudet
Author

pgaudet commented Jan 28, 2022

@alexsign Can you please provide this data as a compressed file like the other GPIs?

Thanks, Pascale

@alexsign

@pgaudet The file is gzipped now and will be compressed in future releases.

@kltm
Member

kltm commented Jan 28, 2022

@alexsign Great, thank you.

@kltm
Member

kltm commented Jan 29, 2022

@cmungall Part of the Makefile is running gpi2obo.pl, which would like arguments for species name and ontology id. If these are not provided, they essentially default to "generic". What would be good values in this case?

@cmungall
Member

Suffix with the taxon ID for now. Obviously this is not super-friendly, but we should progress incrementally. It's better to have some disambiguator than to have autocomplete flooded by 1000 rplNs.

When we rewrite my hacky old scripts from Perl to Python, we will fix the whole naming strategy.

@kltm
Member

kltm commented Jan 29, 2022

@cmungall Clarifying work: I'll extend gpi2obo.pl so that when a flag is on (for this case) the usual default value for species name is replaced with the taxon id.

What about for ontology id then?
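
The fallback described above could be sketched roughly as follows. This is not how `gpi2obo.pl` is implemented. It assumes GPI 1.2 column order (symbol in column 3, taxon in column 7, with values like `taxon:83333`), and the label format, function name, and `use_taxon_fallback` flag are all hypothetical.

```python
from typing import List

# 0-based positions of the GPI 1.2 columns used here (assumed layout).
SYMBOL_COL = 2  # column 3: DB_Object_Symbol
TAXON_COL = 6   # column 7: Taxon, e.g. "taxon:83333"


def disambiguated_label(
    row: List[str], species_name: str = "", use_taxon_fallback: bool = False
) -> str:
    """Build a label such as 'rplN <suffix>'; the exact format is an assumption."""
    symbol = row[SYMBOL_COL]
    if species_name:
        # Normal case: an explicit species name was supplied.
        return f"{symbol} {species_name}"
    if use_taxon_fallback:
        # No species name: keep only the numeric part of 'taxon:NNNN' as the suffix.
        taxon_id = row[TAXON_COL].split(":", 1)[-1]
        return f"{symbol} NCBITaxon:{taxon_id}"
    return symbol
```

With the flag on, two `rplN` rows from different taxa get different labels, which is the disambiguation asked for; with it off, behaviour for existing species is unchanged.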

@cmungall
Member

@alexsign: thanks for doing this, awesome!

Can you populate the properties field? I assume all should have db_subset=Swiss-Prot

@kltm: Should we not document this, together with the inclusion/exclusion criteria (I assume this is Swiss-Prot only), here: https://github.com/geneontology/go-site/blob/master/metadata/datasets/goa.yaml
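
Once the properties field is populated, a consumer could check the `db_subset=Swiss-Prot` assumption with something like the sketch below. It assumes the properties live in column 10 of a GPI 1.2 row as pipe-separated `key=value` pairs; both helper names are hypothetical.

```python
from typing import Dict, List

PROPERTIES_COL = 9  # 0-based index of column 10 (properties) in GPI 1.2


def parse_properties(field: str) -> Dict[str, List[str]]:
    """Parse 'key=value|key=value' into a dict of lists (keys may repeat)."""
    props: Dict[str, List[str]] = {}
    for pair in field.split("|"):
        if "=" not in pair:
            continue  # tolerate empty or malformed entries
        key, value = pair.split("=", 1)
        props.setdefault(key.strip(), []).append(value.strip())
    return props


def is_swiss_prot(row: List[str]) -> bool:
    """True when the row's properties include db_subset=Swiss-Prot."""
    if len(row) <= PROPERTIES_COL:
        return False  # properties column absent or not populated
    return "Swiss-Prot" in parse_properties(row[PROPERTIES_COL]).get("db_subset", [])
```

Rows from the current file, where the field is still empty, would simply return `False` rather than raise.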

@cmungall
Member

> What about for ontology id then?

uniprot_reviewed_virus_bacteria.{obo,owl}

@cmungall
Member

Just want to record the implications here:

  • unreviewed proteins will not be included, even if the bacterium has a reference proteome
  • when we switch to the reference species set for GO synced with panther, the majority of these will not be present (unless we extend reference species to include these ~5k species)

This is fine, no discussion necessary, just recording this here in case there is any confusion later

@kltm
Member

kltm commented Jan 29, 2022

@cmungall Yes, I thought about that, but:

  • we can't really fill in species and taxon, which seemed odd to me (although still technically passing schema validation)
  • I'm a little nervous about adding something oddly named and not normally handled into a main GO pipeline metadata file, as I don't completely understand all of the exceptions and handling rules around goa (and there are a lot)
  • it's only for NEO at this point, so bolting it in like we did for sars-cov2 seemed expedient

I'm happy to go a more "normal" path as well, but would need to move a little slower.

@cmungall
Member

> we can't really fill in species and taxon, which seemed odd to me (although still technically passing schema validation)

well we could list all 6k taxa in the yaml, but I agree this is suboptimal

> I'm a little nervous about adding something oddly named and not normally handled into a main GO pipeline metadata file as I'm not completely understanding all of the exceptions and handling rules around goa (and there are a lot)
> it's only for NEO at this point, so bolting it in like we did for sars-cov2 seemed expedient

totally fair, let's just proceed for now

kltm added a commit that referenced this issue Jan 29, 2022
…filter feature to script to fill in taxon id in some cases; work on #77
@kltm
Member

kltm commented Jan 29, 2022

@cmungall A locally tested PR that may be able to close this issue is here: #79.
If taken, this would go live ~next Friday, unless people want this sooner. Tagging @pgaudet @vanaukenk

@alexsign

@kltm @cmungall I added db_subset=Swiss-Prot in the code. The updated file should be available in a week's time with the new GOA release data. Please let me know if you need it sooner.

@kltm
Member

kltm commented Feb 2, 2022

Currently running full post-merge test.

@vanaukenk

@kltm - do we need to do any testing on the Noctua autocompletes?

@kltm
Member

kltm commented Feb 2, 2022

From a discussion w/ @cmungall yesterday, I wanted to try and get a file product that could be eyeballed. A major concern was that this could flood out other things (a 25% increase in size with ~350k entities). While I'm testing the product production now, we could defer rolling this out until there is somebody available to take a look at it live.
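
One quick way to eyeball the "flooding" concern before looking at the live autocomplete would be to count how often each symbol recurs in the new file. This is a sketch only: it assumes the symbol sits in column 3 of each parsed GPI row, and `most_repeated_symbols` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_repeated_symbols(
    rows: Iterable[List[str]], top: int = 10
) -> List[Tuple[str, int]]:
    """Return up to `top` symbols that occur more than once, most frequent first."""
    # Column 3 (index 2) is assumed to hold the symbol; shorter rows are ignored.
    counts = Counter(row[2] for row in rows if len(row) > 2)
    return [(symbol, n) for symbol, n in counts.most_common(top) if n > 1]
```

Symbols near the top of this list are the ones most likely to crowd out other hits, so they are reasonable candidates for manual spot checks.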

@vanaukenk

It'd be good to have the Swiss-Prot curators test for ids they'd expect to curate, and I'm happy to do other id testing just in case.

@pgaudet
Author

pgaudet commented Feb 3, 2022

Where can this be tested? Is this on Noctua or on some test server?

@kltm
Member

kltm commented Feb 3, 2022

@pgaudet This would be tested by running the pipeline, looking at the results, then applying them to the autocomplete server (reverting if we don't like it). That said, this is currently blocked by #80.

@pgaudet
Author

pgaudet commented Feb 8, 2022

Just confirmed with @pmasson55 that the Swiss-Prot reviewed set is OK (for the record, in response to #77 (comment)).

@kltm
Member

kltm commented Feb 9, 2022

After talking to @vanaukenk, we'll be temporarily switching back to ecocyc to get a NEO release out before continuing work.

@pgaudet
Author

pgaudet commented Feb 23, 2022

This is now a dupe of #82
