Load all Swiss-Prot entries in NEO #82
Full URL is
See my latest comments in geneontology/go-site#1431. I think loading the reviewed file for SARS-CoV-2 is a bad idea, as we lose the important proteins that do the important work. I suspect this problem would remain for other viruses too; I have no idea how we would do useful annotation of them without entries for the polyproteins. We have fixed the problem for SARS-CoV-2 with my curated file. However, if we are serious about doing other viruses that have similar genomes, then I think we need to programmatically extract the correct entries. This would be a project:
@pmasson55 says that this is not typical for all viruses. With Patrick, we should look at which viruses need this special processing. Thanks, Pascale
Hi All, I was talking with Peter D'Eustachio about this and have two comments that hopefully will be of use.
Hi All, Concerning Swiss-Prot viral entries, I would say this concerns about 10% of the total viral entries (roughly 1,500 out of approximately 15,000). They are not as complex as the SARS-CoV-2 entries: most of the time there is only one polyprotein, not a long and a short version of the same polyprotein. So I think that if we can handle protein processing (being able to annotate chains inside polyproteins), we cover 99% of the viruses.
Okay, picking up work from #77 here, where there are a few more details. Noting that the working branch is now: https://github.com/geneontology/neo/tree/issue-82-add-all-reviewed . The current blocking issue is that, while we were hoping to have a drop-in replacement work, there is some issue with the owltools Solr loader that is preventing the load from completing. Essentially, after somewhere between ~500k and ~1M documents have loaded, we get an error like:
After running this several times, the error usually occurs between two and three hours into what should be an approximately five-hour load, given the number of documents. Note that these initial numbers are from #77, where the full number of documents would have been 1520942 (compared to our current load of 1168920 documents). Given that we know Solr can typically handle many more documents (in the main GO pipeline) and is being loaded in batches anyway, it feels to me unlikely that it is Solr choking directly. I suspect that there is some kind of memory-handling issue or incorrectly passed parameter to the owltools loader that eventually causes memory thrashing and then the error. As a next step, I'll rerun this and make note of memory and disk usage as it approaches the limit. If the problem is not in owltools directly, this should still give us information about where to look next.
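For reference, sampling memory and disk usage during the rerun could be done with a small script along these lines. This is a minimal sketch, assuming `psutil` is installed on the loading host and that the owltools PID is passed on the command line; the script name and arguments are made up for illustration.

```python
#!/usr/bin/env python3
"""Periodically log RSS and disk usage for a running process (e.g. owltools).

Usage (hypothetical): python3 monitor_load.py <pid> [interval_seconds]
"""
import sys
import time

import psutil  # third-party; assumed to be available on the loading host


def main():
    pid = int(sys.argv[1])
    interval = float(sys.argv[2]) if len(sys.argv) > 2 else 30.0
    proc = psutil.Process(pid)
    while True:
        try:
            rss_gb = proc.memory_info().rss / 1024 ** 3
        except psutil.NoSuchProcess:
            break  # process finished (or died); stop sampling
        disk_pct = psutil.disk_usage("/").percent
        print(f"{time.strftime('%H:%M:%S')}\trss={rss_gb:.2f}GB\tdisk={disk_pct:.1f}%",
              flush=True)
        time.sleep(interval)


if __name__ == "__main__":
    main()
```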
Talking to @pgaudet, we'll be asking upstream to filter out the SARS-CoV-2 entries.
Okay, I've managed to spend a little time with this and have some observations:
All told (unless I just happened to be stupendously lucky this time), I think the issue is that owltools can do one or the other with the memory given, but will eventually thrash out if it tries to do both. I think the most expedient next steps would be:
Okay, I'm trying to just add the uniprot_reviewed file back in to what we have (bumping ecocyc out for the moment). With that, we're still having problems like we've had before (i.e. #80) with:
Taking a look at the files:
@balhoff I'm betting there will be a lot of collisions like this, and getting them on a one-by-one basis will take a long time. Is there a way to just have these clobber or skip, or do we need to write a filter script to take care of them up front?
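(For illustration, a skip-style filter of the sort asked about here could be a small first-one-wins pass over the concatenated GPI files, reporting everything it drops. The sketch below assumes GPI 1.2 column order, with the DB in column 1 and the object ID in column 2; the input file names are whatever gets passed on the command line.)

```python
#!/usr/bin/env python3
"""First-one-wins dedup over a set of GPI files, reporting skipped collisions.

Pass the GPI files as arguments; the deduplicated stream goes to stdout and a
report of skipped lines goes to stderr.
"""
import sys


def dedup(paths):
    seen = set()
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("!") or not line.strip():
                    sys.stdout.write(line)  # keep header/comment/blank lines
                    continue
                cols = line.rstrip("\n").split("\t")
                if len(cols) < 2:
                    sys.stdout.write(line)  # malformed line; pass through
                    continue
                gp_id = f"{cols[0]}:{cols[1]}"  # e.g. UniProtKB:P0DTC2
                if gp_id in seen:
                    print(f"SKIP collision: {gp_id} ({path})", file=sys.stderr)
                    continue
                seen.add(gp_id)
                sys.stdout.write(line)


if __name__ == "__main__":
    dedup(sys.argv[1:])
```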
I suggest making a new issue for this and coordinating with Alex. For the goa_human vs goa_human_isoform issue: the uniprot files are a bit different from the rest, and the GPI specs are AFAIK silent on the matter of how a set of GPs should be partitioned across files, but I would strongly recommend making it a requirement that, for GPIs loaded into NEO, uniqueness is guaranteed. For uniprot this means EITHER
My preference would be for 2. I suggest a uniprot-specific one-line script up front that reports and filters any line in goa_X_isoform that does not follow the expected pattern. For uniprot_reviewed, I think the easiest thing is to filter out any already-covered taxon.
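Along the lines of that last suggestion, a taxon-based pre-filter for uniprot_reviewed might look roughly like the sketch below. It assumes GPI 1.2 column order (taxon in column 7, as `taxon:NNNN`) and takes the already-covered taxa as a plain text list, one per line; the file names in the example are made up.

```python
#!/usr/bin/env python3
"""Drop uniprot_reviewed GPI lines whose taxon is already covered elsewhere.

Example (hypothetical file names):
    python3 filter_taxa.py covered_taxa.txt < uniprot_reviewed.gpi > uniprot_reviewed.filtered.gpi
"""
import sys


def main():
    with open(sys.argv[1], encoding="utf-8") as fh:
        covered = {line.strip() for line in fh if line.strip()}  # e.g. "taxon:9606"
    kept = dropped = 0
    for line in sys.stdin:
        if line.startswith("!"):
            sys.stdout.write(line)  # pass GPI header lines through
            continue
        cols = line.rstrip("\n").split("\t")
        taxon = cols[6] if len(cols) > 6 else ""  # GPI 1.2: taxon is column 7
        if taxon in covered:
            dropped += 1
            continue
        kept += 1
        sys.stdout.write(line)
    print(f"kept {kept}, dropped {dropped}", file=sys.stderr)


if __name__ == "__main__":
    main()
```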
Apparently a lot of overlap in the first pass with species we already have:
Will bolt this in and see if there are any collisions left.
… we don't get files we need deleted before we use them (specifically datasets.json); for geneontology/neo#82
Breakup pipeline command from
… things that are not present in the datasets.json; work on #82
Okay, I think we're getting a little further along with the collisions. Added an additional manual filter list to pick up the things that are "manual" in the Makefile (not datasets.json). This is temporary; seeing if that can get us through the owltools conversion.
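To make the two sources of filter entries concrete, a sketch of combining taxa pulled from datasets.json with a hard-coded "manual" list is below. The JSON key names (`datasets`, `taxon`) and the manual placeholder values are guesses for illustration only, not the actual schema or Makefile contents.

```python
#!/usr/bin/env python3
"""Build a combined exclusion list from datasets.json plus a manual list.

The JSON structure and the manual entries below are illustrative guesses.
"""
import json
import sys

# Things handled "manually" in the Makefile rather than via datasets.json
# (placeholder values only).
MANUAL_TAXA = {"taxon:0000001", "taxon:0000002"}


def main():
    with open(sys.argv[1], encoding="utf-8") as fh:
        meta = json.load(fh)
    from_datasets = {
        ds.get("taxon") for ds in meta.get("datasets", []) if ds.get("taxon")
    }
    for taxon in sorted(from_datasets | MANUAL_TAXA):
        print(taxon)


if __name__ == "__main__":
    main()
```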
@pgaudet @vanaukenk
To see how this looks, I've put it onto amigo-staging: The load we currently have, for comparison, is here:
Thanks for the update @kltm |
I don't understand where these links go. Did you want to show entities? I don't know how to get to entities from there.
@vanaukenk |
@vanaukenk My understanding for the moment was that we were going to start out with the taxon ID and then iterate from there. @pgaudet Those links go to the two NEO loads, as seen through the AmiGO ontology interface: one for the newer load we're experimenting with and one for the current load. Remember to remove the "GO" filter to see all the entities available.
Shout out to @cmungall for finding this. In the newest NEO load (and maybe some of these are in the older one too), at the bottom is a list of the kinds of entities that were not correctly converted to CURIEs (1350337 in total). Some of those are probably not practically important, as nobody would be curating to them, but some seem important:
Samples from the complete list:
@balhoff @cmungall Is this something where owltools needs a different CURIE map? A post filter? Or is this better handled by circling back to #83?
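As a quick way to see which prefixes would need new entries, the offending IDs could be scanned against a prefix map before the conversion. The sketch below is illustrative only: the prefix map shown is a tiny made-up subset, not whatever JSON-LD context or CURIE map owltools actually uses.

```python
#!/usr/bin/env python3
"""Report which ID prefixes are missing from a CURIE prefix map.

Reads one identifier per line on stdin (e.g. "FOO:12345") and tallies the
prefixes that have no entry in the map.
"""
import sys
from collections import Counter

PREFIX_MAP = {  # illustrative entries only
    "UniProtKB": "http://identifiers.org/uniprot/",
    "MGI": "http://identifiers.org/mgi/",
    "SGD": "http://identifiers.org/sgd/",
}


def main():
    missing = Counter()
    for line in sys.stdin:
        ident = line.strip()
        if not ident or ":" not in ident:
            continue
        prefix = ident.split(":", 1)[0]
        if prefix not in PREFIX_MAP:
            missing[prefix] += 1
    for prefix, count in missing.most_common():
        print(f"{prefix}\t{count}")


if __name__ == "__main__":
    main()
```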
Now have geneontology/go-annotation#4105 and #88 to trace entities. |
From managers' discussion, this is now live. |
Hi @kltm
The 'ultimate' goal is to have all Swiss-Prot (reviewed) entries. The file is in the same GOA FTP directory; it's called
uniprot_reviewed.gpi.gz
The bacteria and viruses file was to test a smaller set, but we'll need everything. This file is about double the size of uniprot_reviewed_virus_bacteria.gpi.gz.
Thanks, Pascale