Add functionality to add more identifiers to GeneProducts in polish #53

Closed
5 tasks done
Tracked by #58
GwennyGit opened this issue Jan 19, 2023 · 5 comments · Fixed by #88

Comments

@GwennyGit
Collaborator

GwennyGit commented Jan 19, 2023

Feature:
Extend the polish module to add KEGG gene, UniProt as well as RefSeq identifiers.

Implementation:

  • For bacterial strains where the genome is in the KEGG database:
    • User needs to provide the KEGG organism code of the strain through the config.yaml (e.g. for Staphylococcus haemolyticus JCSC1435 this would be sha).
      → From this code & the locus tags of the model the KEGG gene identifier can be obtained
    • Access KEGG API to retrieve UniProt identifiers
    • Additionally retrieve the ncbiprotein identifier from KEGG and compare it with the one in the model
      → No mapping from KEGG Gene to NCBI Protein identifier possible
    • Add all identifiers to the model
  • Add RefSeq identifiers from the Reference Sequence .gff file (available on the Genome Assembly site of NCBI; for an example implementation see get_ids_from_gff)
  • Maybe also get the name & locus tag from the .gff file 🤔 → Not possible for lab_strain
  • Get UniProt identifiers with RefSeq identifiers
    → Mapping with u = UniProt(); u.mapping('RefSeq_Protein', 'UniProtKB', refseq_id) yields no usable results: for a protein of Staphylococcus haemolyticus JCSC1435, the identifier it returns is not the same as the one obtained from the KEGG mapping.
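The KEGG steps above could be sketched roughly as follows. This is only an illustration, not the code in refineGEMs: it assumes the KEGG REST `conv` operation at `https://rest.kegg.jp`, the helper names (`kegg_gene_id`, `parse_conv`, `fetch_conv`) are hypothetical, and the locus tag and accessions in the usage lines are placeholders rather than real database entries.

```python
from __future__ import annotations

from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # public KEGG REST endpoint


def kegg_gene_id(organism_code: str, locus_tag: str) -> str:
    """Combine the organism code and a locus tag into a KEGG gene identifier."""
    return f"{organism_code}:{locus_tag}"


def parse_conv(text: str) -> dict[str, list[str]]:
    """Parse the output of a KEGG `conv` request.

    Each non-empty line holds a KEGG gene identifier and a prefixed
    external identifier, separated by a tab character.
    """
    mapping: dict[str, list[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        source, target = line.split("\t", 1)
        # Drop the database prefix (e.g. "up:") from the external identifier.
        mapping.setdefault(source, []).append(target.split(":", 1)[1])
    return mapping


def fetch_conv(target_db: str, kegg_ids: list[str]) -> dict[str, list[str]]:
    """Query KEGG for cross-references (requires network access)."""
    url = f"{KEGG_REST}/conv/{target_db}/{'+'.join(kegg_ids)}"
    with urlopen(url) as response:
        return parse_conv(response.read().decode())


# Offline usage with a made-up response (placeholder accessions):
sample_response = "sha:SH0001\tup:P00001\nsha:SH0001\tup:P00002\n"
uniprot_ids = parse_conv(sample_response)
```

With network access, `fetch_conv("uniprot", [kegg_gene_id("sha", locus_tag)])` would request the UniProt cross-references for one gene, and `fetch_conv("ncbi-proteinid", ...)` the NCBI Protein ones for the comparison with the model.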
@GwennyGit GwennyGit added the enhancement New feature or request label Jan 19, 2023
@GwennyGit GwennyGit added the refactoring changes in the code functionality label Jan 31, 2023
@famosab
Collaborator

famosab commented Feb 1, 2023

We already stumbled upon this in another issue (relates to #52). I think that what is already implemented in genecomp might cover many of the parts that we have here.

@famosab
Collaborator

famosab commented Feb 1, 2023

Overview on Identifiers of interest:

  • GenBank: Locus tag & NCBI Protein ID
  • RefSeq: Locus tag, Old locus tag (== GenBank locus tag, not always present) & RefSeq ID
  • GenBank GFF == GFF of e.g. chromosome accession website -> this will hold all info on chromosome and plasmids
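To make the overview concrete, here is a minimal sketch of collecting these identifiers from a GFF3 file. It is not the existing get_ids_from_gff; the function names are hypothetical, and it assumes the attribute keys `locus_tag`, `old_locus_tag` and `protein_id` as they commonly appear in NCBI GFF3 files. The example rows are invented placeholders.

```python
from __future__ import annotations

from urllib.parse import unquote


def parse_attributes(column: str) -> dict[str, str]:
    """Split the ninth GFF3 column ('key=value;key=value') into a dict."""
    attributes: dict[str, str] = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 values may be URL-encoded
    return attributes


def ids_by_locus_tag(gff_lines) -> dict[str, dict[str, str]]:
    """Collect old locus tag and protein ID per locus tag from GFF3 rows."""
    table: dict[str, dict[str, str]] = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a regular feature row
        attributes = parse_attributes(columns[8])
        locus_tag = attributes.get("locus_tag")
        if locus_tag is None:
            continue
        entry = table.setdefault(locus_tag, {})
        for key in ("old_locus_tag", "protein_id"):
            if key in attributes:  # the old locus tag is not always present
                entry[key] = attributes[key]
    return table


# Invented example rows (placeholder identifiers, not real records):
example_lines = [
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-X_0001;locus_tag=X_0001;old_locus_tag=OLD0001",
    "chr1\tProtein Homology\tCDS\t1\t900\t.\t+\t0\tID=cds-WP_000000001.1;locus_tag=X_0001;protein_id=WP_000000001.1",
]
identifiers = ids_by_locus_tag(example_lines)
```

Keying the table by locus tag merges the gene row (which carries the old locus tag) with the CDS row (which carries the protein ID) into one entry per gene.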

@GwennyGit
Collaborator Author

If cv_ncbiprotein is adjusted to retrieve identifiers from multiple databases, or another function is created for that, the identifiers could be stored in a dictionary or table that maps them to their database, and add_curie_set could also be used here.
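One possible shape for such a dictionary is sketched below. This is an assumption for illustration only: `to_uris` is a hypothetical helper and does not reproduce the signature of add_curie_set, the keys are identifiers.org namespace prefixes, and the identifier values are placeholders.

```python
from __future__ import annotations

IDENTIFIERS_ORG = "https://identifiers.org"


def to_uris(ids_by_db: dict[str, list[str]]) -> list[str]:
    """Turn a database-to-identifiers mapping into identifiers.org URIs."""
    uris: list[str] = []
    for prefix in sorted(ids_by_db):  # sorted for a reproducible order
        for identifier in ids_by_db[prefix]:
            if identifier:  # skip empty or missing entries
                uris.append(f"{IDENTIFIERS_ORG}/{prefix}:{identifier}")
    return uris


# Annotations for one GeneProduct, keyed by database (placeholder values):
gene_annotations = {
    "kegg.genes": ["sha:SH0001"],
    "uniprot": ["P00001"],
    "refseq": ["WP_000000001.1"],
}
uris = to_uris(gene_annotations)
```

Keeping the database as the key means a single loop can hand every identifier to whichever function writes the annotation, regardless of which source it came from.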

GwennyGit added a commit that referenced this issue Jun 22, 2023
- Added filtering for VMH to BiGG identifier mapping
- Added filtering for NaN values in annotations
- Added additional annotation of GeneProducts with KEGG, UniProt & RefSeq identifiers
- Improved handling of RefSeq identifiers
@GwennyGit
Collaborator Author

GwennyGit commented Jun 29, 2023

As I had already written a script to enhance the GeneProduct annotations, I decided not to create a table for the annotation enhancement and instead integrated my script into refineGEMs. The adjustments were tested on two VMH models created with AGORA2 and three models that were previously polished with refineGEMs' former polish function. The results look good.

@GwennyGit
Collaborator Author

The changes described here are all merged into dev-2, which is the branch for refineGEMs 2.0.0. Thus, this issue can be closed.
