Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Branch review table #6648

Draft
wants to merge 55 commits into
base: master
Choose a base branch
from
Draft

Create Branch review table #6648

wants to merge 55 commits into from

Conversation

matentzn
Copy link
Member

@matentzn matentzn commented Sep 15, 2023

@cmungall Here is where SPARQL really suxx. If you can provide me a SQL query to deal with this specific issue, I will start switching to OAK for table reports.

UPDATE:

Using oaklib:

reports/mondo-curation-branch-review.tsv:
	python src/scripts/branch_review.py create-review-table -o $@ -f obsoletion_terms.tsv -B branch_ids.tsv 

OR

reports/mondo-curation-branch-review.tsv:
	python src/scripts/branch_review.py create-review-table -o reports.tsv -f obsoletion_terms.tsv -b MONDO:0005151

Pseudo-code

all_branches = [....list of all branches ...]
branch_dict = { branch_id: branch_id.get_descendants() }

def create_review_table(branch_id, obsoletion_candidate_ids):
    affected_mondo_ids = id.get_direct_children() for id in obsoletion_candidate_ids #these are the classes that we need to review.
    
    for affected_id in affected_mondo_ids:
        # FOR THIS ANALYSIS ONLY CONSIDER MONDO IDS, NOT BFO ETC AS PARENTS
        parents = affected_id.get_direct_parents()
        parents.remove(obsoletion_candidate_ids)
        parents_inside_branch = intersection(parents, branch_dict[branch_id])
        parents_outside_branch = parents.remove(parents_inside_branch)
        ancestors = affected_id.get_ancestors()
        ancestors_inside_branch = intersection(ancestors, branch_dict[branch_id])
        ancestors_outside_branch = parents.remove(ancestors_inside_branch)
        
        if no ancestors: affected_status = ORPHANED
        elif no ancestors_inside_branch = LEFT_THIS_BRANCH
        elif ???? = LEFT_OTHER_BRANCH <- I DONT KNOW HOW TO WRITE THIS BUT LETS DO THIS AFTER EVERYTHING ELSE. FOR NOW JUST: IF ANY(ancestor in all_branches) 
        elif parents_inside_branch = RETAINS_PARENT

        else : RETAINS_ANCESTOR
        
        data = [affected_id, labels(parents_inside_branch), labels(parents_outside_branch), labels(ancestors_inside_branch), labels(ancestors_outside_branch), affected_status] 

def labels(curies):
    return [ f"{c.label} ({c})" for c in curies ]

@hrshdhgd hrshdhgd self-assigned this Sep 15, 2023
Copy link
Member Author

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some, but not all comments, as it's hard to read the code now that the affected entity variable has been introduced.

Can you rename all the variables to say affected_entity instead of obsoletion candidate where it makes sense? And also, instead of "filtered" say how, like "non_obsolete_parents". I will Check again tomorrow

src/scripts/branch_review.py Outdated Show resolved Hide resolved

if len(affected_mondo_ids) > 0:
for obsoletion_candidate in affected_mondo_ids:
parent_of_obsolete_candidate = get_parent_from_relations(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parent_of_obsolete_candidate = get_parent_from_relations(
parents_of_affected = get_parent_from_relations(

for x in parent_of_obsolete_candidate
if x not in affected_mondo_ids
]
filtered_parent_of_obsolete_candidate = [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same variAble name as above?

filtered_parent_of_obsolete_candidate = [
x
for x in parent_of_obsolete_candidate
if x not in affected_mondo_ids
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this filter is needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for lines below. I think my understanding is faulty here and may need clarification.

filtered_ancestor_of_obsolete_candidate = [
x
for x in ancestors_of_obsolete_candidate
if x not in affected_mondo_ids and x.startswith("MONDO:")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not innobsoletion candidates

parents_inside_branch = [
i
for i in parents_inside_branch
if i not in filtered_parent_of_obsolete_candidate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter Not needed

parents_outside_branch = [
i
for i in parents_outside_branch
if i not in filtered_parent_of_obsolete_candidate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

@hrshdhgd
Copy link
Collaborator

I committed just refactors to variable names and some list <=> set

@sabrinatoro
Copy link
Collaborator

What is needed:

  • report direct children of the obsoletion candidates
  • you can focus on one branch at a time. Branches can be found here. We are starting with MONDO:0005151 - endocrine system disorder
  • since obsoletion candidates can be at different level within a branch, we need to know whether the direct parents (and ancestors) are obsoletion candidates
    - option 1: remove all parent/ancestor that are obsoletion candidates from the report
    - option 2: rename all obsoletion candidates to include "TO BE OBSOLETED" at the beginning of their label. (this should be done for the whole ontology, so we know that a term might actually loose more than one parent)
  • Please find examples here, included column name suggestions.

@sabrinatoro
Copy link
Collaborator

Reviewing [this spreadsheet]:

  • the obsoletion candidate (colE ) and their children (colB) are not in the branch review (colA)
    For example: obsoletion candidate MONDO:0013742 (colE) is not in the MONDO:0005151 (colA) branch.
    BUT all the children of the obsoletion candidate (ColB) are reported (e.g. there are 2 children for obsoletion candidate MONDO:0013742)

  • disregarding the "in the branch" or "out of the branch" (since this is incorrect because of what I mentioned above), some direct parents are missing. For example:
    MONDO:0008964, should have a total of 4 direct parents: MONDO:0000824 ('congenital diarrhea'); MONDO:0000249 ('secretory diarrhea'), MONDO:0015178 (OBSOLETION CANDIDATE); MONDO:0000824 ('congenital diarrhea')

Note that the first example I look at, the obsoletion candidate MONDO:0008347 was already obsoleted... I am assuming this is not going to pose any problem since we will run this after each obsoletion round. But I still want to mention it to make sure.

@twhetzel
Copy link
Collaborator

twhetzel commented Oct 2, 2023

@sabrinatoro in the second bullet point above:

- disregarding the "in the branch" or "out of the branch" (since this is incorrect because of what I mentioned above), some direct parents are missing. For example:
MONDO:0008964, should have a total of 4 direct parents: MONDO:0000824 ('congenital diarrhea'); MONDO:0000249 ('secretory diarrhea'), MONDO:0015178 (OBSOLETION CANDIDATE); MONDO:0000824 ('congenital diarrhea')

the term MONDO:0000824 ('congenital diarrhea') is duplicated

src/ontology/mondo.Makefile Outdated Show resolved Hide resolved

# Column J: Affected Status = status
status = ""
if len(other_parents_in_branch) > 0 and all(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(other_parents_in_branch) > 0 and all(
if len(other_parents_in_branch) > 0 and any(

This seems to be the last issue in the table!

Copy link
Member Author

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice!

src/scripts/branch_review.py Outdated Show resolved Hide resolved
src/scripts/branch_review.py Outdated Show resolved Hide resolved
if len(other_parents_in_branch) > 0 and any(
" - TO_BE_OBSOLETED" not in parent for parent in other_parents_in_branch
):
status = STAYS_IN_THE_BRANCH
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sabrinatoro in my opinion, this status should be renamed to: "Retains parent in branch" for clarity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably also want a code that expresses if the class retains an ancestor in he branch, and check that right after this one, so maybe this is a better naming scheme?:

  1. Retains parent in branch (currently STAYS_IN_THE_BRANCH)
  2. Retains ancestor in branch (currently missing)
  3. Leaves branch but retains parent in other branch (currently LEAVES_THE_BRANCH)
  4. Orphaned (leave as is)

@sabrinatoro
Copy link
Collaborator

@hrshdhgd and I reviewed spreadsheets created from different files (included relaxed, inferred, relaxed/inferred). Here is what we found out:

  • "relaxed" does not include parents that are inferred.
  • "inferred" (alone or in combination with "relax") always removes direct parents that are also ancestors.

Solution (that we are testing) is to create 3 spreadsheets:

  • spreadsheet 1: created using the "relaxed" file. --> give ALL direct parent that are asserted
  • spreadsheet 2: created using the "inferred" file --> give all inferred parent
  • spreadsheet 3 = FINAL/combined spreadsheet : combination of spreadsheet 1 and 2 (with removing all redundancy)

note: the "affected status" (orphan, leaves branch, stays in branch) should be calculated based on the information in the "combined" spreadsheet.

@twhetzel
Copy link
Collaborator

@hrshdhgd @matentzn - this is the PR I intended to ask when it will be merged. I think there may also be another separate PR for this Branch Review table as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants