Synonym Sync: MONDO:GENERATED edge cases #745

Open
joeflack4 opened this issue Jan 10, 2025 · 10 comments
Assignees
Labels
bug Something isn't working needs discussion

Comments

@joeflack4
Contributor

joeflack4 commented Jan 10, 2025

Overview & background

Trish discovered an "-added" synonym, "gvhd -" (icd11.foundation:437372167), that does not appear in the source.

That's because this is a MONDO:GENERATED synonym. But the robot templates don't include that information in the synonym_type column because we had decided we did not want to import MONDO:GENERATED synonymType annotations.

There's also the issue that the generated "gvhd -" does not appear quite as we would hope. There are two issues with it:

  1. It has a trailing -
  2. It's an acronym, but fix-labels-with-brackets.ru doesn't realize this, so it lowercases it.

Sub-tasks

From Trish - The first sub-task is to understand why this is happening, what the synonym should look like, and the various options that can be used to fix the issue. There are many steps along the way of processing ICD11 where this could be resolved. I don't see either of these as the way to go at this point.

- [ ] 1. Consider adding an additional, curator-only column, is_mondo_generated
    - Alternatively, we could just import MONDO:GENERATED as a synonym type, but we previously decided not to.
- [ ] 2. Decide whether curation is needed for MONDO:GENERATED synonyms before we import them
    - If not, decide whether to do anything programmatically to prevent these kinds of garbled synonyms from appearing.

joeflack4 self-assigned this Jan 10, 2025
joeflack4 added the bug and needs discussion labels Jan 10, 2025
@twhetzel
Contributor

@joeflack4 the pipeline should not be generating a synonym "gvhd -".

@joeflack4
Contributor Author

@twhetzel I'll add this to our next Thursday agenda. But if we have time, maybe we can talk about it at the tech call. IDK if Nico will be here, but he designed fix-labels-with-brackets.ru.

@joeflack4
Contributor Author

We have several options:
a. Just stop doing this query and altering these synonyms; import them as is.
b. Import them as is, but also mutate them, but fix the mutation.

The original synonym is: GVHD - [graft-versus-host disease]

We need to remove patterns of - [TEXT], then remove patterns of just [TEXT]. This could involve updating the existing query, or having another query run right before it.
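
One literal reading of that two-step removal, as a minimal Python sketch. The real change would live in a SPARQL update such as fix-labels-with-brackets.ru, which is not reproduced in this thread, so both regexes and the function name `strip_bracketed` are assumptions made for illustration.

```python
import re

# Assumption: "- [TEXT]" means a dash followed by a bracketed run.
DASH_BRACKETS = re.compile(r"\s*-\s*\[[^\]]*\]")
# Assumption: "[TEXT]" means any remaining bracketed run.
BARE_BRACKETS = re.compile(r"\s*\[[^\]]*\]")


def strip_bracketed(label: str) -> str:
    """Remove '- [TEXT]' first, then any leftover '[TEXT]' (hypothetical helper)."""
    label = DASH_BRACKETS.sub("", label)
    label = BARE_BRACKETS.sub("", label)
    # Collapse whitespace left behind by the removals; case is left untouched.
    return re.sub(r"\s+", " ", label).strip()


for label in (
    "GVHD - [graft-versus-host disease]",
    "GVH - [graft-versus-host] disease",
):
    print(repr(label), "->", repr(strip_bracketed(label)))
```

Under this reading the first label becomes "GVHD" and the second becomes "GVH disease". Whether dropping the bracketed expansion entirely is desirable is a separate question.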

Note that currently the regex for this matches only strings ending with this pattern.
So, currently, GVHD - [graft-versus-host disease] gets mutated, but GVH - [graft-versus-host] disease does not.
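
To make the end-anchored behaviour concrete, here is a rough Python imitation of what is reported in this thread. It is not the actual query: the regex `TRAILING_BRACKETS` and the function `mutate_label` are stand-ins, and the lowercasing step is included only because the issue description says the acronym ends up lowercased.

```python
import re

# Assumption: stand-in for the query's pattern. The trailing "$" means a
# bracketed run only matches when it sits at the very end of the label.
TRAILING_BRACKETS = re.compile(r"\s*\[[^\]]*\]$")


def mutate_label(label: str) -> str:
    """Imitate the reported behaviour (hypothetical, not the real query)."""
    if not TRAILING_BRACKETS.search(label):
        return label  # no trailing "[TEXT]": label is left alone
    return TRAILING_BRACKETS.sub("", label).lower()


for label in (
    "GVHD - [graft-versus-host disease]",
    "GVH - [graft-versus-host] disease",
):
    print(repr(label), "->", repr(mutate_label(label)))
```

The first label loses its trailing bracketed text and is lowercased, leaving the dangling dash in "gvhd -". The second is untouched because its bracketed text is followed by " disease".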

@twhetzel
Contributor

The query needs to stay because it does fix a value for one of the external ontologies that is processed. As is, it's causing problems for icd11.foundation.

@matentzn
Member

I would probably try to improve the query a bit. GVHD - [graft-versus-host disease] is really two synonyms. From a user perspective I'd want to see GVHD as an acronym; maybe this can somehow be done in SPARQL with a conditional bind.

@twhetzel
Contributor

@matentzn given the synonyms for Graft-versus-host disease which are:

  • GVHD - [graft-versus-host disease]
  • graft-versus-host reaction or disease
  • GVH - [graft-versus-host] disease
  • GVH - [graft-versus-host] reaction

are you suggesting that any synonym that contains " - " should be split and treated as two synonyms?

As compared to modifying or using a different sparql query than fix-labels-with-brackets.ru to convert GVH - [graft-versus-host] disease to GVH - graft-versus-host disease?
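
The second option, keeping the text and dropping only the brackets, could look roughly like this in Python. This is a sketch of the transformation only; `unwrap_brackets` is a hypothetical name, and the production fix would presumably be a SPARQL update rather than Python.

```python
import re


def unwrap_brackets(label: str) -> str:
    """Drop the square brackets but keep the text inside them (hypothetical helper)."""
    return re.sub(r"\[([^\]]*)\]", r"\1", label)


print(unwrap_brackets("GVH - [graft-versus-host] disease"))
print(unwrap_brackets("GVHD - [graft-versus-host disease]"))
```

Because the pattern is not anchored to the end of the string, it handles both the mid-label and the trailing bracket cases, and it leaves casing alone.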

@matentzn
Member

I have not really thought this through; I am thinking of it from an NLP perspective more than anything else.

I would want to see the following synonyms

  • GVHD
  • graft-versus-host disease
  • graft-versus-host reaction
  • graft-versus-host reaction or disease

I don't so much see the purpose for

  • GVHD - [graft-versus-host disease]
  • GVH - [graft-versus-host] reaction
  • GVH - [graft-versus-host] disease

Where both the acronym and the spelt out name are combined.

That said, there is value in keeping the synonyms exactly as the source has them, so there are trade-offs. The most comprehensive approach would be to add these as synonyms:

  • GVHD (credit only with MONDO:ICD11, and make sure its synonym type is acronym)
  • graft-versus-host disease (credit only with MONDO:ICD11)
  • graft-versus-host reaction (credit only with MONDO:ICD11)
  • graft-versus-host reaction or disease (credit only with MONDO:ICD11)
  • GVHD - [graft-versus-host disease] (credit with icd11.foundation:437372167, as this is the synonym "as is")
  • GVH - [graft-versus-host] reaction (credit with icd11.foundation:437372167)
  • GVH - [graft-versus-host] disease (credit with icd11.foundation:437372167)
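
A rough Python sketch of how the list above might be derived from the source labels. Everything here is an assumption made for illustration: the `ACRONYM - [expansion] tail` pattern, the rule that the acronym is only emitted when the brackets close the label, the `Synonym` record, and the plain string "acronym" standing in for a real synonym type. Only the two credit values come from this thread.

```python
import re
from typing import List, NamedTuple, Optional


class Synonym(NamedTuple):
    label: str
    source: str
    synonym_type: Optional[str] = None


DERIVED_SOURCE = "MONDO:ICD11"  # credit for derived synonyms, per the list above

# Assumed shape: "ACRONYM - [expansion]" optionally followed by more text.
PATTERN = re.compile(r"^(?P<acronym>[A-Z]+) - \[(?P<expansion>[^\]]+)\](?P<tail>.*)$")


def expand_synonym(label: str, source_id: str) -> List[Synonym]:
    """Keep the label as-is and add derived synonyms where the pattern applies."""
    out = [Synonym(label, source_id)]  # the synonym exactly as the source has it
    m = PATTERN.match(label)
    if m is None:
        return out
    tail = m.group("tail").strip()
    if tail:
        # e.g. "GVH - [graft-versus-host] reaction" -> "graft-versus-host reaction"
        out.append(Synonym(f"{m.group('expansion')} {tail}", DERIVED_SOURCE))
    else:
        # The brackets close the label, so the acronym stands for the whole name.
        out.append(Synonym(m.group("acronym"), DERIVED_SOURCE, "acronym"))
        out.append(Synonym(m.group("expansion"), DERIVED_SOURCE))
    return out


for syn in expand_synonym("GVHD - [graft-versus-host disease]", "icd11.foundation:437372167"):
    print(syn)
```

Labels that do not match the pattern, such as "graft-versus-host reaction or disease", come back unchanged with the source credit only.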

Just suggestions. If you feel this is too much right now, just drop the query from ICD11 processing?

@twhetzel
Contributor

Yes. As a first pass, to make sure the "-added" synonyms from the Synonym Sync pipeline can be included in the February Mondo release, I would like to keep this simple. If removing fix-labels-with-brackets.ru from icd11 processing resolves the issue and does not create new issues for icd11, I prefer to go with that option.

For a future Mondo release (March?), exploring whether another query would provide a more meaningful and comprehensive list of synonyms for ICD11 and/or any other ontology also sounds fine. That will take a little more coding work and more curation review, so I would rather not push for this to happen this week. I do want to make sure that the "-added" synonym content is ready overall to be added into Mondo for the February release.

joeflack4 changed the title from "Synonym Sync: MONDO:GENERATED" to "Synonym Sync: MONDO:GENERATED exceptions" Jan 13, 2025
joeflack4 changed the title from "Synonym Sync: MONDO:GENERATED exceptions" to "Synonym Sync: MONDO:GENERATED edge cases" Jan 13, 2025
@joeflack4
Contributor Author

I like the solution of just removing fix-labels-with-brackets.ru for icd11.foundation for now:

There are probably other patterns to consider too, not just [TEXT] and - [TEXT], which by themselves already give us plenty to consider.

@twhetzel
Contributor

twhetzel commented Jan 17, 2025

After reviewing the "ADDED" synonyms file, there were some ICD11 formatting issues where text like (OMIM ######) was added after the synonym. We discussed this in the ICD11 Slack and it was decided this MIM information can be removed from the synonym label. This PR contains a query and addition of the query to the processing of ICD11 to fix this synonym label formatting. The branch Updated-bugfix-icd11-generated-synonyms is intended to be merged into bugfix-icd11-generated-synonyms.
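
As an illustration of the kind of clean-up described, here is a minimal Python sketch. It is not the query from the PR: the regex assumes the suffix is literally "(OMIM" followed by a six-digit number at the end of the label, and both the helper name and the sample label are made up.

```python
import re

# Assumption: the unwanted suffix looks like " (OMIM 123456)" at the end of the label.
OMIM_SUFFIX = re.compile(r"\s*\(OMIM \d{6}\)\s*$")


def strip_omim_suffix(label: str) -> str:
    """Remove a trailing '(OMIM ######)' annotation (hypothetical helper)."""
    return OMIM_SUFFIX.sub("", label)


# Made-up label, used only to exercise the pattern.
print(strip_omim_suffix("hypothetical syndrome (OMIM 123456)"))
```

Labels without the suffix pass through unchanged.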

The full build from PR #756 is #757. If we agree the build looks good, then #756 can be merged into #749.

Examples of changed synonym labels after processing ICD11 with the query:

3 participants