Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changes to maintenance script / ensembl push #1646

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

andrewkern
Copy link
Member

@andrewkern andrewkern commented Jan 7, 2025

This PR addresses (closes?) #1521.

think i've finally cleaned this up a bit. this PR overhauls the update-genome-data portion of the maintenance script such that:

  • species whose genomes were manually created are skipped
  • species where the ensembl Rest API returns mismatching chromosome names with the current release are skipped
  • detailed logging warnings of these issues are created, along with a final report of which species were skipped in the update and why.

the final report looks like this:

=== Species Update Summary ===
2025-01-07 14:34:46,744 [91493] WARNING  maint: The following species were not updated:
2025-01-07 14:34:46,744 [91493] WARNING  maint:   - AnoCar (Ensembl ID: anolis_carolinensis):
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Existing chromosomes: ['1', '2', '3', '4', '5', '6', 'LGa', 'LGb', 'LGc', 'LGd', 'LGf', 'LGg', 'LGh', 'MT']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '2', '3', '4', '5', '6', 'a', 'b', 'c', 'd', 'f', 'g', 'h']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - CanFam (Ensembl ID: canis_lupus_familiaris):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - DroSec (Ensembl ID: drosophila_sechellia): Manually created genome data file
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - GasAcu (Ensembl ID: gasterosteus_aculeatus):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9', 'MT', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['I', 'II', 'III', 'IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XIX', 'XV', 'XVI', 'XVII', 'XVIII', 'XX', 'XXI', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - HelMel (Ensembl ID: heliconius_melpomene):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: []
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PanTro (Ensembl ID: pan_troglodytes):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PonAbe (Ensembl ID: pongo_abelii):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - StrAga (Ensembl ID: streptococcus_agalactiae_GCA_001017915):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1']
2025-01-07 14:34:46,746 [91493] WARNING  maint:     New chromosomes: []

as only some of the species are being updated here, we should arguably move from writing out ensembl_info.py file which has release data, to the release being held in a species specific slot in each genome_data.py file of the catalog. I'm happy to make that edit if people agree.

@andrewkern
Copy link
Member Author

an additional thought here-- I'm not doing any checking to see if chromosome lengths have changed. We may want to do that as well

@petrelharp
Copy link
Contributor

Okay, over in #1521 we said
Screenshot from 2025-01-07 15-12-21
and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

What about the other species, for which maintenance failed - what do you think we should do there?

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

@petrelharp
Copy link
Contributor

petrelharp commented Jan 7, 2025

And I guess we need to also change the tests:


=================================== FAILURES ===================================
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_CanFam.TestSpeciesData object at 0x7f34c0c12110>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "canis_familiaris"
E       AssertionError: assert 'canis_lupus_familiaris' == 'canis_familiaris'
E         
E         - canis_familiaris
E         + canis_lupus_familiaris
E         ?     ++++++

tests/test_CanFam.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_GasAcu.TestSpeciesData object at 0x7f34c0ba1780>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "9307941"
E       AssertionError: assert 'gasterosteus_aculeatus' == '9307941'
E         
E         - 9307941
E         + gasterosteus_aculeatus

tests/test_GasAcu.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_StrAga.TestSpeciesData object at 0x7f34c09b0d30>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "NA"
E       AssertionError: assert 'streptococcu...GCA_001017915' == 'NA'
E         
E         - NA
E         + streptococcus_agalactiae_GCA_001017915

tests/test_StrAga.py:12: AssertionError
_________________________________ test_version _________________________________

    def test_version():
        release = stdpopsim.catalog.ensembl_info.release
>       assert release == 103
E       assert 113 == 103

I'll go have a look at the tests now an update them in the PR

@andrewkern andrewkern mentioned this pull request Jan 8, 2025
11 tasks
@andrewkern
Copy link
Member Author

Okay, over in #1521 we said Screenshot from 2025-01-07 15-12-21 and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon

What about the other species, for which maintenance failed - what do you think we should do there?

If we include this property, nothing will need to done, other than maintain the current build version I think.

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change be the chromosome lengths? (besides the chromosome names)

that's correct.

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

Copy link

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.76%. Comparing base (a344665) to head (35b8c71).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
- Coverage   99.84%   99.76%   -0.09%     
==========================================
  Files         133      132       -1     
  Lines        4569     4596      +27     
  Branches      472      472              
==========================================
+ Hits         4562     4585      +23     
- Misses          3        5       +2     
- Partials        4        6       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andrewkern
Copy link
Member Author

a simple way forward for including species specific build info would look like this:

  • the data dict in each species genome_data.py would be extended with two slots
    • build_type which would take the values manual, ensembl, or NCBI
    • ensembl_build which would take the values NA or <ensembl_build_version_id, the later which would be auto-populated by the maintenance script

including a build_type property in the dict would also allow for an easy check as to which genomes the maintenance script should attempt to upload

@petrelharp
Copy link
Contributor

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

@petrelharp
Copy link
Contributor

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

@andrewkern
Copy link
Member Author

andrewkern commented Jan 8, 2025

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

actually i was suggesting that it would go not in the Genome object but in genome_data.data like here, but yes that would eventually get propagated to Genome

@andrewkern
Copy link
Member Author

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I supposed you mean None?

This is a good suggestion-- genome_data.data could hold this build dict. As yes I mean None

@petrelharp
Copy link
Contributor

The data dictionary in genome_data.py is not visible to end users; it is back-end for how we set up the Genome and Species objects that are user-visible, and are what we should be discussing. Probably everything in that dict is then mirrored as an attribute in Genome, so the distinction is moot, though?

@andrewkern
Copy link
Member Author

okay @petrelharp -- added the attributes we talked about the Genome API. This led to lots of downstream changes as we discussed. Sorry for the large PR

@@ -177,6 +183,10 @@ def black_format(code):


def ensembl_stdpopsim_id(ensembl_id):
if ensembl_id == "canis_lupus_familiaris":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps insert a comment explaining what this is doing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, explaining why this is necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris", so where does ensembl_id equal "canis_familiaris"? Just missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out

maintenance/main.py Outdated Show resolved Hide resolved
data = self.ensembl_client.get_genome_data(ensembl_id)

# Preserve existing assembly source or default to "ensembl"
if genome_data_path.exists():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, this duplicates the code above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and, it seems like it's doing the same thing, almost, as the code above?)


if existing_chroms != new_chroms:
logger.warning(
f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."
f"Skipping {sps_id} ({ensembl_id}): chromosome names in existing genome_data.py "
"do not match chromosomes in current ensembl release. \n"
f"Not in Ensembl: {existing_chroms - new_chroms}\n"
f"Not in existing genome_data.py: {new_chroms - existing_chroms}."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see - this gets printed below. Never mind?

data["assembly_build_version"] = None

# Check if existing genome data exists and compare chromosome names
if genome_data_path.exists():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?

for species_id, eid in embl_ids:
try:
result = writer.write_genome_data(eid)
if result is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how that function could return None?

@@ -391,7 +520,7 @@ def add_species(ensembl_id, force):
"""
writer = DataWriter()
writer.add_species(ensembl_id.lower(), force=force)
writer.write_ensembl_release()
# writer.write_ensembl_release()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

"CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]},
"CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]},
"CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]},
"CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a mistake to get rid of the MT synonym?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these are hella weird chromosome names anyhow, so meh?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the way we have things set up, synonyms are automatically generated by what goes in from ensembl. they must have removed it

"MT": 1,
}
_ploidy = {str(i): _species_ploidy for i in range(1, 30)}
_ploidy.update({"X": _species_ploidy, "MT": 1})

_chromosomes = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes is no longer used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's right. forgot to delete these lines. will do

@@ -123,10 +123,11 @@
)
)

_genome = stdpopsim.Genome(
chromosomes=_chromosomes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes no longer used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

_mutation_rate = 4e-9
_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)}
_mutation_rate_data["MT"] = (
_mutation_rate # note this is likely incorrect but consistent with current setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could have this note on nearly all the species, right? no need here?

@@ -137,10 +146,11 @@
)
)

_genome = stdpopsim.Genome(
chromosomes=_chromosomes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes no longer used?

@@ -46,11 +45,24 @@
)
)

_genome = stdpopsim.Genome(
chromosomes=_chromosomes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_chromosomes no longer used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case it's used for chrom ids

@@ -45,6 +45,10 @@ class Genome:
:vartype assembly_name: str
:ivar assembly_accession: The ID of the genome assembly accession.
:vartype assembly_accession: str
:ivar assembly_source: The source of the genome assembly data.
Copy link
Contributor

@petrelharp petrelharp Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:ivar assembly_source: The source of the genome assembly data.
:ivar assembly_source: The source of the genome assembly data (for instance, "ensembl").
Use "manual" if manually entered.

@@ -45,6 +45,10 @@ class Genome:
:vartype assembly_name: str
:ivar assembly_accession: The ID of the genome assembly accession.
:vartype assembly_accession: str
:ivar assembly_source: The source of the genome assembly data.
:vartype assembly_source: str
:ivar assembly_build_version: The version of the genome assembly build.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:ivar assembly_build_version: The version of the genome assembly build.
:ivar assembly_build_version: The version of the genome assembly build, or `None`.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. Some newly extraneous code, and some clarification needed, I think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants