changes to maintenance script / ensembl push #1646

andrewkern · 2025-01-07T22:48:16Z

This PR addresses (closes?) #1521.

I think I've finally cleaned this up a bit. This PR overhauls the update-genome-data portion of the maintenance script such that:

- species whose genomes were manually created are skipped
- species whose chromosome names in the existing genome_data.py do not match those returned by the Ensembl REST API for the current release are skipped
- detailed warnings about these issues are logged, along with a final report of which species were skipped in the update and why

the final report looks like this:

=== Species Update Summary ===
2025-01-07 14:34:46,744 [91493] WARNING  maint: The following species were not updated:
2025-01-07 14:34:46,744 [91493] WARNING  maint:   - AnoCar (Ensembl ID: anolis_carolinensis):
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,744 [91493] WARNING  maint:     Existing chromosomes: ['1', '2', '3', '4', '5', '6', 'LGa', 'LGb', 'LGc', 'LGd', 'LGf', 'LGg', 'LGh', 'MT']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '2', '3', '4', '5', '6', 'a', 'b', 'c', 'd', 'f', 'g', 'h']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - CanFam (Ensembl ID: canis_lupus_familiaris):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - DroSec (Ensembl ID: drosophila_sechellia): Manually created genome data file
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - GasAcu (Ensembl ID: gasterosteus_aculeatus):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9', 'MT', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['I', 'II', 'III', 'IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XIX', 'XV', 'XVI', 'XVII', 'XVIII', 'XX', 'XXI', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - HelMel (Ensembl ID: heliconius_melpomene):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '3', '4', '5', '6', '7', '8', '9']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: []
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PanTro (Ensembl ID: pan_troglodytes):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - PonAbe (Ensembl ID: pongo_abelii):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:     New chromosomes: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', 'X']
2025-01-07 14:34:46,745 [91493] WARNING  maint:   - StrAga (Ensembl ID: streptococcus_agalactiae_GCA_001017915):
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Chromosome names mismatch.
2025-01-07 14:34:46,745 [91493] WARNING  maint:     Existing chromosomes: ['1']
2025-01-07 14:34:46,746 [91493] WARNING  maint:     New chromosomes: []
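
For readers who want the gist without opening the diff, here is a minimal sketch of the skip-and-report logic described above. It is not the code in maintenance/main.py: the function names `check_species` and `report_skipped`, and the idea of passing the source as a plain string, are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("maint")


def check_species(source, existing_chroms, new_chroms):
    """Return a skip reason (str), or None if the species can be updated.

    `source` is a hypothetical marker for how the genome data was created.
    """
    if source == "manual":
        return "Manually created genome data file"
    if set(existing_chroms) != set(new_chroms):
        return (
            "Chromosome names mismatch.\n"
            f"Existing chromosomes: {sorted(existing_chroms)}\n"
            f"New chromosomes: {sorted(new_chroms)}"
        )
    return None


def report_skipped(skipped):
    """Log a final summary; `skipped` maps species id -> skip reason."""
    if not skipped:
        return
    logger.warning("The following species were not updated:")
    for sps_id, reason in sorted(skipped.items()):
        for i, line in enumerate(reason.splitlines()):
            prefix = f"  - {sps_id}: " if i == 0 else "    "
            logger.warning(prefix + line)


# StrAga values are taken from the report above; DroSec is flagged as manual.
skipped = {}
for sps_id, (source, old, new) in {
    "DroSec": ("manual", [], []),
    "StrAga": ("ensembl", ["1"], []),
}.items():
    reason = check_species(source, old, new)
    if reason is not None:
        skipped[sps_id] = reason
report_skipped(skipped)
```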

As only some of the species are being updated here, we should arguably move from writing out an ensembl_info.py file that holds the release data to storing the release in a species-specific slot in each genome_data.py file of the catalog. I'm happy to make that edit if people agree.

andrewkern · 2025-01-07T22:50:00Z

an additional thought here-- I'm not doing any checking to see if chromosome lengths have changed. We may want to do that as well
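
A length check along these lines might look like the sketch below. It assumes each chromosome entry has the `{"length": ..., "synonyms": [...]}` shape seen in the catalog's genome_data.py files; the function name `changed_lengths` and the toy values in the usage lines are made up.

```python
def changed_lengths(existing, new):
    """Return {chrom: (old_length, new_length)} where lengths differ.

    Both arguments map chromosome name -> {"length": int, ...}.
    Only chromosomes present in both mappings are compared.
    """
    return {
        name: (existing[name]["length"], new[name]["length"])
        for name in existing.keys() & new.keys()
        if existing[name]["length"] != new[name]["length"]
    }


# Toy values, not real assembly data.
old = {"1": {"length": 100, "synonyms": []}, "2": {"length": 50, "synonyms": []}}
cur = {"1": {"length": 100, "synonyms": []}, "2": {"length": 55, "synonyms": []}}
print(changed_lengths(old, cur))
```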

petrelharp · 2025-01-07T23:20:23Z

Okay, over in #1521 we said

and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

What about the other species, for which maintenance failed - what do you think we should do there?

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change the chromosome lengths (besides the chromosome names)?

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
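
If it is useful to track this over time, the same check can be scripted. The sketch below parses grep-style output like the lines above and groups release numbers by species; the function name `releases_by_species` is hypothetical, and it assumes paths follow the `stdpopsim/catalog/<species>/...` layout.

```python
import re
from collections import defaultdict

RELEASE_RE = re.compile(r"ftp\.ensembl\.org/pub/release-(\d+)/")


def releases_by_species(grep_lines):
    """Map species id -> set of Ensembl release numbers found in its files."""
    found = defaultdict(set)
    for line in grep_lines:
        path, _, rest = line.partition(":")
        match = RELEASE_RE.search(rest)
        parts = path.split("/")
        # Assumes stdpopsim/catalog/<species>/<file>; skip anything else.
        if match and len(parts) > 2:
            found[parts[2]].add(int(match.group(1)))
    return dict(found)


lines = [
    'stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"',
    'stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"',
]
print(releases_by_species(lines))
```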

petrelharp · 2025-01-07T23:21:27Z

And I guess we need to also change the tests:


=================================== FAILURES ===================================
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_CanFam.TestSpeciesData object at 0x7f34c0c12110>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "canis_familiaris"
E       AssertionError: assert 'canis_lupus_familiaris' == 'canis_familiaris'
E         
E         - canis_familiaris
E         + canis_lupus_familiaris
E         ?     ++++++

tests/test_CanFam.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_GasAcu.TestSpeciesData object at 0x7f34c0ba1780>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "9307941"
E       AssertionError: assert 'gasterosteus_aculeatus' == '9307941'
E         
E         - 9307941
E         + gasterosteus_aculeatus

tests/test_GasAcu.py:12: AssertionError
_______________________ TestSpeciesData.test_ensembl_id ________________________

self = <tests.test_StrAga.TestSpeciesData object at 0x7f34c09b0d30>

    def test_ensembl_id(self):
>       assert self.species.ensembl_id == "NA"
E       AssertionError: assert 'streptococcu...GCA_001017915' == 'NA'
E         
E         - NA
E         + streptococcus_agalactiae_GCA_001017915

tests/test_StrAga.py:12: AssertionError
_________________________________ test_version _________________________________

    def test_version():
        release = stdpopsim.catalog.ensembl_info.release
>       assert release == 103
E       assert 113 == 103

I'll go have a look at the tests now and update them in the PR.

andrewkern · 2025-01-08T18:46:13Z

Okay, over in #1521 we said and IIUC what you're doing here is step (1), as well as updating the maintenance script to be able to do this reasonably? But, this is not actually updating all the genomes, right? Seems like you should be recording here which genomes have been updated - at least in a comment, so we can next easily come along and do step (2)?

Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon

What about the other species, for which maintenance failed - what do you think we should do there?

If we include this property, nothing will need to be done, other than maintaining the current build version, I think.

Hm - nothing of consequence seems to be changed here, other than adding two Y chromosomes, right? All the genome lengths are the same, and we don't use assembly_name for anything? Is the only thing of consequence that could change the chromosome lengths (besides the chromosome names)?

that's correct.

Interestingly, I see that we are maybe already using different ensembl releases:

$ grep ensembl.org stdpopsim/**/*.py
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/DroMel/annotations.py:        "http://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/HomSap/annotations.py:        "ftp://ftp.ensembl.org/pub/release-104/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"
stdpopsim/catalog/PhoSin/annotations.py:        "https://ftp.ensembl.org/pub/release-110/"

codecov · 2025-01-08T18:59:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.76%. Comparing base (a344665) to head (35b8c71).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
- Coverage   99.84%   99.76%   -0.09%     
==========================================
  Files         133      132       -1     
  Lines        4569     4596      +27     
  Branches      472      472              
==========================================
+ Hits         4562     4585      +23     
- Misses          3        5       +2     
- Partials        4        6       +2

andrewkern · 2025-01-08T19:05:01Z

a simple way forward for including species specific build info would look like this:

the data dict in each species genome_data.py would be extended with two slots
- build_type which would take the values manual, ensembl, or NCBI
- ensembl_build which would take the values NA or <ensembl_build_version_id>, the latter of which would be auto-populated by the maintenance script

including a build_type property in the dict would also allow for an easy check as to which genomes the maintenance script should attempt to upload
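
As a rough illustration of the proposal, the sketch below shows a data dict with the two extra slots and the check the maintenance script could run on it. Every value in the dict is a placeholder, and `should_update` is a hypothetical helper, not an existing function in the script.

```python
VALID_BUILD_TYPES = {"manual", "ensembl", "NCBI"}

# Hypothetical genome_data.py contents; all values are placeholders.
data = {
    "assembly_name": "example_assembly",
    "build_type": "ensembl",
    "ensembl_build": 113,
    "chromosomes": {"1": {"length": 1000, "synonyms": []}},
}


def should_update(data):
    """Return True only for genomes that the script pulls from Ensembl."""
    build_type = data["build_type"]
    if build_type not in VALID_BUILD_TYPES:
        raise ValueError(f"unknown build_type: {build_type!r}")
    return build_type == "ensembl"


print(should_update(data))
```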

petrelharp · 2025-01-08T22:48:38Z

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

petrelharp · 2025-01-08T22:54:54Z

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I suppose you mean None?
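
To make the first option concrete, a NamedTuple version could look roughly like this. The class name `Build` and the choice of an integer version are assumptions for the sake of the example.

```python
from typing import NamedTuple, Optional


class Build(NamedTuple):
    """Hypothetical container for build information on a genome."""

    type: str  # e.g. "manual" or "ensembl"
    version: Optional[int] = None  # None when there is no versioned build


manual = Build("manual")
ensembl = Build("ensembl", 113)
print(manual, ensembl._asdict())
```

A single attribute like this keeps the door open for other sources without adding a new slot per source.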

andrewkern · 2025-01-08T23:46:32Z

We should probably talk about the API, rather than how it gets in there via the helper function that parses genome_data.py?

But, assuming that you are suggesting that we add those slots also to the Genome object, that sounds good.

actually i was suggesting that it would go not in the Genome object but in genome_data.data like here, but yes that would eventually get propagated to Genome

andrewkern · 2025-01-08T23:47:21Z

Actually, how about an attribute build, which is then a dict, containing type and optionally other things, like version? Or a NamedTuple or something like that? This is because we'd like to not start adding lots of slots for, say ncbi_build_version, etc etc.

Or, just two attributes, build_type and build_version?

Also, instead of NA I suppose you mean None?

This is a good suggestion-- genome_data.data could hold this build dict. And yes, I mean None.

petrelharp · 2025-01-09T14:35:40Z

The data dictionary in genome_data.py is not visible to end users; it is back-end for how we set up the Genome and Species objects that are user-visible, and are what we should be discussing. Probably everything in that dict is then mirrored as an attribute in Genome, so the distinction is moot, though?
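
A sketch of that mirroring, under stated assumptions: `GenomeSketch` is a stand-in class and not stdpopsim's actual Genome, and the attribute names `assembly_source` and `assembly_build_version` are the ones that appear in the stdpopsim/genomes.py diff later in this review. The default of "ensembl" follows the comment in the maintenance script diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenomeSketch:
    """Stand-in for the user-visible object; not the real stdpopsim.Genome."""

    assembly_name: str
    assembly_accession: str
    assembly_source: str = "ensembl"
    assembly_build_version: Optional[str] = None

    @classmethod
    def from_data(cls, data):
        """Copy the back-end dict entries onto user-visible attributes."""
        return cls(
            assembly_name=data["assembly_name"],
            assembly_accession=data["assembly_accession"],
            assembly_source=data.get("assembly_source", "ensembl"),
            assembly_build_version=data.get("assembly_build_version"),
        )


# Placeholder values only.
genome = GenomeSketch.from_data(
    {
        "assembly_name": "example",
        "assembly_accession": "GCA_000000000.0",
        "assembly_source": "manual",
    }
)
print(genome.assembly_source, genome.assembly_build_version)
```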

andrewkern · 2025-01-10T00:01:29Z

okay @petrelharp -- added the attributes we talked about to the Genome API. This led to lots of downstream changes, as we discussed. Sorry for the large PR.

petrelharp · 2025-01-13T02:46:21Z

maintenance/main.py

@@ -177,6 +183,10 @@ def black_format(code):


 def ensembl_stdpopsim_id(ensembl_id):
+    if ensembl_id == "canis_lupus_familiaris":


perhaps insert a comment explaining what this is doing?

I mean, explaining why this is necessary

And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris", so where does ensembl_id equal "canis_familiaris"? Just missing something here.

hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out

petrelharp · 2025-01-13T14:03:23Z

maintenance/main.py

        data = self.ensembl_client.get_genome_data(ensembl_id)
+
+        # Preserve existing assembly source or default to "ensembl"
+        if genome_data_path.exists():


wait, this duplicates the code above?

(and, it seems like it's doing the same thing, almost, as the code above?)

petrelharp · 2025-01-13T14:07:42Z

maintenance/main.py

+
+                if existing_chroms != new_chroms:
+                    logger.warning(
+                        f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."


Suggested change

f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."

f"Skipping {sps_id} ({ensembl_id}): chromosome names in existing genome_data.py "

"do not match chromosomes in current ensembl release. \n"

f"Not in Ensembl: {existing_chroms - new_chroms}\n"

f"Not in existing genome_data.py: {new_chroms - existing_chroms}."

Oh I see - this gets printed below. Never mind?
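
For reference, the suggested message relies on set difference, so both collections need to be sets rather than the sorted lists shown in the summary report. A small sketch, with the hypothetical helper name `mismatch_message`:

```python
def mismatch_message(sps_id, ensembl_id, existing_chroms, new_chroms):
    """Build the warning text, listing names unique to each side."""
    existing, new = set(existing_chroms), set(new_chroms)
    return (
        f"Skipping {sps_id} ({ensembl_id}): chromosome names in existing "
        "genome_data.py do not match chromosomes in current ensembl release.\n"
        f"Not in Ensembl: {sorted(existing - new)}\n"
        f"Not in existing genome_data.py: {sorted(new - existing)}."
    )


# Shortened version of the PonAbe case from the summary report.
msg = mismatch_message("PonAbe", "pongo_abelii", ["1", "MT", "X"], ["1", "X"])
print(msg)
```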

petrelharp · 2025-01-13T14:09:03Z

maintenance/main.py

+            data["assembly_build_version"] = None
+
+        # Check if existing genome data exists and compare chromosome names
+        if genome_data_path.exists():


This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?

petrelharp · 2025-01-13T14:14:15Z

maintenance/main.py

+    for species_id, eid in embl_ids:
+        try:
+            result = writer.write_genome_data(eid)
+            if result is not None:


I don't see how that function could return None?

petrelharp · 2025-01-13T14:18:01Z

maintenance/main.py

@@ -391,7 +520,7 @@ def add_species(ensembl_id, force):
    """
    writer = DataWriter()
    writer.add_species(ensembl_id.lower(), force=force)
-    writer.write_ensembl_release()
+    # writer.write_ensembl_release()


petrelharp · 2025-01-13T14:18:48Z

stdpopsim/catalog/ApiMel/genome_data.py

-        "CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]},
-        "CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]},
-        "CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]},
-        "CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]},


Seems a mistake to get rid of the MT synonym?

But these are hella weird chromosome names anyhow, so meh?

so the way we have things set up, synonyms are automatically generated by what goes in from ensembl. they must have removed it

petrelharp · 2025-01-13T14:33:12Z

stdpopsim/catalog/BosTau/species.py

-    "MT": 1,
-}
+_ploidy = {str(i): _species_ploidy for i in range(1, 30)}
+_ploidy.update({"X": _species_ploidy, "MT": 1})

 _chromosomes = []


_chromosomes is no longer used

yes that's right. forgot to delete these lines. will do

petrelharp · 2025-01-13T14:34:23Z

stdpopsim/catalog/CaeEle/species.py

@@ -123,10 +123,11 @@
        )
    )

-_genome = stdpopsim.Genome(
-    chromosomes=_chromosomes,


_chromosomes no longer used?

petrelharp · 2025-01-13T14:35:15Z

stdpopsim/catalog/CanFam/species.py

+_mutation_rate = 4e-9
+_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)}
+_mutation_rate_data["MT"] = (
+    _mutation_rate  # note this is likely incorrect but consistent with current setup


we could have this note on nearly all the species, right? no need here?

petrelharp · 2025-01-13T14:35:31Z

stdpopsim/catalog/CanFam/species.py

@@ -137,10 +146,11 @@
        )
    )

-_genome = stdpopsim.Genome(
-    chromosomes=_chromosomes,


_chromosomes no longer used?

petrelharp · 2025-01-13T14:36:31Z

stdpopsim/catalog/EscCol/species.py

@@ -46,11 +45,24 @@
        )
    )

-_genome = stdpopsim.Genome(
-    chromosomes=_chromosomes,


_chromosomes no longer used?

in this case it's used for chrom ids

petrelharp · 2025-01-13T14:38:05Z

stdpopsim/genomes.py

@@ -45,6 +45,10 @@ class Genome:
    :vartype assembly_name: str
    :ivar assembly_accession: The ID of the genome assembly accession.
    :vartype assembly_accession: str
+    :ivar assembly_source: The source of the genome assembly data.


Suggested change

:ivar assembly_source: The source of the genome assembly data.

:ivar assembly_source: The source of the genome assembly data (for instance, "ensembl").

Use "manual" if manually entered.

petrelharp · 2025-01-13T14:38:24Z

stdpopsim/genomes.py

@@ -45,6 +45,10 @@ class Genome:
    :vartype assembly_name: str
    :ivar assembly_accession: The ID of the genome assembly accession.
    :vartype assembly_accession: str
+    :ivar assembly_source: The source of the genome assembly data.
+    :vartype assembly_source: str
+    :ivar assembly_build_version: The version of the genome assembly build.


Suggested change

:ivar assembly_build_version: The version of the genome assembly build.

:ivar assembly_build_version: The version of the genome assembly build, or `None`.

petrelharp

Generally looks good. Some newly extraneous code, and some clarification needed, I think?

Co-authored-by: Peter Ralph <[email protected]>

changes to maintenance script / ensembl push

4e03aed

andrewkern requested review from petrelharp and nspope January 7, 2025 22:48

andrewkern mentioned this pull request Jan 8, 2025

Release 0.2.1 checklist #1565

update ensembl tests

2fb9499

added assembly attributes to API; lots of associated changes

71751a0

added test template stubs for new attributes

1a9b881

petrelharp reviewed Jan 13, 2025

petrelharp requested changes Jan 13, 2025

andrewkern and others added 5 commits January 13, 2025 07:03

Update maintenance/main.py

d120909

Co-authored-by: Peter Ralph <[email protected]>

Update species.py

18165af

Peter's edits to main maintenance

35b8c71

clean up of chrom definitions

c2f14c8

more helpful variable descriptions in API

44bcae9
