changes to maintenance script / ensembl push #1646
An additional thought here: I'm not checking whether chromosome lengths have changed. We may want to do that as well.
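A minimal sketch of what such a length check could look like, assuming the chromosome data is a mapping of name to `{"length": ..., "synonyms": [...]}` as in the catalog snippets further down this thread. The function name `changed_chromosome_lengths` and the example values are hypothetical, not part of the maintenance script.

```python
def changed_chromosome_lengths(existing_chroms, new_chroms):
    """Map chromosome name -> (old_length, new_length) where the two differ.

    Only chromosomes present in both mappings are compared; mismatched
    names are a separate check.
    """
    shared = existing_chroms.keys() & new_chroms.keys()
    return {
        name: (existing_chroms[name]["length"], new_chroms[name]["length"])
        for name in sorted(shared)
        if existing_chroms[name]["length"] != new_chroms[name]["length"]
    }


# Illustrative values only, not real assembly lengths.
existing = {"1": {"length": 1000, "synonyms": []}, "MT": {"length": 16, "synonyms": []}}
new = {"1": {"length": 1005, "synonyms": []}, "MT": {"length": 16, "synonyms": []}}
changes = changed_chromosome_lengths(existing, new)
```

An empty result would mean no shared chromosome changed length, so the update could proceed without a warning.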
Okay, over in #1521 we said What about the other species, for which maintenance failed - what do you think we should do there? Hm - nothing of consequence seems to be changed here, other than adding two Interestingly, I see that we are maybe already using different ensembl releases:
And I guess we need to also change the tests:
I'll go have a look at the tests now and update them in the PR.
Yes I'd like to add the ensembl_build version as a property of the genome_data -- I can add that to this PR I reckon
If we include this property, nothing will need to be done, other than maintaining the current build version, I think.
that's correct.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
- Coverage   99.84%   99.76%   -0.09%
==========================================
  Files         133      132       -1
  Lines        4569     4596      +27
  Branches      472      472
==========================================
+ Hits         4562     4585      +23
- Misses          3        5       +2
- Partials        4        6       +2
a simple way forward for including species specific build info would look like this:
including a
We should probably talk about the API, rather than how it gets in there via the helper function that parses But, assuming that you are suggesting that we add those slots also to the
Actually, how about an attribute Or, just two attributes, Also, instead of
actually i was suggesting that it would go not in the
This is a good suggestion--
The
okay @petrelharp -- added the attributes we talked about the
@@ -177,6 +183,10 @@ def black_format(code):


def ensembl_stdpopsim_id(ensembl_id):
    if ensembl_id == "canis_lupus_familiaris":
perhaps insert a comment explaining what this is doing?
I mean, explaining why this is necessary
And, actually, I don't understand why it's necessary. I see below that now species.ensembl_id == "canis_lupus_familiaris", so where does ensembl_id equal "canis_familiaris"? Just missing something here.
hah this is because of the transition that happened when running the maintenance script! now i bet i can take it out
maintenance/main.py
Outdated
data = self.ensembl_client.get_genome_data(ensembl_id)

# Preserve existing assembly source or default to "ensembl"
if genome_data_path.exists():
wait, this duplicates the code above?
(and, it seems like it's doing the same thing, almost, as the code above?)
maintenance/main.py
Outdated

if existing_chroms != new_chroms:
    logger.warning(
        f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."
Suggested change:
- f"Skipping {sps_id} ({ensembl_id}): chromosome names mismatch."
+ f"Skipping {sps_id} ({ensembl_id}): chromosome names in existing genome_data.py "
+ "do not match chromosomes in current ensembl release. \n"
+ f"Not in Ensembl: {existing_chroms - new_chroms}\n"
+ f"Not in existing genome_data.py: {new_chroms - existing_chroms}."
Oh I see - this gets printed below. Never mind?
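For reference, a self-contained sketch of how the two set differences in the suggested message could be computed. The helper name `chromosome_mismatch_message` is hypothetical, and sorting the names is an addition for stable output; the actual script may format this differently.

```python
def chromosome_mismatch_message(sps_id, ensembl_id, existing_chroms, new_chroms):
    """Return a warning string if the two name sets differ, else None."""
    existing_chroms, new_chroms = set(existing_chroms), set(new_chroms)
    if existing_chroms == new_chroms:
        return None
    return (
        f"Skipping {sps_id} ({ensembl_id}): chromosome names in existing genome_data.py "
        "do not match chromosomes in current ensembl release.\n"
        # Names that the catalog has but the new Ensembl data lacks.
        f"Not in Ensembl: {sorted(existing_chroms - new_chroms)}\n"
        # Names that the new Ensembl data has but the catalog lacks.
        f"Not in existing genome_data.py: {sorted(new_chroms - existing_chroms)}."
    )


# Illustrative chromosome names only.
message = chromosome_mismatch_message(
    "CanFam", "canis_lupus_familiaris", {"1", "MT"}, {"1", "X"}
)
```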
maintenance/main.py
Outdated
data["assembly_build_version"] = None

# Check if existing genome data exists and compare chromosome names
if genome_data_path.exists():
This is doing something different to the code above but duplicates the "open genome_data_path if it exists" code; just do that once?
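One way to read this suggestion, as a rough sketch: load the existing file a single time, then derive both the preserved assembly source and the chromosome-name comparison from that one result. The names `load_existing_genome_data` and `reconcile` are hypothetical, and the loader assumes `genome_data.py` defines a module-level dict called `data`, which may not match how the script actually reads it.

```python
import runpy
from pathlib import Path


def load_existing_genome_data(genome_data_path):
    """Return the `data` dict from an existing genome_data.py, or None if absent."""
    path = Path(genome_data_path)
    if not path.exists():
        return None
    # Assumption: the file defines a module-level dict named `data`.
    return runpy.run_path(str(path))["data"]


def reconcile(existing, new):
    """Return (merged_data, names_match) using the already-loaded existing data."""
    if existing is None:
        # No existing file: default the source and report nothing to compare.
        return {**new, "assembly_source": "ensembl"}, True
    # Preserve the existing assembly source, defaulting to "ensembl".
    merged = {**new, "assembly_source": existing.get("assembly_source", "ensembl")}
    names_match = set(existing["chromosomes"]) == set(new["chromosomes"])
    return merged, names_match
```

The caller would then do `existing = load_existing_genome_data(genome_data_path)` once and pass it to `reconcile`, rather than checking `genome_data_path.exists()` in two places.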
maintenance/main.py
Outdated
for species_id, eid in embl_ids:
    try:
        result = writer.write_genome_data(eid)
        if result is not None:
I don't see how that function could return None?
maintenance/main.py
Outdated
@@ -391,7 +520,7 @@ def add_species(ensembl_id, force):
     """
     writer = DataWriter()
     writer.add_species(ensembl_id.lower(), force=force)
-    writer.write_ensembl_release()
+    # writer.write_ensembl_release()
remove?
    "CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]},
    "CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]},
    "CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]},
    "CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]},
Seems a mistake to get rid of the MT synonym?
But these are hella weird chromosome names anyhow, so meh?
so the way we have things set up, synonyms are automatically generated from what comes in from ensembl. they must have removed it
stdpopsim/catalog/BosTau/species.py
Outdated
    "MT": 1,
}
_ploidy = {str(i): _species_ploidy for i in range(1, 30)}
_ploidy.update({"X": _species_ploidy, "MT": 1})

_chromosomes = []
_chromosomes is no longer used
yes that's right. forgot to delete these lines. will do
@@ -123,10 +123,11 @@
    )
)

_genome = stdpopsim.Genome(
    chromosomes=_chromosomes,
_chromosomes no longer used?
good catch
stdpopsim/catalog/CanFam/species.py
Outdated
_mutation_rate = 4e-9
_mutation_rate_data = {str(i): _mutation_rate for i in range(1, 39)}
_mutation_rate_data["MT"] = (
    _mutation_rate  # note this is likely incorrect but consistent with current setup
we could have this note on nearly all the species, right? no need here?
@@ -137,10 +146,11 @@
    )
)

_genome = stdpopsim.Genome(
    chromosomes=_chromosomes,
_chromosomes no longer used?
@@ -46,11 +45,24 @@
    )
)

_genome = stdpopsim.Genome(
    chromosomes=_chromosomes,
_chromosomes no longer used?
in this case it's used for chrom ids
@@ -45,6 +45,10 @@ class Genome:
    :vartype assembly_name: str
    :ivar assembly_accession: The ID of the genome assembly accession.
    :vartype assembly_accession: str
    :ivar assembly_source: The source of the genome assembly data.
Suggested change:
- :ivar assembly_source: The source of the genome assembly data.
+ :ivar assembly_source: The source of the genome assembly data (for instance, "ensembl").
+     Use "manual" if manually entered.
stdpopsim/genomes.py
Outdated
@@ -45,6 +45,10 @@ class Genome:
    :vartype assembly_name: str
    :ivar assembly_accession: The ID of the genome assembly accession.
    :vartype assembly_accession: str
    :ivar assembly_source: The source of the genome assembly data.
    :vartype assembly_source: str
    :ivar assembly_build_version: The version of the genome assembly build.
Suggested change:
- :ivar assembly_build_version: The version of the genome assembly build.
+ :ivar assembly_build_version: The version of the genome assembly build, or `None`.
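To make the documented attributes concrete, here is a stand-alone sketch of a container with the same four assembly fields. `GenomeSketch` is a hypothetical stand-in, not `stdpopsim.Genome`; the defaults and the `str` type of the build version are assumptions, and the example values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenomeSketch:
    """Hypothetical stand-in holding only the assembly metadata fields."""

    assembly_name: str
    assembly_accession: str
    # Where the assembly data came from, e.g. "ensembl", or "manual" if hand-entered.
    assembly_source: str = "ensembl"
    # Build version of the assembly, or None when it is not known (type assumed).
    assembly_build_version: Optional[str] = None


# Placeholder values, not a real assembly.
manual_genome = GenomeSketch(
    assembly_name="ExampleAssembly",
    assembly_accession="GCA_000000000.0",
    assembly_source="manual",
)
```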
Generally looks good. Some newly extraneous code, and some clarification needed, I think?
This PR addresses (closes?) #1521.
think i've finally cleaned this up a bit. this PR overhauls the update-genome-data portion of the maintenance script such that:
the final report looks like this:
as only some of the species are being updated here, we should arguably move from writing out an ensembl_info.py file, which has release data, to the release being held in a species-specific slot in each genome_data.py file of the catalog. I'm happy to make that edit if people agree.
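As a rough illustration of the per-species slot idea, the sketch below renders a `genome_data.py`-style file whose dict carries its own source and build version. The key names follow the `assembly_source` and `assembly_build_version` attributes discussed above; the function `render_genome_data` and every value in `example` are hypothetical placeholders.

```python
import pprint


def render_genome_data(data):
    """Return Python source text that defines `data` as a literal dict."""
    return "data = " + pprint.pformat(data, sort_dicts=False) + "\n"


# All values below are placeholders, not real catalog entries.
example = {
    "assembly_accession": "GCA_000000000.0",
    "assembly_name": "ExampleAssembly",
    "assembly_source": "ensembl",
    "assembly_build_version": "000",
    "chromosomes": {"1": {"length": 1000, "synonyms": []}},
}
source_text = render_genome_data(example)
```

With the release stored per species like this, a species skipped during an update would simply keep the version recorded in its own file.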