Improve workflow for ChEMBL database test #98
Comments
Maybe add something like
Well, the snapshot file has a size of about one gigabyte and GitHub refuses the upload (limit is 25 MB). I uploaded it to the repo using git lfs (limit is 2 GB for GitHub).
Analysis
ChEMBL ID | sum formula from TUCAN | sum formula from ChEMBL |
---|---|---|
CHEMBL42403 | BH3O3 | H3BO3 |
CHEMBL542171 | Cl2H6N2 | H6Cl2N2 |
CHEMBL542448 | ClH4NO | H4ClNO |
CHEMBL1161633 | ClHO3 | HClO3 |
CHEMBL1161634 | ClHO4 | HClO4 |
CHEMBL1161635 | BrHO3 | HBrO3 |
CHEMBL1200706 | AlH3O3 | H3AlO3 |
CHEMBL1200939 | ClH4N | H4ClN |
CHEMBL1207889 | FHO3S | HFO3S |
CHEMBL1231052 | AsH3 | H3As |
CHEMBL1231461 | BrH | HBr |
CHEMBL1231821 | ClH | HCl |
CHEMBL1232767 | FH | HF |
CHEMBL1236189 | AsH3O3 | H3AsO3 |
CHEMBL1616046 | ClHO | HClO |
CHEMBL1879693 | ClH4NO4 | H4ClNO4 |
CHEMBL1906899 | ClHO2 | HClO2 |
CHEMBL1909080 | Cl2H12O6Sr | H12Cl2O6Sr |
CHEMBL1909222 | AsHO2 | HAsO2 |
CHEMBL1909275 | FH | HF |
CHEMBL2097011 | B4H2O7 | H2B4O7 |
CHEMBL2104840 | Cl2H3K2MgNa3O12P2S | H3Cl2K2MgNa3O12P2S |
CHEMBL2106388 | CaH2O2 | H2CaO2 |
CHEMBL2107006 | CaHNaO2 | HCaNaO2 |
CHEMBL2107567 | CaHO4P | HCaO4P |
CHEMBL2140344 | BF3H4O2 | H4BF3O2 |
CHEMBL2146121 | CaH2 | H2Ca |
CHEMBL2218895 | CaH4O4P2 | H4CaO4P2 |
CHEMBL2218916 | Ca5HO13P3 | HCa5O13P3 |
CHEMBL2365415 | As3H5O7 | H5As3O7 |
CHEMBL2374288 | AsH3O4 | H3AsO4 |
CHEMBL2448495 | As3H2Na3O7 | H2As3Na3O7 |
CHEMBL3184656 | BeH8O8S | H8BeO8S |
CHEMBL3185229 | Cl2H12MgO6 | H12Cl2MgO6 |
CHEMBL3185957 | BaCl2H4O2 | H4BaCl2O2 |
CHEMBL3186904 | BF4H | HBF4 |
CHEMBL3188962 | AsH15Na2O11 | H15AsNa2O11 |
CHEMBL3707333 | B2H4Na2O8 | H4B2Na2O8 |
CHEMBL3707334 | B2H4O8 | H4B2O8 |
CHEMBL3707387 | Cl4H2O11 | H2Cl4O11 |
CHEMBL3833310 | AlH5O4 | H5AlO4 |
CHEMBL3833314 | AlCl3H12O6 | H12AlCl3O6 |
CHEMBL3833322 | Al2H2MgO13Si4 | H2Al2MgO13Si4 |
CHEMBL3833332 | Al2H2O13S3 | H2Al2O13S3 |
CHEMBL3833350 | Al5H33Mg10O40S2 | H33Al5Mg10O40S2 |
CHEMBL3833365 | Al2H4O9Si2 | H4Al2O9Si2 |
CHEMBL3833375 | B4H20Na2O17 | H20B4Na2O17 |
CHEMBL3833408 | Al5H31Mg10O39S2 | H31Al5Mg10O39S2 |
CHEMBL3833411 | B4H2O7 | H2B4O7 |
CHEMBL3989874 | BaCaH4O4 | H4BaCaO4 |
CHEMBL4297206 | AgFH6N2 | H6AgFN2 |
CHEMBL4298415 | BrH | HBr |
CHEMBL4298435 | AsF6H | HAsF6 |
CHEMBL4299960 | Cl4H4O10 | H4Cl4O10 |
CHEMBL4450513 | AgHO | HAgO |
It seems ChEMBL annotates hydrogens first if carbon is absent.
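The TUCAN column above is in plain alphabetical order, which is what the Hill system prescribes when carbon is absent, so the two columns differ only in where hydrogen is placed. A minimal sketch of an order-independent comparison, assuming simple sum formulas without charges, isotopes or brackets; the helper names are hypothetical and not part of TUCAN or ChEMBL:

```python
import re


def parse_formula(formula: str) -> dict:
    """Split a simple sum formula such as 'H3BO3' into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def same_composition(formula_a: str, formula_b: str) -> bool:
    """Compare two sum formulas independent of element order."""
    return parse_formula(formula_a) == parse_formula(formula_b)


def alphabetical_formula(formula: str) -> str:
    """Rewrite a carbon-free sum formula with all elements in alphabetical order."""
    counts = parse_formula(formula)
    return "".join(
        f"{symbol}{count if count > 1 else ''}"
        for symbol, count in sorted(counts.items())
    )


# Two rows from the table above: hydrogen-first formula vs. alphabetical formula
print(alphabetical_formula("H3BO3"))       # CHEMBL42403
print(same_composition("BF4H", "HBF4"))    # CHEMBL3186904
```

Comparing element counts instead of strings would make all rows in the table above count as matches.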
test_roundtrip_molfile_graph_tucan_graph_tucan_graph (serializer-parser roundtrip)
No failures.
Bijection test
171008 failures.
As far as I can tell, most of them are stereoisomers. For instance,
- CHEMBL1573853,
- CHEMBL1716183,
- CHEMBL1882433,
- CHEMBL1883192,
- CHEMBL1884515,
- CHEMBL1906279 and
- CHEMBL2134455
all have the same TUCAN string (and identical InChIKey main layer).
Nothing to worry about at the moment.
Really excellent work, absolutely love it! About the bijection test, you mention that "most of them are stereoisomers" - is there a way to quantify this and in particular single out and show the ChEMBL codes for those which are not stereoisomers? Other than that, works as expected, since stereoisomers are not covered by definition at the moment.
I guess with the first layer of the InChIKey (first 14 characters) we should be able to detect this. Thus, I just added the following to the analysis pipeline:
and among those
(1) gives 171008 compounds, (2) reduces this to 64 compounds. Table of compounds (the same data as in the csv file):
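The two filtering steps, (1) compounds sharing a TUCAN string and (2) those among them whose InChIKey first block (first 14 characters) also differs, could be sketched roughly as follows. This is not the code that was added to the pipeline: the table name snapshot, its columns, and every sample row are made up for illustration.

```python
import sqlite3

# Made-up rows: (id, TUCAN string, InChIKey). None of these are real values.
ROWS = [
    ("ID1", "tucan-A", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("ID2", "tucan-A", "AAAAAAAAAAAAAA-CCCCCCCCCC-N"),  # same first block as ID1
    ("ID3", "tucan-B", "DDDDDDDDDDDDDD-BBBBBBBBBB-N"),
    ("ID4", "tucan-B", "EEEEEEEEEEEEEE-BBBBBBBBBB-N"),  # first block differs from ID3
    ("ID5", "tucan-C", "FFFFFFFFFFFFFF-BBBBBBBBBB-N"),  # unique TUCAN string
]


def load(rows):
    """Create an in-memory table (hypothetical layout) holding the rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE snapshot (chembl_id TEXT, tucan TEXT, inchikey TEXT)")
    con.executemany("INSERT INTO snapshot VALUES (?, ?, ?)", rows)
    return con


def shared_tucan(con):
    """(1) IDs of compounds whose TUCAN string occurs more than once."""
    query = """
        SELECT chembl_id FROM snapshot
        WHERE tucan IN (
            SELECT tucan FROM snapshot GROUP BY tucan HAVING COUNT(*) > 1)
        ORDER BY chembl_id
    """
    return [row[0] for row in con.execute(query)]


def shared_tucan_different_first_block(con):
    """(2) Among those, IDs in groups with more than one InChIKey first block."""
    query = """
        SELECT chembl_id FROM snapshot
        WHERE tucan IN (
            SELECT tucan FROM snapshot GROUP BY tucan
            HAVING COUNT(DISTINCT substr(inchikey, 1, 14)) > 1)
        ORDER BY chembl_id
    """
    return [row[0] for row in con.execute(query)]


con = load(ROWS)
print(shared_tucan(con))
print(shared_tucan_different_first_block(con))
```

With the sample rows, step (1) keeps ID1 to ID4 and step (2) keeps only ID3 and ID4, the pair that cannot be explained as stereoisomers of each other.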
I did not manage to check all of the structural pairs individually, but the problem seems to be related to tautomer formation and is not an issue of TUCAN but rather of InChI, as all the structures I looked at appear to be the same and should have identical TUCAN strings ;-)
Related to #51.
Move workflow to a separate repository:
Separate Snakemake rules and Python code:
At the moment the Snakemake file is polluted with Python code for (a) SDFile creation from the SQLite database and (b) test execution.
Create snapshot:
At the moment a workflow run creates a log file with ChEMBL IDs that fail test_invariance. I'm proposing to let the workflow run create a "snapshot file" with the TUCAN strings and test results. This snapshot file should be versionable (text format) and should be added to the repository. CSV seems to be a wise choice and a csv writer is included in Python's standard library. The csv file should look like this:
I'm expecting a file size of several tens of megabytes.
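Writing such a snapshot with the standard library could be sketched as follows. The function name, column names and example values are assumptions for illustration and not the proposed file layout. Rows are sorted by ID so that repeated runs over the same data give identical, diff-friendly files.

```python
import csv
import io


def write_snapshot(results, out):
    """Write (id, tucan, passed) tuples as CSV rows in a stable order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["chembl_id", "tucan", "test_invariance"])  # hypothetical header
    for chembl_id, tucan, passed in sorted(results):
        writer.writerow([chembl_id, tucan, "pass" if passed else "fail"])


# Made-up results, deliberately given out of order
results = [
    ("ID2", "tucan-string-2", False),
    ("ID1", "tucan-string-1", True),
]

buffer = io.StringIO()
write_snapshot(results, buffer)
print(buffer.getvalue())
```

For a real file, the csv documentation recommends opening it with newline="", for example open(path, "w", newline="", encoding="utf-8").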
Further tasks:
- (find, as it is used right now, does not give a reproducible order)
Make tests reusable:
test_invariance in the Snakemake file is more or less a duplicate of test_invariance in TUCAN's tests. In order to improve reusability, most of test_invariance's implementation should be moved from TUCAN's test suite to a test module in the tucan package (i.e. into tucan/test_utils.py). The Snakemake workflow can then import test_invariance from tucan.test_utils and catch the AssertionError.
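How the workflow might call such a shared helper and turn an AssertionError into a recordable result, as a sketch only: tucan.test_utils does not exist yet, so a trivial stand-in named check_invariance is defined here, and run_check, the dictionary-based molecule and the result format are likewise hypothetical.

```python
def check_invariance(molecule):
    """Stand-in for the proposed shared test helper (hypothetical).

    The real helper would recompute the TUCAN string for permuted atom
    orders; this stub simply fails for molecules flagged as "bad".
    """
    assert not molecule.get("bad", False), f"invariance violated for {molecule['id']}"


def run_check(molecule):
    """Run the check and report the outcome instead of aborting the workflow."""
    try:
        check_invariance(molecule)
    except AssertionError as error:
        return {"id": molecule["id"], "passed": False, "message": str(error)}
    return {"id": molecule["id"], "passed": True, "message": ""}


print(run_check({"id": "ID1"}))
print(run_check({"id": "ID2", "bad": True}))
```

One caveat with this design: assert statements are removed when Python runs with the -O option, so the workflow would have to run without it.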
Add more tests:
Bijection test:
TUCAN's test_bijection is supposed to show that all (different) compounds in the test set have different TUCAN strings from each other. This should also be checked for the ChEMBL dataset, but the test implementation will be different:
SELECT COUNT(tucan), tucan FROM table GROUP BY tucan HAVING COUNT(tucan) > 1
(thanks @fbroda)
Tasks:
Licensing:
Tasks:
Other tasks: