Improve workflow for ChEMBL database test #98

Open
9 of 11 tasks
flange-ipb opened this issue Nov 18, 2022 · 6 comments

Comments

@flange-ipb
Collaborator

flange-ipb commented Nov 18, 2022

Related to #51.

Move workflow to a separate repository:

Separate Snakemake rules and Python code:

At the moment the Snakemake file is polluted with Python code for (a) SDFile creation from the SQLite database and (b) test execution.

  • move (a) and (b) into separate Python modules.

Create snapshot:

At the moment a workflow run creates a log file with ChEMBL IDs that fail test_invariance.

I propose that the workflow run instead creates a "snapshot file" with the TUCAN strings and test results. This snapshot file should be versionable (text format) and added to the repository. CSV seems a sensible choice, since a CSV writer is included in Python's standard library. The CSV file should look like this:

chembl_id,tucan,passed_test_invariance,passed_test_sumformula,passed_test_roundtrip_molfile_graph_tucan_graph_tucan_graph
CHEMBL6329,"C17H12ClN3O3/...",true,true,true
...

I'm expecting a file size of several tens of megabytes.
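A minimal sketch of how such a snapshot could be written with the standard library's csv module. The function name write_snapshot and the dict-based row layout are assumptions made for illustration, not part of the existing workflow; the sample row reuses the (truncated) example from above.

```python
import csv
import io

FIELDNAMES = [
    "chembl_id",
    "tucan",
    "passed_test_invariance",
    "passed_test_sumformula",
    "passed_test_roundtrip_molfile_graph_tucan_graph_tucan_graph",
]


def write_snapshot(rows, out):
    """Write result rows (dicts keyed by FIELDNAMES) as CSV to the open file `out`.

    Hypothetical helper, not taken from the actual workflow.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Store booleans as lowercase "true"/"false", as in the proposed format.
        writer.writerow(
            {
                key: str(value).lower() if isinstance(value, bool) else value
                for key, value in row.items()
            }
        )


# Demo with the truncated example row from the proposal.
buffer = io.StringIO()
write_snapshot(
    [
        {
            "chembl_id": "CHEMBL6329",
            "tucan": "C17H12ClN3O3/...",
            "passed_test_invariance": True,
            "passed_test_sumformula": True,
            "passed_test_roundtrip_molfile_graph_tucan_graph_tucan_graph": True,
        }
    ],
    buffer,
)
print(buffer.getvalue())
```

With the default minimal quoting, the csv module only quotes a field when it contains a delimiter, quote character or line break, so the TUCAN column may end up unquoted, unlike in the example above.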

Further tasks:

  • concatenate the snapshot files from all chunks, ordered by chunk number (find, as it is used right now, does not give a reproducible order)
  • document how a SQLite database can be created from the csv snapshot file
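The concatenation task could be sketched roughly as follows. The file naming scheme snapshot_<chunk number>.csv and both function names are assumptions; the real workflow may name its chunk files differently. Sorting by the parsed integer avoids both the arbitrary order of find and the lexicographic trap where chunk 10 sorts before chunk 2.

```python
import re
import tempfile
from pathlib import Path


def chunk_number(path):
    """Extract the trailing chunk number from a name like snapshot_12.csv (assumed scheme)."""
    match = re.search(r"_(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no chunk number in file name: {path.name}")
    return int(match.group(1))


def concatenate_snapshots(chunk_dir, target):
    """Concatenate all chunk files in numeric order, keeping a single header line."""
    chunks = sorted(Path(chunk_dir).glob("snapshot_*.csv"), key=chunk_number)
    with open(target, "w", newline="") as out:
        for index, chunk in enumerate(chunks):
            with open(chunk, newline="") as src:
                header = next(src, "")
                if index == 0:
                    out.write(header)  # header only once, from the first chunk
                out.writelines(src)  # remaining lines are copied unchanged
    return chunks


# Demo with three tiny placeholder chunks created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for number in (10, 2, 1):
        (tmp_path / f"snapshot_{number}.csv").write_text(
            f"chembl_id,tucan\nID_{number},placeholder\n"
        )
    target = tmp_path / "snapshot.csv"
    concatenate_snapshots(tmp_path, target)
    merged = target.read_text()

print(merged)
```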

Make tests reusable:

test_invariance in the Snakemake file is more or less a duplicate of test_invariance in TUCAN's tests.

In order to improve reusability, most of test_invariance's implementation should be moved from TUCAN's test suite to a test module in the tucan package (i.e. into tucan/test_utils.py). The Snakemake workflow can then import test_invariance from tucan.test_utils and catch the AssertionError.
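On the workflow side, catching the AssertionError could look roughly like the sketch below. The module tucan.test_utils is only proposed here and does not exist yet, so the example uses a stand-in check (check_is_sorted, purely hypothetical) in place of the real test_invariance; run_check is likewise an illustrative name.

```python
def run_check(check, *args):
    """Run an assertion-based check and report the outcome as a bool."""
    try:
        check(*args)
    except AssertionError:
        return False
    return True


def check_is_sorted(values):
    """Stand-in for an imported test function such as test_invariance (hypothetical)."""
    assert list(values) == sorted(values)


passed = run_check(check_is_sorted, [1, 2, 3])
failed = run_check(check_is_sorted, [3, 1, 2])
print(passed, failed)
```

Only AssertionError is caught on purpose: any other exception would indicate a problem in the workflow or the library rather than a failed test, and should probably surface as an error.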

Add more tests:

Bijection test:

TUCAN's test_bijection is supposed to show that all (different) compounds in the test set have different TUCAN strings from each other. This should also be checked for the ChEMBL dataset, but the test implementation will be different:

  • create a SQLite database from the csv snapshot
  • add an index on the tucan column
  • find TUCAN string duplicates: SELECT COUNT(tucan), tucan FROM table GROUP BY tucan HAVING COUNT(tucan) > 1 (thanks @fbroda)
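The three steps above can be sketched with Python's built-in sqlite3 and csv modules. The table name snapshot, the index name and the function names are assumptions, and the demo data are placeholders rather than real ChEMBL entries. Note that the query above uses "table" as a placeholder; since TABLE is an SQL keyword, a concrete table name is needed.

```python
import csv
import io
import sqlite3


def load_snapshot(connection, csv_file):
    """Create a table from the CSV snapshot and index the tucan column."""
    connection.execute(
        "CREATE TABLE snapshot (chembl_id TEXT PRIMARY KEY, tucan TEXT)"
    )
    reader = csv.DictReader(csv_file)
    connection.executemany(
        "INSERT INTO snapshot (chembl_id, tucan) VALUES (?, ?)",
        ((row["chembl_id"], row["tucan"]) for row in reader),
    )
    connection.execute("CREATE INDEX idx_snapshot_tucan ON snapshot (tucan)")


def find_duplicate_tucans(connection):
    """Return (count, tucan) for every TUCAN string that occurs more than once."""
    return connection.execute(
        "SELECT COUNT(tucan), tucan FROM snapshot "
        "GROUP BY tucan HAVING COUNT(tucan) > 1 ORDER BY tucan"
    ).fetchall()


# Demo with placeholder data: A and B share a TUCAN string, C does not.
sample = io.StringIO("chembl_id,tucan\nA,t1\nB,t1\nC,t2\n")
connection = sqlite3.connect(":memory:")
load_snapshot(connection, sample)
duplicates = find_duplicate_tucans(connection)
print(duplicates)
```

For the full snapshot, sqlite3.connect would point at a file instead of ":memory:", and the CSV would be opened with newline="" as recommended by the csv module documentation.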

Tasks:

  • automate (shell or Python script)

Licensing:

Tasks:

  • add LICENSE file (CC-BY-SA 4.0) next to the snapshot file
  • add attribution via a readme next to the snapshot file
  • add GPLv3

Other tasks:

  • upgrade to ChEMBL version 31
@schatzsc
Collaborator

Maybe add something like TUCAN-nest/ChEMBL-test to indicate it is not a ChEMBL version itself that is worked up?

@flange-ipb
Collaborator Author

"I'm expecting a file size of several tens of megabytes."

Well, the snapshot file has a size of about one gigabyte and GitHub refuses the upload (limit is 25 MB). I uploaded it to the repo using git lfs (limit is 2 GB for GitHub).

@flange-ipb
Collaborator Author

Analysis

test_permutation_invariance

No fails.

test_sumformula

Compare sum formulas from the TUCAN string and from the ChEMBL compound (without charge annotations).

55 fails:

Columns: ChEMBL ID, sum formula from TUCAN, sum formula from ChEMBL
CHEMBL42403 BH3O3 H3BO3
CHEMBL542171 Cl2H6N2 H6Cl2N2
CHEMBL542448 ClH4NO H4ClNO
CHEMBL1161633 ClHO3 HClO3
CHEMBL1161634 ClHO4 HClO4
CHEMBL1161635 BrHO3 HBrO3
CHEMBL1200706 AlH3O3 H3AlO3
CHEMBL1200939 ClH4N H4ClN
CHEMBL1207889 FHO3S HFO3S
CHEMBL1231052 AsH3 H3As
CHEMBL1231461 BrH HBr
CHEMBL1231821 ClH HCl
CHEMBL1232767 FH HF
CHEMBL1236189 AsH3O3 H3AsO3
CHEMBL1616046 ClHO HClO
CHEMBL1879693 ClH4NO4 H4ClNO4
CHEMBL1906899 ClHO2 HClO2
CHEMBL1909080 Cl2H12O6Sr H12Cl2O6Sr
CHEMBL1909222 AsHO2 HAsO2
CHEMBL1909275 FH HF
CHEMBL2097011 B4H2O7 H2B4O7
CHEMBL2104840 Cl2H3K2MgNa3O12P2S H3Cl2K2MgNa3O12P2S
CHEMBL2106388 CaH2O2 H2CaO2
CHEMBL2107006 CaHNaO2 HCaNaO2
CHEMBL2107567 CaHO4P HCaO4P
CHEMBL2140344 BF3H4O2 H4BF3O2
CHEMBL2146121 CaH2 H2Ca
CHEMBL2218895 CaH4O4P2 H4CaO4P2
CHEMBL2218916 Ca5HO13P3 HCa5O13P3
CHEMBL2365415 As3H5O7 H5As3O7
CHEMBL2374288 AsH3O4 H3AsO4
CHEMBL2448495 As3H2Na3O7 H2As3Na3O7
CHEMBL3184656 BeH8O8S H8BeO8S
CHEMBL3185229 Cl2H12MgO6 H12Cl2MgO6
CHEMBL3185957 BaCl2H4O2 H4BaCl2O2
CHEMBL3186904 BF4H HBF4
CHEMBL3188962 AsH15Na2O11 H15AsNa2O11
CHEMBL3707333 B2H4Na2O8 H4B2Na2O8
CHEMBL3707334 B2H4O8 H4B2O8
CHEMBL3707387 Cl4H2O11 H2Cl4O11
CHEMBL3833310 AlH5O4 H5AlO4
CHEMBL3833314 AlCl3H12O6 H12AlCl3O6
CHEMBL3833322 Al2H2MgO13Si4 H2Al2MgO13Si4
CHEMBL3833332 Al2H2O13S3 H2Al2O13S3
CHEMBL3833350 Al5H33Mg10O40S2 H33Al5Mg10O40S2
CHEMBL3833365 Al2H4O9Si2 H4Al2O9Si2
CHEMBL3833375 B4H20Na2O17 H20B4Na2O17
CHEMBL3833408 Al5H31Mg10O39S2 H31Al5Mg10O39S2
CHEMBL3833411 B4H2O7 H2B4O7
CHEMBL3989874 BaCaH4O4 H4BaCaO4
CHEMBL4297206 AgFH6N2 H6AgFN2
CHEMBL4298415 BrH HBr
CHEMBL4298435 AsF6H HAsF6
CHEMBL4299960 Cl4H4O10 H4Cl4O10
CHEMBL4450513 AgHO HAgO

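All 55 failures appear to differ only in element order, not in composition: it seems ChEMBL lists hydrogen first if carbon is absent, whereas the sum formula from TUCAN lists all elements alphabetically in that case.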
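One way to make the sum formula comparison insensitive to element order is to compare element counts instead of strings. This is only a sketch under the assumption that the formulas are flat (no parentheses, charges or isotope labels), as in the table above; the function names are made up for illustration and are not part of TUCAN.

```python
import re
from collections import Counter


def element_counts(sum_formula):
    """Parse a flat sum formula such as 'Cl2H12O6Sr' into element counts."""
    counts = Counter()
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", sum_formula):
        counts[symbol] += int(number) if number else 1
    return counts


def same_composition(formula_a, formula_b):
    """True if both formulas contain the same elements in the same amounts."""
    return element_counts(formula_a) == element_counts(formula_b)


# Two pairs from the table above, plus one deliberately different pair.
print(same_composition("BH3O3", "H3BO3"))
print(same_composition("Cl2H12O6Sr", "H12Cl2O6Sr"))
print(same_composition("ClHO3", "HClO4"))
```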

test_roundtrip_molfile_graph_tucan_graph_tucan_graph (serializer-parser roundtrip)

No fails.

Bijection test

171008 fails.
As far as I can tell, most of them are stereoisomers, which all have the same TUCAN string (and an identical InChIKey main layer).
Nothing to worry about at the moment.

@schatzsc
Collaborator

schatzsc commented Dec 1, 2022

Really excellent work, absolutely love it!

About the bijection test, you mention that "most of them are stereoisomers" - is there a way to quantify this and in particular single out and show the ChEMBL codes for those which are not stereoisomers?

Other than that, works as expected, since stereoisomers are not covered by definition at the moment.

@flange-ipb
Collaborator Author

@schatzsc

"... in particular single out and show the ChEMBL codes for those which are not stereoisomers?"

I guess with the first layer of the InChIKey (first 14 characters) we should be able to detect this.

Thus, I just added the following to the analysis pipeline:

  • (1) find all compounds with duplicate TUCAN strings (= what I called bijection test previously)

and among those

  • (2) find all compounds with different InChIKey first layer.

(1) gives 171008 compounds, (2) reduces this to 64 compounds.
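The two filter steps might be sketched like this. It reflects one reading of step (2), namely keeping those duplicate groups whose members have more than one distinct InChIKey first block (first 14 characters). The function names and the record layout (ChEMBL ID, TUCAN string, InChIKey) are assumptions. The "Zn/" pair is taken from the table below; the third pair is a made-up placeholder standing in for a pair of stereoisomers.

```python
from collections import defaultdict


def group_by_tucan(records):
    """Step (1): group compounds by TUCAN string and keep groups with duplicates."""
    groups = defaultdict(list)
    for chembl_id, tucan, inchikey in records:
        groups[tucan].append((chembl_id, inchikey))
    return {tucan: members for tucan, members in groups.items() if len(members) > 1}


def groups_with_different_first_block(duplicate_groups):
    """Step (2): keep groups whose InChIKey first blocks (14 characters) differ."""
    return {
        tucan: members
        for tucan, members in duplicate_groups.items()
        if len({inchikey[:14] for _, inchikey in members}) > 1
    }


records = [
    # Pair from the table below: same TUCAN string, different first block.
    ("CHEMBL1201279", "Zn/", "HCHKCACWOHOZIP-UHFFFAOYSA-N"),
    ("CHEMBL1236970", "Zn/", "PTFCDOFLOPIGGS-UHFFFAOYSA-N"),
    # Hypothetical placeholders: same first block, different second block.
    ("HYPOTHETICAL_1", "placeholder/", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("HYPOTHETICAL_2", "placeholder/", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    # Hypothetical compound with a unique TUCAN string.
    ("HYPOTHETICAL_3", "unique/", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]

duplicates = group_by_tucan(records)
suspicious = groups_with_different_first_block(duplicates)
print(sorted(duplicates), sorted(suspicious))
```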

Table of compounds (the same data as in the CSV file):

Each group lists the duplicates (ChEMBL ID, InChIKey in parentheses), followed by their shared TUCAN string:
CHEMBL1200471 (OTPSWLRZXRHDNX-UHFFFAOYSA-L)
CHEMBL3392049 (PICXIOQBANWBIZ-UHFFFAOYSA-N)
C10H8N2O2S2Zn/(1-9)(2-10)(3-11)(4-12)(5-13)(6-14)(7-15)(8-16)(9-11)(9-13)(10-12)(10-14)(11-15)(12-16)(13-17)(14-18)(15-19)(16-20)(17-19)(17-23)(18-20)(18-24)(19-21)(20-22)
CHEMBL74469 (INFDPOAKFNIJBF-UHFFFAOYSA-N)
CHEMBL1451731 (CGIXVVWWIDFCNE-UHFFFAOYSA-N)
C12H14N2/(1-15)(2-16)(3-17)(4-18)(5-21)(6-21)(7-21)(8-22)(9-22)(10-22)(11-23)(12-24)(13-25)(14-26)(15-19)(15-23)(16-19)(16-24)(17-20)(17-25)(18-20)(18-26)(19-20)(21-27)(22-28)(23-27)(24-27)(25-28)(26-28)
CHEMBL1569487 (WLHQHAUOOXYABV-UHFFFAOYSA-N)
CHEMBL3188235 (GTASRFITTBVXRX-UHFFFAOYSA-N)
C13H10ClN3O4S2/(1-11)(2-12)(3-13)(4-14)(5-15)(6-15)(7-15)(8-16)(9-25)(10-28)(11-12)(11-13)(12-16)(13-18)(14-22)(14-23)(15-26)(16-24)(17-19)(17-20)(17-26)(18-24)(18-25)(19-21)(19-28)(20-25)(20-27)(21-22)(21-31)(22-32)(23-31)(23-33)(26-32)(29-32)(30-32)
CHEMBL302795 (LZNWYQJJBLGYLT-UHFFFAOYSA-N)
CHEMBL3188633 (MHLVRGMNJBUIRL-UHFFFAOYSA-N)
C13H11N3O4S2/(1-12)(2-13)(3-14)(4-15)(5-16)(6-16)(7-16)(8-17)(9-22)(10-26)(11-29)(12-13)(12-14)(13-17)(14-19)(15-22)(15-24)(16-27)(17-25)(18-20)(18-21)(18-27)(19-25)(19-26)(20-23)(20-29)(21-26)(21-28)(22-32)(23-24)(23-32)(24-33)(27-33)(30-33)(31-33)
CHEMBL599 (ZRVUJXDFFKFLMG-UHFFFAOYSA-N)
CHEMBL1741042 (JJMWPUUMNOUPNB-UHFFFAOYSA-N)
C14H13N3O4S2/(1-14)(2-14)(3-14)(4-15)(5-16)(6-17)(7-18)(8-20)(9-20)(10-20)(11-21)(12-29)(13-32)(14-25)(15-16)(15-17)(16-18)(17-19)(18-26)(19-23)(19-26)(20-30)(21-25)(21-28)(22-23)(22-24)(22-30)(23-32)(24-29)(24-31)(25-35)(26-36)(27-28)(27-29)(27-35)(30-36)(33-36)(34-36)
CHEMBL53292 (YYUAYBYLJSNDCX-UHFFFAOYSA-N)
CHEMBL1898455 (RSBXRGZKRGVKGF-UHFFFAOYSA-N)
C14H13N3O5S/(1-14)(2-14)(3-14)(4-15)(5-16)(6-17)(7-18)(8-19)(9-21)(10-21)(11-21)(12-28)(13-32)(14-25)(15-16)(15-17)(16-19)(17-20)(18-23)(18-25)(19-27)(20-24)(20-27)(21-30)(22-24)(22-26)(22-30)(23-28)(23-29)(24-32)(25-33)(26-28)(26-31)(27-36)(29-33)(30-36)(34-36)(35-36)
CHEMBL1091115 (NALREUIWICQLPS-UHFFFAOYSA-N)
CHEMBL3330735 (PGWTYMLATMNCCZ-UHFFFAOYSA-M)
C14H14ClN3S/(1-15)(2-16)(3-17)(4-18)(5-19)(6-20)(7-21)(8-21)(9-21)(10-22)(11-22)(12-22)(13-29)(14-29)(15-16)(15-23)(16-24)(17-18)(17-25)(18-26)(19-23)(19-27)(20-26)(20-28)(21-31)(22-31)(23-29)(24-27)(24-30)(25-28)(25-30)(26-31)(27-32)(28-32)
CHEMBL1199277 (HMYISFASHDWPMO-UHFFFAOYSA-O)
CHEMBL3558280 (FXONXBFTEMLSRZ-UHFFFAOYSA-N)
C14H14N3S/(1-15)(2-16)(3-17)(4-18)(5-19)(6-20)(7-21)(8-21)(9-21)(10-22)(11-22)(12-22)(13-29)(14-29)(15-16)(15-23)(16-24)(17-18)(17-25)(18-26)(19-23)(19-27)(20-26)(20-28)(21-31)(22-31)(23-29)(24-27)(24-30)(25-28)(25-30)(26-31)(27-32)(28-32)
CHEMBL2104833 (VCSAHSDZAKGXAT-AFEZEDKISA-M)
CHEMBL2282011 (WUHOMZPXRJWWDW-UHFFFAOYSA-M)
C14H8ClN2NaO3S/(1-9)(2-10)(3-11)(4-12)(5-13)(6-20)(7-23)(8-23)(9-12)(9-16)(10-11)(10-20)(11-21)(12-22)(13-14)(13-22)(14-15)(14-16)(15-17)(15-18)(16-24)(17-21)(17-25)(18-24)(18-26)(19-23)(19-24)(19-27)(20-29)(21-29)(22-30)
CHEMBL428064 (CYSOFAOLQAYKGU-UHFFFAOYSA-N)
CHEMBL1589921 (LKDGTXMRXOKPFZ-UHFFFAOYSA-N)
C15H10N2O2S/(1-11)(2-12)(3-13)(4-14)(5-15)(6-16)(7-17)(8-18)(9-22)(10-27)(11-12)(11-16)(12-18)(13-15)(13-19)(14-17)(14-19)(15-20)(16-21)(17-22)(18-25)(19-23)(20-21)(20-24)(21-25)(22-26)(23-24)(23-26)(24-27)(25-30)(27-30)(28-30)(29-30)
CHEMBL527 (QYSPLQLAKJAUJT-UHFFFAOYSA-N)
CHEMBL1518938 (IAPMPNKHZPZVQU-UHFFFAOYSA-N)
C15H13N3O4S/(1-14)(2-15)(3-16)(4-17)(5-18)(6-19)(7-20)(8-22)(9-22)(10-22)(11-23)(12-30)(13-33)(14-16)(14-17)(15-18)(15-19)(16-20)(17-21)(18-23)(19-25)(20-28)(21-26)(21-28)(22-31)(23-29)(24-26)(24-27)(24-31)(25-29)(25-30)(26-33)(27-30)(27-32)(28-36)(31-36)(34-36)(35-36)
CHEMBL562639 (GEDVVYWLPUPJJZ-UHFFFAOYSA-N)
CHEMBL1790006 (HNONEKILPDHFOL-UHFFFAOYSA-M)
C15H16ClN3S/(1-17)(2-17)(3-17)(4-18)(5-19)(6-20)(7-21)(8-22)(9-24)(10-24)(11-24)(12-25)(13-25)(14-25)(15-32)(16-32)(17-23)(18-19)(18-27)(19-29)(20-23)(20-28)(21-26)(21-31)(22-29)(22-30)(23-26)(24-34)(25-34)(26-32)(27-30)(27-33)(28-31)(28-33)(29-34)(30-35)(31-35)
CHEMBL1817784 (DNDJEIWCTMMZBX-UHFFFAOYSA-N)
CHEMBL3330736 (KFZNPGQYVZZSNV-UHFFFAOYSA-M)
C15H16ClN3S/(1-17)(2-18)(3-19)(4-20)(5-21)(6-22)(7-23)(8-23)(9-23)(10-24)(11-24)(12-24)(13-25)(14-25)(15-25)(16-33)(17-19)(17-26)(18-20)(18-27)(19-28)(20-29)(21-28)(21-30)(22-29)(22-31)(23-33)(24-34)(25-34)(26-30)(26-32)(27-31)(27-32)(28-33)(29-34)(30-35)(31-35)
CHEMBL1197206 (KZEUBCUXBNEMSQ-UHFFFAOYSA-O)
CHEMBL1622638 (RQNADZJRELIICT-UHFFFAOYSA-N)
C15H16N3S/(1-17)(2-17)(3-17)(4-18)(5-19)(6-20)(7-21)(8-22)(9-24)(10-24)(11-24)(12-25)(13-25)(14-25)(15-32)(16-32)(17-23)(18-19)(18-27)(19-29)(20-23)(20-28)(21-26)(21-31)(22-29)(22-30)(23-26)(24-34)(25-34)(26-32)(27-30)(27-33)(28-31)(28-33)(29-34)(30-35)(31-35)
CHEMBL1852066 (KKGWEPLBCABCDN-UHFFFAOYSA-O)
CHEMBL3558304 (VBFJMDUNUIVOMH-UHFFFAOYSA-N)
C15H16N3S/(1-17)(2-18)(3-19)(4-20)(5-21)(6-22)(7-23)(8-23)(9-23)(10-24)(11-24)(12-24)(13-25)(14-25)(15-25)(16-33)(17-19)(17-26)(18-20)(18-27)(19-28)(20-29)(21-28)(21-30)(22-29)(22-31)(23-33)(24-34)(25-34)(26-30)(26-32)(27-31)(27-32)(28-33)(29-34)(30-35)(31-35)
CHEMBL1729471 (MXLFKQWACADBHE-UHFFFAOYSA-N)
CHEMBL3913079 (KZLRTHGLORNBHC-UHFFFAOYSA-N)
C16H12N2O2S/(1-13)(2-13)(3-13)(4-14)(5-15)(6-16)(7-17)(8-18)(9-19)(10-20)(11-21)(12-30)(13-25)(14-15)(14-19)(15-21)(16-18)(16-22)(17-20)(17-22)(18-23)(19-24)(20-25)(21-28)(22-26)(23-24)(23-27)(24-28)(25-29)(26-27)(26-29)(27-30)(28-33)(30-33)(31-33)(32-33)
CHEMBL273148 (MXKZFNJSAXLWNC-UHFFFAOYSA-N)
CHEMBL1712499 (ZEPLLJACXSUGNZ-UHFFFAOYSA-N)
C17H11F3N2O2S/(1-12)(2-12)(3-12)(4-13)(5-14)(6-15)(7-16)(8-17)(9-18)(10-19)(11-30)(12-24)(13-15)(13-20)(14-18)(14-20)(15-21)(16-19)(16-22)(17-22)(17-23)(18-24)(19-28)(20-25)(21-23)(21-26)(22-27)(23-28)(24-29)(25-26)(25-29)(26-30)(27-33)(27-34)(27-35)(28-36)(30-36)(31-36)(32-36)
CHEMBL188744 (ZVHVJEWJNSDXDB-UHFFFAOYSA-N)
CHEMBL2005075 (CABFNLMCQJOTMT-UHFFFAOYSA-O)
C18H17N7O/(1-18)(2-18)(3-18)(4-19)(5-20)(6-21)(7-22)(8-23)(9-24)(10-25)(11-26)(12-27)(13-27)(14-36)(15-36)(16-37)(17-37)(18-27)(19-20)(19-21)(20-24)(21-25)(22-23)(22-28)(23-31)(24-32)(25-32)(26-28)(26-33)(27-30)(28-29)(29-30)(29-34)(30-38)(31-33)(31-40)(32-41)(33-42)(34-36)(34-39)(35-37)(35-38)(35-39)(40-41)(41-42)(42-43)
CHEMBL2105246 (DMHQLXUFCQSQQQ-JZJYNLBNSA-N)
CHEMBL3989800 (OWFUPROYPKGHMH-UHFFFAOYSA-N)
C18H18N4O2/(1-19)(2-19)(3-19)(4-20)(5-21)(6-22)(7-23)(8-24)(9-25)(10-26)(11-27)(12-28)(13-29)(14-30)(15-30)(16-32)(17-34)(18-38)(19-34)(20-22)(20-23)(21-24)(21-25)(22-26)(23-27)(24-28)(25-29)(26-31)(27-31)(28-33)(29-33)(30-31)(30-34)(32-35)(32-39)(33-38)(34-39)(35-37)(35-42)(36-37)(36-38)(36-41)(39-40)(40-42)
CHEMBL1200879 (KYITYFHKDODNCQ-UHFFFAOYSA-M)
CHEMBL2028193 (INUBUXBVYNRDGP-UHFFFAOYSA-M)
C19H15NaO4/(1-16)(2-16)(3-16)(4-17)(5-18)(6-19)(7-20)(8-21)(9-22)(10-23)(11-24)(12-25)(13-26)(14-26)(15-30)(16-31)(17-18)(17-19)(18-22)(19-23)(20-21)(20-24)(21-25)(22-27)(23-27)(24-28)(25-33)(26-30)(26-31)(27-30)(28-32)(28-33)(29-30)(29-32)(29-34)(31-35)(32-36)(33-38)(34-37)(34-38)
CHEMBL1178103 (DIGXMFZQXQUVMR-UHFFFAOYSA-N)
CHEMBL1627264 (PLMIUUBYGDLUDD-UHFFFAOYSA-N)
C21H23NO5S/(1-24)(2-24)(3-24)(4-25)(5-26)(6-27)(7-28)(8-29)(9-30)(10-30)(11-31)(12-31)(13-32)(14-32)(15-33)(16-33)(17-34)(18-34)(19-43)(20-43)(21-43)(22-45)(23-48)(24-30)(25-27)(25-38)(26-36)(26-38)(27-40)(28-35)(28-44)(29-37)(29-44)(30-31)(31-32)(32-33)(33-34)(34-35)(35-41)(36-39)(36-40)(37-39)(37-41)(38-42)(39-46)(40-49)(41-49)(42-47)(42-48)(43-51)(44-51)(45-51)(50-51)
CHEMBL90073 (FPQMLJDRTAEFLR-UHFFFAOYSA-N)
CHEMBL110934 (QKVZLTHLJDEDFA-IMVLJIQESA-N)
C22H22N6O2S/(1-23)(2-24)(3-25)(4-26)(5-27)(6-28)(7-29)(8-30)(9-31)(10-32)(11-33)(12-36)(13-36)(14-36)(15-37)(16-37)(17-37)(18-44)(19-44)(20-44)(21-46)(22-50)(23-24)(23-25)(24-27)(25-34)(26-30)(26-35)(27-38)(28-31)(28-40)(29-32)(29-40)(30-42)(31-43)(32-43)(33-39)(33-42)(34-38)(34-41)(35-39)(35-41)(36-48)(37-48)(38-45)(39-45)(40-46)(41-46)(42-47)(43-50)(44-53)(47-49)(48-49)(50-53)(51-53)(52-53)
CHEMBL45260 (WHQCODQJWHWKCA-UHFFFAOYSA-N)
CHEMBL42192 (PJZWENDBSVLVEB-UHFFFAOYSA-N)
C34H40N4/(1-41)(2-42)(3-43)(4-44)(5-45)(6-46)(7-47)(8-48)(9-49)(10-50)(11-51)(12-52)(13-53)(14-54)(15-55)(16-56)(17-63)(18-63)(19-63)(20-64)(21-64)(22-64)(23-65)(24-65)(25-65)(26-66)(27-66)(28-66)(29-67)(30-67)(31-67)(32-68)(33-68)(34-68)(35-69)(36-69)(37-69)(38-70)(39-70)(40-70)(41-49)(41-57)(42-50)(42-57)(43-51)(43-58)(44-52)(44-58)(45-53)(45-59)(46-54)(46-59)(47-55)(47-60)(48-56)(48-60)(49-71)(50-71)(51-72)(52-72)(53-73)(54-73)(55-74)(56-74)(57-61)(58-61)(59-62)(60-62)(61-62)(63-75)(64-75)(65-76)(66-76)(67-77)(68-77)(69-78)(70-78)(71-75)(72-76)(73-77)(74-78)
CHEMBL2105351 (XNRNJIIJLOFJEK-UHFFFAOYSA-N)
CHEMBL2364542 (WNGMMIYXPIAYOB-UHFFFAOYSA-M)
C5H4NNaOS/(1-5)(2-6)(3-7)(4-8)(5-6)(5-7)(6-8)(7-9)(8-10)(9-10)(9-13)(10-11)
CHEMBL357626 (MIJOCOSWJQYXGU-TZKMECQKSA-N)
CHEMBL1627232 (PSCDTFAJGFDXAO-TZKMECQKSA-N)
C6H11NO6S/(1-12)(2-12)(3-16)(4-16)(5-16)(6-17)(7-17)(8-18)(9-21)(10-22)(11-23)(12-13)(12-14)(13-15)(13-17)(13-21)(14-19)(14-22)(15-20)(15-23)(16-25)(17-25)(18-25)(24-25)
CHEMBL842 (JBMKAUGHUNFTOL-UHFFFAOYSA-N)
CHEMBL3392493 (MRLXNACLEKEMCQ-UHFFFAOYSA-N)
C7H6ClN3O4S2/(1-7)(2-8)(3-10)(4-15)(5-15)(6-16)(7-11)(7-12)(8-9)(8-13)(9-11)(9-14)(10-14)(10-16)(11-22)(12-13)(12-21)(13-23)(15-21)(16-22)(17-21)(18-21)(19-22)(20-22)
CHEMBL1514931 (ZVBIRPKGWOVBLG-UHFFFAOYSA-N)
CHEMBL1596271 (QVADSMGRZHVGPL-UHFFFAOYSA-M)
C8H7N2NaO3S/(1-8)(2-9)(3-10)(4-11)(5-15)(6-15)(7-17)(8-9)(8-10)(9-11)(10-12)(11-14)(12-13)(12-14)(13-15)(13-16)(14-18)(15-22)(16-18)(17-22)(19-22)(20-22)
CHEMBL2107528 (ZPNRBQVNNIDJHX-UHFFFAOYSA-M)
CHEMBL3186518 (DSOWAKKSGYUMTF-GZOLSCHFSA-M)
C8H7NaO4/(1-8)(2-8)(3-8)(4-9)(5-9)(6-9)(7-10)(8-12)(9-14)(10-13)(10-14)(11-12)(11-13)(11-15)(12-16)(13-17)(14-19)(15-18)(15-19)
CHEMBL1741516 (AGQSEIGTBQIVBF-UHFFFAOYSA-N)
CHEMBL1989025 (PZUHHXFIMAEXGO-UHFFFAOYSA-N)
C9H8N2O2/(1-9)(2-9)(3-9)(4-10)(5-11)(6-12)(7-13)(8-14)(9-15)(10-11)(10-12)(11-13)(12-16)(13-17)(14-15)(14-18)(15-19)(16-17)(16-18)(17-19)(18-20)(19-21)
CHEMBL1644028 (VGTPCRGMBIAPIM-UHFFFAOYSA-M)
CHEMBL2207072 (RROSXLCQOOGZBR-UHFFFAOYSA-N)
CNNaS/(1-2)(1-4)
CHEMBL448089 (IBJRLHARKKQREK-UHFFFAOYSA-M)
CHEMBL2069541 (YTCRRZPPDGMMBV-UHFFFAOYSA-N)
CNNaSe/(1-2)(1-4)
CHEMBL1201279 (HCHKCACWOHOZIP-UHFFFAOYSA-N)
CHEMBL1236970 (PTFCDOFLOPIGGS-UHFFFAOYSA-N)
Zn/

@schatzsc
Collaborator

schatzsc commented Dec 5, 2022

I did not manage to check all of the structural pairs individually, but the problem seems to be related to tautomer formation. It is an issue of InChI rather than of TUCAN: all the structures I looked at appear to be the same and should have identical TUCAN strings ;-)
