-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using non-unique cycle finding as invariant #52
Comments
Thanks for the issue! My (for now unverified) assumption (aka the mother of all fuck-ups) is that CycleCut only identifies chordless cycles as in https://en.wikipedia.org/wiki/Induced_path#atomic_cycles:
Yes, I was wondering about that too. How does splitting automorphism groups influence the correctness of the canonicalization?
|
I'm pretty sure it means it fundamentally won't work - would need to think about it to construct a counter example or just run a shuffle test (randomising atom order) over a sufficiently large set. It is a little suspicious that "rigid graph benchmark" (I'd not seen that before but is cool - thanks!) seemed to be trivial, NAUTY/SAUCY etc are something else (as in hyper efficient) and a benchmark that defeats them but is easily passed with a simple cycle invariant seems odd. In the RDKit canonical paper (below) there is a strong invariant which I think NAUTY also likes to use as well (in that case it's configurable though). I think some people call it the "degree vector" but essentially it's a BFS from each node and you write down how many nodes are distance 0, 1, 2, 3, n. away. IIRC there is a slightly stronger variant in the RDKit paper. You will still likely go fast on the rigorous graph benchmark since I presume the NetworkX (was it?) is not backtracking and building the automorphism set like NAUTY etc. You have the right idea and disregarding stereochemistry it's likely possible to handle the majority (perhaps all) chemical graphs with a strong enough graph invariant and avoid the expensive backtracking of NAUTY/InChI - the balance is you can't "know" you have a unique answer just that you "probably" do. https://pubs.acs.org/doi/10.1021/acs.jcim.5b00543 |
One thing that is confusing is I would expect something simple like:
to generate different ID's - since the invariant code uses the cycleid (which depends on the order of the cycle in the molecule) but this is not the case:
I suspected that the encoding "gets lucky" with a starting point, we can force it to start somewhere else:
|
Sorry my mistake - that first one is different! "/14-15" and "/15-16" at the end. Was hard to spot with all the numbers. |
You're right, it seems like my implementation of CycleCut does return cycles with chords (see cycle
Will come back to this next week, thanks for the input so far! |
Just to be clear there are two separate (but related issues).
|
@johnmay, I think I got to the bottom of the issue. The cycle ID is indeed unsuitable as an invariant for the two reasons you mention above. The tests did not catch these problems since the tests were wrong. Turns out, your suspicion of the passing test on the Here's the problem: Back to the drawing board I guess. I'll replace the cycle invariant with the "degree vector" you're suggesting above or the one proposed in https://doi.org/10.1186/s13321-020-00453-4. Thanks again for digging into this, and for the (counter-)examples above, they led me to the problem. I needed a second pair of eyes on this! |
Closing with 77f67dc. |
Which approach does the closing commit take? Degree vector, the one proposed by Krotko, or something else? |
I had not heard of the CycleCut method before and it appears to be a non-unique cycle basis. Finding cycles using a breath-first search you essentially end up with multiple shortest paths and you get one dependant on the input order. Here are some investigations and explanation - at the very least I think it needed a clearer explanation in the manuscript.
A simple to follow case is the following:
I get 3 rings here, 2x size 6 and 1x size 12. The size 12 one will depend on the order the BFS traversed the nodes/edges.
Using the print_molecule to get the graph invariants and presuming the indexes match the SMILES I can use CXSMILES to display as follows:
To me this looks the invariant is splitting atoms which are symmetric - i.e. it's not invariant? Beyond the non-uniqueness using the cycle id like this means you also introduce an artificial split.
Useful reading on ring/cycle algorithms:
https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-6-3
https://www.combinatorics.org/ojs/index.php/eljc/article/view/v4i1r9/pdf
The text was updated successfully, but these errors were encountered: