-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parser for TUCAN strings using a formal grammar #73
Comments
May I suggest writing a formal grammar as part of this issue or as a prelude, and to use a grammar language that is widely-understood such as W3C EBNF? This will aid those who might want to implement a parser in a different language and/or with different tooling. It also will make the grammar accessible to a wider audience for review and comment. For details and ideas on how to implement, see this article. That link also contains information relevant to the sum formula construct. |
That's definitely a good idea. The format you use in the "Grammar" section of the linked article - is that Rust-specific or a rather generalized variant of EBNF? |
BTW, in addition to the node-associated property name/value pairs, we will also need fragment and possibly full graph-based attribute assignment. In particular the fragment (= connected group of node/atoms) property assignment will be required for handling charges. |
It's W3C EBNF, which is technology-agnostic. The best-known language to use it in its specification is XML (thus the name). There are a lot of parser generator packages, all of which use custom grammar languages. A grammar is the price of admission to using any of them, so creating a grammar introduces no new work. But the format matters, especially in the long term. If the grammar is LL(1) (Balsa and InChI are, for example) then EBNF is a solid approach to producing a technology-agnostic specification. However, certain parser generators don't necessarily read W3C EBNF directly. The one I included in the InChI Grammar package does, but there are other. Now a possibly unpopular opinion - parser generators are overrated for LL(1) languages and will eventually become technical debt. Writing a recursive descent parser offers much more debuggability and maintainability, and far greater freedom than having to deal with auto-generated code. Edit: the one exception a parser generated during language evolution. The InChI Grammar project uses a parser generator as part of a test suite to prevent regressions from being introduced as new rules are added. The intent is identical to unit tests. I outline the technique here and refine it here. There's a lot of abstract theory behind this stuff, but most of it can be disregarded when building an LL(1) parser. That's what I try to do in those articles. |
Turns out the "BNF playground" is really helpful for quick testing: https://bnfplayground.pauliankline.com but other than that, will need quite a while to digest this. |
Yes, it's quite a fire hydrant of terms and concepts. BNF playground unfortunately doesn't accept EBNF. But EBNF Test does. Unfortunately, the latter in my hands chokes on certain grammars/inputs. But it's a great tool for figuring out EBNF. I've used it a lot. |
I think I understand most of the syntax now but what's the purpose of these "control characters"(?!?!?): "." as in formula ::= "/" elemental ( "." elemental )* |
Those are terminals, aka character literal. A formula is separated from what preceded it by a forward slash. InChI allows "dot disconnection" of formula, which is the purpose of the ".". In later layers, disconnected entities are separated by a semicolon. Some examples appear in strings.txt. |
Ah, forgot about that since in TUCAN sum formula, this is not allowed - for example nickel(II) chloride with six "water of crystallization" which one would naively write as NiCl2 . 6 H2O although it really is [Ni(H2O)6]Cl2 would always be written as |
I think I got it working pretty well now - with one exception: How do I disallow an empty
|
Related problem: if "molecule" is only monoatomic, there should be no "bond block" and for bond block definition ("tuple list") the definition itself is also relatively easy with Or is this kind of check not part of formal grammer anymore, like in natural language where you can also construct formally correct sentences which are nevertheless nonsense, like the "invisible pink unicorn"? |
Furthermore, I still have problems to define that a definition is not allowed to appear twice, for example
allows |
Here's one way. I'm not saying it's a good way, but it's a way. formula ::= carbon_then | hydrogen_then
/* etc. */
carbon_then ::= c etc?
hydrogen_then ::= h etc? Basically, you define the starting element for every possible formula, which is just north of 100. There are already that many element definitions, so it just doubles the amount of busy-ness :). It's times like these that it helps to ask "at what cost"? Requiring non-empty formula is possible, but at what cost? The cost here is a lot of complexity. Actually, the Hill formula/sum formula is a tightly-wound bundle of complexity if you think about it. The problem is all of those elements. So expressing formulas precisely is non-trivial to begin with. Disallowing empty formulas raises the bar still higher.
I'd lump it into "semantics." Grammar/syntax defines the universe of possible strings. Semantics define what those strings mean. Some syntactically valid strings will have no semantic meaning. This is Ok and unavoidable.
Does the syntax require a particular ordering of Anyway, it's discussions like this that show how an early focus on tool-independent formal grammar can narrow down problem areas and expose bigger questions. |
Well, one could require node attributes such as Based on the argument of complexity, I'd possibly also go that way to keep the formal grammar relatively simple and leave the "chemical interpretation" for later processing. For example, once a TUCAN string with correct grammar is passed on by the parser, one could easily check for the empty sum formula, tuples in which both index numbers are identical or outside range (larger than number of atoms in the sum formula), fragment number higher than total fragments present, and MASS or RAD out of range to throw an error. About Still need to think about the "graph" and "user/custom" blocks, then will post a "grammar file" here |
@schatzsc My argument was that a parser for TUCAN is supposed to construct a valid graph object and this can also be achieved with a non-canonical TUCAN string.
(*) I absolutely agree that the set of possible node attributes (MASS and RAD at the moment) and their value types should be fixed because this layer is part of the identifier. This was incorrect in my initial proposal for this issue. Multiplicity:
This is also interesting in the other layers:
|
I think the formula layer should strictly adhere to the Hill system, which also seems to be easy to check with the formal grammar suggested by Richard, as it is really about counting the instances of the different elements In the tuple layer, duplicates indeed do not have an impact, since NetworkX The node properties/attributes do not need to be sorted in any way in the non-canonical TUCAN string but we need a rule if an attribute is assigned multiple times differently - invalid and giving an error, first appearance to be considered as valid or last appearance valid? |
We should keep that in mind until we have a reasonable parser implementation and then decide. From my experimentation with yacc I think all three options should be possible. |
I'd like to add a few examples to the limitations of a parser discussion. I started playing with another famous lexer/parser generator called ANTLR (which I may consider instead of PLY or lex/yacc). They also have a lot of nice grammar examples. OpenSMILES grammar:
The parser sees no problems here, but a C compiler will tell us that we cannot assign a floating point number to an integer variable. Thus, after the syntactic analysis (parsing) the compiler performs a semantic analysis, which identifies these problems. Conclusion: Examples of what could be part of a semantic analysis for TUCAN:
|
Let me proceed with a grammar proposal for the Hill system formula. This is based on @rapodaca's approach. I used only a reduced set of element symbols, which later will be extended all 118 elements. Notes:
W3C-EBNF:
ANTLR4:
With a few assumptions on the generated parse tree I already managed to extract the pairs of element symbol and count. Thus, an atom list can be constructed. |
Another defect would be to have a tuple with the same node index, for example @JanCBrammer @schatzsc
|
There is some discrepancy in the treatment of the hydrogens - the Hill system in the strict sense requires that "the number of carbon atoms in a molecule is indicated first, the number of hydrogen atoms next, and then the number of all other chemical elements subsequently, in alphabetical order of the chemical symbols. When the formula contains no carbon, all the elements, including hydrogen, are listed alphabetically." https://en.wikipedia.org/wiki/Chemical_formula However, there are also instances where hydrogen is always listed first ... |
Such a "self-reference" is not allowed and should raise an exception, although I think the canonicalization algorithm will not be affected, since it only operates on the node properties, thus a "self-loop" will only be seen as an additional "bond". But it is still better not to allow this at all, since it is physically meaningless |
Can you please check that in my draft PR? There are already tests for the grammar, i.e. testing what can be parsed or not. I'd be happy to have more examples there. The ANTLR grammar file with all elements can be found here, but it's not complete yet. |
Somehow it just shows me that the test has passed but I cannot easily see input and results?!? |
You could add something wrong to the parameter list of @pytest.mark.parametrize(
"sum_formula",
[
"CHCl3",
"ClH",
"H2",
"C123456789Zr987654321",
"HCl",
],
)
def test_can_parse_sum_formula(sum_formula):
_parse_sum_formula(sum_formula) will fail with
|
I'm still not sure about the order of H in case of no carbon atoms. It seems to be a bit weird although formally correct that ClH is correct but HCl not, which is due to the alphabetical order, as HNO3 would be more natural. Interestingly, sulfuric acid would also be according to Hill HO4S. Another important test would be element names which start with the same letter but then are one- vs. two-letter: "A list of formulae in Hill system order is arranged alphabetically, as above, with single-letter elements coming before two-letter symbols when the symbols begin with the same letter (so "B" comes before "Be", which comes before "Br")" |
I'm just sticking to these rules. 😉 How could this "human readable" formula be expressed? It's not "take C first, then H, then all other elements ordered by their atomic number".
This is exactly what |
German Wikipedia explains it: sorting by electronegativity. |
Running into troubles in the roundtrip tests (molfile -> graph -> TUCAN (using
I'm not tempted to add |
We should definitely not go down this "sorting by electronegativity" road. Let's stay with the Hill system and decide whether to always put the hydrogen before all other non-C/non-H elements or sort it by alphabet when no carbon is present. |
In contrast to |
What's the status on this decision? |
Keep the Hill formula rules. This is consistent with the existing implementation in |
As with many things in cheminformatics and beyond, use is somewhat inconsistent and you can have long discussion what is more "natural" but yes, let's stay with published Hill rules. |
To be clear, the path forward would be to encode the formula layer using Hill order. Carbon-free formulas list symbols alphabetically and carbon-containing formulas start with C, then H, then list the rest alphabetically. The only time H appears first is when (1) there is no C; and (2) there are no lexicographically smaller elements. This complexity would be handled at the level of syntax (rather than semantics). Yes so far? I ask because this was a stumbling point for me in learning TUCAN. I thought that elements were mapped to indexes in Hill order, when they were actually mapped in atomic number order. To actually perform the mapping an implementation needs to read the Hill formula, sort by ascending atomic number, then map. I suspect the burning question on the mind of anyone working with this format would be: why not just list the element symbols by ascending atomic number? If changing the formula ordering were on the table, a case could be made to use atomic number order for that reason alone, not to mention syntactic simplification. It may look "unnatural" but would align with how TUCAN is set up now. |
Thank you very much for your input - always highly appreciated. Increase of node index with atomic number is a key element of TUCAN and helps for example with analyzing both metal coordination environments, as these tend to have the highest index (with the exception of Li and Be) and H always the lowest. Is also used in some experimental features not in the official repository so far such as tautomer detection and implicit to explicit H pre-processor check, where CSD structures are taken, "pruned" of the lowest index non-bridging Hs, then Hs re-added by the pre-processor and Tucan strings before/after compared, if pre-processor works both have to be identical, which is acutally the case with approx. 80% of the non-polymeric, non-disordered organic subset (about 425.000 out of a total of 1.2 mio. structures) - and the 20% which are not correctly handled yet are mainly boron clusters and other main group element compounds with unusual "valences" (= coordination numbers) where algorithmic addition becomes difficult. However, the actual order of elements in the sum formal does not matter as long as you can easily separate the individual elements and their stoichiometric coefficients. Fragment from my "TUCAN playground":
So what is done is kind of a "reverse lookup" where the ELEMENT_PROPS dictionary is scanned and then for each element checked if it was present in the sum formula and then a number of nodes added equivalent to the stoichiometric coefficient stored in element_count Since the order of the addition is governed by the order of elements in ELEMENT_PROPS, which by itself is ordered by increasing atomic number https://github.com/TUCAN-nest/TUCAN/blob/main/tucan/element_properties.py, it does not matter how they are ordered in the element_count, since the loop just "pulls out" whatever element is up for addition now. Therefore, although ClH and H2O4S look strange to me, I guess the simplest thing is to just stick to the Hill order, but also more for consistency. Not sure what happens with element_count construction if there are multiple instances of an element in the sum formula such like HOS4H but that should not be allowed anyway. |
+1 for the proposed simplification. |
I'd like to strongly suggest to stick with the system that has been in place for 120 years now and which is the Hill order. It's what most chemists are used to, it is not that terribly difficult, and internally, it does not matter at all due to the "reverse look-up" via the ELEMENT_PROPS dictionary as outlined above. |
Implement a parser for TUCAN strings which generates the corresponding NetworkX Graph. The parser should be generated from a grammar file using ANTLR4.
API changes:
graph_from_tucan(tucan: str) -> nx.Graph
to thetucan.io
moduleDependency changes
TUCAN format:
See discussion below.
Tasks:
The text was updated successfully, but these errors were encountered: