[WIP] Refactoring to include outputs parsing #80

gbrunin · 2024-09-09T13:14:43Z

I have changed the structure of the code to better reflect the distinction between input and output parsers. Concerning the inputs, all the tests that were already there still pass, it's only a matter of adapting the namespace. For the outputs, I created objects such as PwOutput, that inherit from an abstract BaseOutput, that for now can be instantiated with a from_dir method (a from_files method could/should be added as well). A user with a job that ran in a given directory could get the outputs easily using this classmethod. In these from_dir methods, specific XML and/or standard output Parsers would be used to get the results. Each Parser would parse a single file, and the logic of parsing and extracting the outputs from the different codes would have to be implemented in each from_dir method. For instance, a NebOutput.from_dir would parse the standard output of the global computation and probably the standard outputs and/or XML files for each image. The extracted outputs would be stored as a simple dictionary and these objects would not rely on any external package.
Then, in qe_tools.extractors, ASE and pymatgen objects could be constructed (e.g., ase.Atoms/pymatgen.Structure, band structures,...), allowing each to be optional dependencies.

This is only the base logic of the new structure and many things remain to be implemented. The idea now is to see what breaks with this in aiida-quantumespresso and how the parsing could be moved from there to here. Then, more will be added depending on the needs.

…s. Made many functions as public since they could be useful to users and outside of their context.

…onversions should take place. In all other files, ase and pymatgen should not be used.

…ibility regarding the number of output files for each type of computation, e.g. NEB.

Several small changes are made to the `PwOutput` class: 1. Use `pathlib` instead of `os.path` for path definitions. 2. `.from_dir` method: Only look for files from the specified directory using list comprehension. 3. Adapt approach for identifying the `stdout` file to only use the first 5 lines to look for the `Program PWSCF` string. 4. Remove the try-except loop that simply catches a general exception and raises a `ValueError` instead. This would obscure the actual error without an immediate benefit. 5. Define the `d_out_xxx` variables as empty dictionaries before parsing, and redefine them in case the data is parsed successfully. 6. Allow the user to pass a `Path` instead of a string to specify the directory to parse from.

Currently the `PwStdoutParser` nor its parent class `BaseOutputFileParser` parse any of the Quantum ESPRESSO `stdout`. Here we add the `BaseStdoutParser` class that has some very minimal parsing features for parsing the code name and version, as well as the walltime of the calculation. A utility function is also added to convert the Quantum ESPRESSO walltime output into seconds. The constructor of the `BaseOutputFileParser` is also adapted to remove the default to `None`. Since there is currently no way of providing the output content after construction, it should be provided at that time for the class to be useful.

The first schema used for the XML output of Quantum ESPRESSO `pw.x` was still missing. Here it is added, along with a set of tests for the XML parsing of default runs.

mbercx · 2024-11-21T14:44:08Z

@gbrunin thanks again for the hard work in restructuring the package. I've added some basic stdout parsing, tests etc, and fixed up the package to run the CI properly on the currently active Python versions (3.9 - 3.13).

Now I'm adding some basic documentation, also on the design decisions. We should consider these carefully, particularly the UX, since it's hard to change these afterwards. I don't care too much about breaking backwards compatibility if we can converge on a more user-friendly tool.

mbercx · 2024-11-21T15:14:58Z

Also quick shoutout to @elinscott and his package for QE inputs:

https://github.com/elinscott/qe_input_prototype

He's already indicated he'd be fine with us integrating that work into qe-tools, so we should consider it in our design discussions.

mbercx · 2024-11-21T22:58:31Z

@gbrunin I've also added some basic documentation and design notes. I'm still not quite sure about some of the APIs. I do like the general design now though, I'm just also wondering if:

We should also add one class that can deal with both inputs and outputs. IIRC some of the output parsing requires, or can be significantly facilitated by having access to the inputs. Since we already use the "parser" nomenclature for the output parsers, I might call this e.g. PwCalc, and then it contains both the inputs and outputs.
In time, we want a PwInput class, that allows for both generating an input file and parsing an existing one. I'm not sure if in this case we need to have the Parser class intermediary.
We should already work a bit on separating the raw parsed data of the XML and actual useful outputs. The raw data is not really that user-friendly, clearly.

I've also made a little image to represent the design:

…ec to utils. Renamed BaseStdoutParser.parse_stdout_base to parse for compatibility with parent abstract class.

This currently serves no real purpose, as the executable can be easily identified from the class itself. In case we do find a use case later, it can always be added again easily.

davidwaroquiers · 2025-01-31T14:52:05Z

pyproject.toml

    'numpy',
    'scipy',
    'packaging',
+    'xmlschema',
 ]


Are all these necessary ? I think it would be best to restrict to the strict minimum. I would say that for parsing or generating input even numpy should not be needed (let alone scipy ?). I'd be in favor of removing them but I don't know where/why they are used. For the packaging dependency, it has itself no dependency so I think it's probably fine to keep it even though it's probably possible to get around without it.. For xmlschema, I would also say it seems fine as xmlschema has just one dependency itself, elementpath, which itself has no dependency. Any thoughts on numpy/scipy removal ?

Just looked, scipy is used only to get the norm of a few vectors in one file. I think this is definitely something that should then probably be removed from the dependencies (also from the fact that if numpy is kept as a required dependency, the norm can also be obtained by numpy, but for numpy, my comment below holds too :)).
As for numpy, it's used in a few more places, but 1/ for very simple maths, vector operations and 2/ for some conversion functions. For the latter point, maybe the "lowest-level" qe tools could be installed without it (the "client" code, be it aiida, pymatgen or else) has to follow the exact API that the lowest level functions require. And numpy could be an optional dependency for "activating" helper functions if the user wants it. Any thoughts ?

mbercx force-pushed the develop branch 3 times, most recently from 6d45efa to 35a667f Compare November 21, 2024 12:41

mbercx and others added 12 commits November 21, 2024 14:09

✨ First draft of PwParser

e95d85f

Modified the structure of the code to split between inputs and output…

17dfd54

…s. Made many functions as public since they could be useful to users and outside of their context.

Refactored structure with inputs/outputs.

9283644

Abstraction of base outputs, added ase and pymatgen files where the c…

3135fc4

…onversions should take place. In all other files, ase and pymatgen should not be used.

Added pycharm stuff in gitignore.

2ffcbc7

Added general parse method to output files.

bbc3e9e

Updated imports in tests.

d73b0f0

Added pymatgen and ase checks in respective files.

8ccf333

Refactoring with Outputs objects relying on Parsers. This allows flex…

3e58f13

…ibility regarding the number of output files for each type of computation, e.g. NEB.

✨🧪 Add missing qes-1.0.xsd schema and XML parsing tests

a72eb3f

The first schema used for the XML output of Quantum ESPRESSO `pw.x` was still missing. Here it is added, along with a set of tests for the XML parsing of default runs.

mbercx force-pushed the develop branch from 35a667f to 35a1cbc Compare November 21, 2024 13:10

🔧 Update dependencies; fix pre-commit & tests

de4f8c0

mbercx force-pushed the develop branch from f362f5d to de4f8c0 Compare November 21, 2024 13:54

📚 [WIP] Add some basic documentation and design notes

c31f570

mbercx mentioned this pull request Nov 21, 2024

✨ Add new PwParser class #79

Closed

gbrunin and others added 2 commits November 22, 2024 11:01

Use annotations from future for cleaner type hint. Moved qe_time_to_s…

90b3608

…ec to utils. Renamed BaseStdoutParser.parse_stdout_base to parse for compatibility with parent abstract class.

👌 Remove the executable input from the output classes

c737b43

This currently serves no real purpose, as the executable can be easily identified from the class itself. In case we do find a use case later, it can always be added again easily.

gbrunin mentioned this pull request Jan 31, 2025

Combining efforts with qe-tools Griffin-Group/pymatgen-io-espresso#24

Open

davidwaroquiers reviewed Jan 31, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Refactoring to include outputs parsing #80

[WIP] Refactoring to include outputs parsing #80

gbrunin commented Sep 9, 2024

mbercx commented Nov 21, 2024

mbercx commented Nov 21, 2024

mbercx commented Nov 21, 2024

davidwaroquiers Jan 31, 2025

davidwaroquiers Jan 31, 2025 •

edited

Loading

[WIP] Refactoring to include outputs parsing #80

Are you sure you want to change the base?

[WIP] Refactoring to include outputs parsing #80

Conversation

gbrunin commented Sep 9, 2024

mbercx commented Nov 21, 2024

mbercx commented Nov 21, 2024

mbercx commented Nov 21, 2024

davidwaroquiers Jan 31, 2025

Choose a reason for hiding this comment

davidwaroquiers Jan 31, 2025 • edited Loading

Choose a reason for hiding this comment

davidwaroquiers Jan 31, 2025 •

edited

Loading