Commit

Added documentation
RandomDefaultUser committed Jan 14, 2025
1 parent a7d7bd3 commit f0738ce
Showing 4 changed files with 54 additions and 11 deletions.
15 changes: 12 additions & 3 deletions docs/source/advanced_usage/trainingmodel.rst
@@ -170,10 +170,11 @@ data sets have to be saved - in-memory implementations are currently developed.
To use the data shuffling (also shown in example
``advanced/ex02_shuffle_data.py``), you can use the ``DataShuffler`` class.

-The syntax is very easy, you create a ``DataShufller`` object,
+The syntax is simple: you create a ``DataShuffler`` object,
which provides the same ``add_snapshot`` functionalities as the ``DataHandler``
-object, and shuffle the data once you have added all snapshots in question,
-i.e.,
+object, and shuffle the data once you have added all snapshots in question.
+Just as with the ``DataHandler`` class, on-the-fly calculation of bispectrum
+descriptors is supported.

.. code-block:: python
@@ -187,6 +188,14 @@ i.e.,
   data_shuffler.shuffle_snapshots(complete_save_path="../",
                                   save_name="Be_shuffled*")

By using the ``shuffle_to_temporary`` keyword, you can shuffle the data to
temporary files, which can be deleted after the training run. This is useful
if you want to shuffle the data right before training and do not plan to reuse
the shuffled data files for multiple training runs. As detailed in
``advanced/ex02_shuffle_data.py``, access to the temporary files is provided via
``data_shuffler.temporary_shuffled_snapshots[...]``, which is a list containing
``mala.Snapshot`` objects.
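Conceptually, what the shuffler does can be sketched in a few lines of plain Python (this is only an illustration of the idea, not the MALA API; the function name and seed value are hypothetical):

```python
import random

def shuffle_snapshots(snapshots, seed=1234):
    # Pool all samples from all snapshots, permute them with a fixed
    # seed, and split them back into equally sized shuffled "snapshots".
    pooled = [sample for snapshot in snapshots for sample in snapshot]
    random.Random(seed).shuffle(pooled)
    size = len(pooled) // len(snapshots)
    return [pooled[i * size:(i + 1) * size] for i in range(len(snapshots))]

# The same seed always produces the same redistribution, which is
# what makes shuffled data sets reproducible.
first = shuffle_snapshots([[1, 2, 3, 4], [5, 6, 7, 8]])
second = shuffle_snapshots([[1, 2, 3, 4], [5, 6, 7, 8]])
assert first == second
```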

The seed ``parameters.data.shuffling_seed`` ensures reproducibility of data
sets. The ``shuffle_snapshots`` function handles paths in the same way as
the ``DataConverter`` class. Further, via the ``number_of_shuffled_snapshots``
13 changes: 11 additions & 2 deletions docs/source/basic_usage/more_data.rst
@@ -49,7 +49,13 @@ MALA can be used to process raw data into ready-to-use data for ML-DFT model
creation. For this, the ``DataConverter`` class can be used, as also shown
in the example ``basic/ex03_preprocess_data``.
The first thing when converting data is to select how the data should be
-processed. Up until now, MALA operates with bispectrum descriptors as
+processed. As outlined in :doc:`the training documentation <trainingmodel>`,
+there are two ways to provide descriptor data to MALA models. One can either
+precompute files containing descriptors with the ``DataConverter`` class or
+compute descriptor data on-the-fly by providing MALA-generated JSON files
+containing simulation output information. These JSON files can also be
+generated by the ``DataConverter`` class.
+Currently, MALA operates with bispectrum descriptors as
input data (=descriptors) and LDOS as output data (=targets). Their
calculation is performed via

@@ -73,6 +79,8 @@ values are included in the energy grid upon which the LDOS is sampled,
``ldos_gridoffset_ev`` determines the lowest energy value sampled. These values
are chosen for the ``pp.x`` simulation and have to be given here.
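For illustration, the energy grid implied by these settings can be written out explicitly. A plain-Python sketch (not the MALA API), assuming an equidistant grid of the form offset + i·spacing, with hypothetical values:

```python
def ldos_energy_grid(gridoffset_ev, gridspacing_ev, gridsize):
    # The lowest sampled energy is the offset; subsequent values
    # follow in equidistant steps of the given spacing.
    return [gridoffset_ev + i * gridspacing_ev for i in range(gridsize)]

# Hypothetical values: 11 energy levels from -5.0 eV up to 0.0 eV.
grid = ldos_energy_grid(-5.0, 0.5, 11)
assert grid[0] == -5.0 and grid[-1] == 0.0
```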

If descriptors are precomputed, then hyperparameters for their calculation
have to be provided.
For the bispectrum calculation, ``bispectrum_cutoff`` gives the radius of
the cutoff sphere from which information on the atomic structure is incorporated
into the bispectrum descriptor vector at each point in space, whereas
@@ -111,7 +119,8 @@ respectively, and the ``target_units`` will always be ``"1/(Ry*Bohr^3)"``.
The paths have to be modified accordingly. ``simulation_output_*`` refers
to the calculation output file - MALA provides an interface to condense
the entire, verbose simulation output to ``.json`` files for further
-processing. In the preceding section, we had to specify calculation output
+processing or on-the-fly descriptor calculation.
+In the preceding section, we had to specify calculation output
files a number of times - instead, we can use the reduced ``.json`` files
if we let them be created by the ``DataConverter`` class.
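Since the condensed files are ordinary JSON, they can be written and inspected with standard tooling. A sketch with a purely hypothetical layout (the keys shown are illustrative; MALA's actual schema may differ):

```python
import json
import os
import tempfile

# Hypothetical condensed simulation output; the real MALA .json
# schema may differ from these keys.
simulation_output = {
    "atoms": {"number_of_atoms": 2, "elements": ["Be", "Be"]},
    "grid_dimension": [18, 18, 27],
    "fermi_energy_ev": 7.7,
}

# Write the condensed file, then read it back to verify round-tripping.
path = os.path.join(tempfile.mkdtemp(), "Be_snapshot0.info.json")
with open(path, "w") as handle:
    json.dump(simulation_output, handle, indent=2)

with open(path) as handle:
    reloaded = json.load(handle)
assert reloaded == simulation_output
```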

35 changes: 30 additions & 5 deletions docs/source/basic_usage/trainingmodel.rst
@@ -89,10 +89,14 @@ As with any ML library, MALA is a data-driven framework. So before we can
train a model, we need to add data. The central object to manage data for any
MALA workflow is the ``DataHandler`` class.

-MALA manages data "per snapshot". One snapshot is one atomic configuration,
-for which volumetric input and output data has been calculated. Data has to
-be added to the ``DataHandler`` object per snapshot, pointing to the
-where the volumetric data files are saved on disk. This is done via
+MALA manages data "per snapshot". One snapshot is an atomic configuration with
+associated volumetric data. Snapshots have to be added to the ``DataHandler``
+object. There are two ways to provide snapshot data, which are selected by
+providing the respective types of data files.

1. Precomputed descriptors: The LDOS is sampled and the volumetric descriptor
data is precomputed into either OpenPMD or numpy files
(as described :doc:`here <more_data>`), and both can be loaded for training.

.. code-block:: python
@@ -102,12 +106,33 @@ where the volumetric data files are saved on disk. This is done via
   data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                             "Be_snapshot1.out.npy", data_path, "va")

2. On-the-fly descriptors: The LDOS is sampled into either OpenPMD or numpy
files, while the volumetric descriptor data is computed on-the-fly during
training or shuffling. The starting point for the descriptor calculation in
this case is the simulation output saved in a JSON file. This mode is only
recommended if a GPU-enabled LAMMPS version is available. If this route is
used, then descriptor calculation hyperparameters need to be set before
adding snapshots, see the :doc:`data conversion manual <more_data>` for details.

.. code-block:: python

   # Bispectrum parameters.
   parameters.descriptors.descriptor_type = "Bispectrum"
   parameters.descriptors.bispectrum_twojmax = 10
   parameters.descriptors.bispectrum_cutoff = 4.67637

   data_handler = mala.DataHandler(parameters)
   data_handler.add_snapshot("Be_snapshot0.info.json", data_path,
                             "Be_snapshot0.out.npy", data_path, "tr")
   data_handler.add_snapshot("Be_snapshot1.info.json", data_path,
                             "Be_snapshot1.out.npy", data_path, "va")

The ``"tr"`` and ``"va"`` flags signal that the respective snapshots are added
as training and validation data, respectively. Training data is the data the
model is directly tuned on; validation data is used to verify model
performance during training and to make sure that no overfitting occurs.
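The bookkeeping behind these flags can be sketched in plain Python (an illustration of the idea, not the MALA implementation; the function name is hypothetical):

```python
def split_by_flag(snapshots):
    # Group (name, flag) pairs into training and validation lists,
    # mirroring the "tr"/"va" flags passed to add_snapshot.
    splits = {"tr": [], "va": []}
    for name, flag in snapshots:
        splits[flag].append(name)
    return splits

splits = split_by_flag([("Be_snapshot0", "tr"), ("Be_snapshot1", "va")])
assert splits == {"tr": ["Be_snapshot0"], "va": ["Be_snapshot1"]}
```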
After data has been added to the ``DataHandler``, it has to be actually loaded
-and scaled via
+(or in the case of on-the-fly usage, computed) and scaled via

.. code-block:: python
2 changes: 1 addition & 1 deletion examples/basic/ex05_run_predictions.py
@@ -11,7 +11,7 @@
configurations. Either execute ex01 before executing this one or download the
appropriate model from the provided test data repo.
-REQUIRES LAMMPS (and potentiall the total energy module).
+REQUIRES LAMMPS (and potentially the total energy module).
"""

model_name = "Be_model"
