Commit

Added documentation
RandomDefaultUser committed Jan 14, 2025
1 parent a7d7bd3 commit f0738ce
Showing 4 changed files with 54 additions and 11 deletions.
15 changes: 12 additions & 3 deletions docs/source/advanced_usage/trainingmodel.rst
@@ -170,10 +170,11 @@ data sets have to be saved - in-memory implementations are currently developed.
To use the data shuffling (also shown in example
``advanced/ex02_shuffle_data.py``), you can use the ``DataShuffler`` class.

-The syntax is very easy, you create a ``DataShufller`` object,
+The syntax is simple: you create a ``DataShuffler`` object,
which provides the same ``add_snapshot`` functionalities as the ``DataHandler``
-object, and shuffle the data once you have added all snapshots in question,
-i.e.,
+object, and shuffle the data once you have added all snapshots in question.
+Just as with the ``DataHandler`` class, on-the-fly calculation of bispectrum
+descriptors is supported.

.. code-block:: python
@@ -187,6 +188,14 @@ i.e.,
   data_shuffler.shuffle_snapshots(complete_save_path="../",
                                   save_name="Be_shuffled*")

By using the ``shuffle_to_temporary`` keyword, you can shuffle the data to
temporary files, which can be deleted after the training run. This is useful
if you want to shuffle the data right before training and do not plan to reuse
the shuffled data files for multiple training runs. As detailed in
``advanced/ex02_shuffle_data.py``, access to the temporary files is provided via
``data_shuffler.temporary_shuffled_snapshots[...]``, which is a list containing
``mala.Snapshot`` objects.
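Conceptually, what the shuffler does can be sketched in a few lines of plain Python (this is only an illustration of the idea, not the MALA API; the function name and seed value are hypothetical):

```python
import random

def shuffle_snapshots(snapshots, seed=1234):
    # Pool all samples from all snapshots, permute them with a fixed
    # seed, and split them back into equally sized shuffled "snapshots".
    pooled = [sample for snapshot in snapshots for sample in snapshot]
    random.Random(seed).shuffle(pooled)
    size = len(pooled) // len(snapshots)
    return [pooled[i * size:(i + 1) * size] for i in range(len(snapshots))]

# The same seed always produces the same redistribution, which is
# what makes shuffled data sets reproducible.
first = shuffle_snapshots([[1, 2, 3, 4], [5, 6, 7, 8]])
second = shuffle_snapshots([[1, 2, 3, 4], [5, 6, 7, 8]])
assert first == second
```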

The seed ``parameters.data.shuffling_seed`` ensures reproducibility of data
sets. The ``shuffle_snapshots`` function handles paths in the same way as
the ``DataConverter`` class. Further, via the ``number_of_shuffled_snapshots``
13 changes: 11 additions & 2 deletions docs/source/basic_usage/more_data.rst
@@ -49,7 +49,13 @@ MALA can be used to process raw data into ready-to-use data for ML-DFT model
creation. For this, the ``DataConverter`` class can be used, as also shown
in the example ``basic/ex03_preprocess_data``.
The first thing when converting data is to select how the data should be
-processed. Up until now, MALA operates with bispectrum descriptors as
+processed. As outlined in :doc:`the training documentation <trainingmodel>`,
+there are two ways to provide descriptor data to MALA models. One can either
+precompute files containing descriptors with the ``DataConverter`` class or
+compute descriptor data on-the-fly by providing MALA-generated JSON files
+containing simulation output information. These JSON files can also be
+generated by the ``DataConverter`` class.
+Currently, MALA operates with bispectrum descriptors as
input data (=descriptors) and LDOS as output data (=targets). Their
calculation is performed via

@@ -73,6 +79,8 @@ values are included in the energy grid upon which the LDOS is sampled,
``ldos_gridoffset_ev`` determines the lowest energy value sampled. These values
are chosen for the ``pp.x`` simulation and have to be given here.
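For illustration, the energy grid implied by these settings can be written out explicitly. A plain-Python sketch (not the MALA API), assuming an equidistant grid of the form offset + i·spacing, with hypothetical values:

```python
def ldos_energy_grid(gridoffset_ev, gridspacing_ev, gridsize):
    # The lowest sampled energy is the offset; subsequent values
    # follow in equidistant steps of the given spacing.
    return [gridoffset_ev + i * gridspacing_ev for i in range(gridsize)]

# Hypothetical values: 11 energy levels from -5.0 eV up to 0.0 eV.
grid = ldos_energy_grid(-5.0, 0.5, 11)
assert grid[0] == -5.0 and grid[-1] == 0.0
```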

If descriptors are precomputed, then hyperparameters for their calculation
have to be provided.
For the bispectrum calculation, ``bispectrum_cutoff`` gives the radius of
the cutoff sphere from which information on the atomic structure is incorporated
into the bispectrum descriptor vector at each point in space, whereas
@@ -111,7 +119,8 @@ respectively, and the ``target_units`` will always be ``"1/(Ry*Bohr^3)"``.
The paths have to be modified accordingly. ``simulation_output_*`` refers
to the calculation output file - MALA provides an interface to condense
the entire, verbose simulation output to ``.json`` files for further
-processing. In the preceding section, we had to specify calculation output
+processing or on-the-fly descriptor calculation.
+In the preceding section, we had to specify calculation output
files a number of times - instead, we can use the reduced ``.json`` files
if we let them be created by the ``DataConverter`` class.
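Since the condensed files are ordinary JSON, they can be written and inspected with standard tooling. A sketch with a purely hypothetical layout (the keys shown are illustrative; MALA's actual schema may differ):

```python
import json
import os
import tempfile

# Hypothetical condensed simulation output; the real MALA .json
# schema may differ from these keys.
simulation_output = {
    "atoms": {"number_of_atoms": 2, "elements": ["Be", "Be"]},
    "grid_dimension": [18, 18, 27],
    "fermi_energy_ev": 7.7,
}

# Write the condensed file, then read it back to verify round-tripping.
path = os.path.join(tempfile.mkdtemp(), "Be_snapshot0.info.json")
with open(path, "w") as handle:
    json.dump(simulation_output, handle, indent=2)

with open(path) as handle:
    reloaded = json.load(handle)
assert reloaded == simulation_output
```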

35 changes: 30 additions & 5 deletions docs/source/basic_usage/trainingmodel.rst
@@ -89,10 +89,14 @@ As with any ML library, MALA is a data-driven framework. So before we can
train a model, we need to add data. The central object to manage data for any
MALA workflow is the ``DataHandler`` class.

-MALA manages data "per snapshot". One snapshot is one atomic configuration,
-for which volumetric input and output data has been calculated. Data has to
-be added to the ``DataHandler`` object per snapshot, pointing to the
-where the volumetric data files are saved on disk. This is done via
+MALA manages data "per snapshot". One snapshot is an atomic configuration with
+associated volumetric data. Snapshots have to be added to the ``DataHandler``
+object. There are two ways to provide snapshot data, which are selected by
+providing the respective types of data files.

1. Precomputed descriptors: The LDOS is sampled and the volumetric descriptor
data is precomputed into either OpenPMD or numpy files
(as described :doc:`here <more_data>`), and both can be loaded for training.

.. code-block:: python
@@ -102,12 +106,33 @@ where the volumetric data files are saved on disk. This is done via
   data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                             "Be_snapshot1.out.npy", data_path, "va")

2. On-the-fly descriptors: The LDOS is sampled into either OpenPMD or numpy
files, while the volumetric descriptor data is computed on-the-fly during
training or shuffling. The starting point for the descriptor calculation in
this case is the simulation output saved in a JSON file. This mode is only
recommended if a GPU-enabled LAMMPS version is available. If this route is
used, then descriptor calculation hyperparameters need to be set before
adding snapshots, see the :doc:`data conversion manual <more_data>` for details.

.. code-block:: python

   # Bispectrum parameters.
   parameters.descriptors.descriptor_type = "Bispectrum"
   parameters.descriptors.bispectrum_twojmax = 10
   parameters.descriptors.bispectrum_cutoff = 4.67637

   data_handler = mala.DataHandler(parameters)
   data_handler.add_snapshot("Be_snapshot0.info.json", data_path,
                             "Be_snapshot0.out.npy", data_path, "tr")
   data_handler.add_snapshot("Be_snapshot1.info.json", data_path,
                             "Be_snapshot1.out.npy", data_path, "va")

The ``"tr"`` and ``"va"`` flags signal that the respective snapshots are added
as training and validation data, respectively. Training data is the data the
model is directly tuned on; validation data is used to verify model
performance during training and to make sure that no overfitting occurs.
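The bookkeeping behind these flags can be sketched in plain Python (an illustration of the idea, not the MALA implementation; the function name is hypothetical):

```python
def split_by_flag(snapshots):
    # Group (name, flag) pairs into training and validation lists,
    # mirroring the "tr"/"va" flags passed to add_snapshot.
    splits = {"tr": [], "va": []}
    for name, flag in snapshots:
        splits[flag].append(name)
    return splits

splits = split_by_flag([("Be_snapshot0", "tr"), ("Be_snapshot1", "va")])
assert splits == {"tr": ["Be_snapshot0"], "va": ["Be_snapshot1"]}
```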
After data has been added to the ``DataHandler``, it has to be actually loaded
-and scaled via
+(or in the case of on-the-fly usage, computed) and scaled via

.. code-block:: python
2 changes: 1 addition & 1 deletion examples/basic/ex05_run_predictions.py
@@ -11,7 +11,7 @@
configurations. Either execute ex01 before executing this one or download the
appropriate model from the provided test data repo.
-REQUIRES LAMMPS (and potentiall the total energy module).
+REQUIRES LAMMPS (and potentially the total energy module).
"""

model_name = "Be_model"
