From 689f11de760fa5d79f9b8db965d9dd9b9bbbd7e2 Mon Sep 17 00:00:00 2001 From: melt Date: Tue, 29 Oct 2024 10:32:31 +0000 Subject: [PATCH 1/8] docs: update installation instructions for CAM with FTorch --- README.md | 164 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 98 insertions(+), 66 deletions(-) diff --git a/README.md b/README.md index 5e0d00bac9..8a234d08d4 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,6 @@ developed as part of the DataWave project. The main working branch for this project is `datawave_ml` which was originally branched from the `cam6_3_139` tag. - ## Using this model ### Obtaining CAM @@ -16,40 +15,39 @@ branch on which this work is based: ``` git clone https://github.com/DataWaveProject/CAM.git cd CAM -git checkout datawave_ml +git checkout coupling ``` This branch is built upon the `cam6_3_139` tag from the [main ESMCOMP/CAM repository](https://github.com/ESCOMP/CAM). -### Preparing FTorch and setting the CIME component in CESM +### Obtaining `FTorch` -We also need to build and link _FTorch_ which will allow us to use our PyTorch-based -neural nets in CAM. This has two steps. +To use PyTorch-based neural nets in CAM, we first need to build and link `FTorch`. -#### Building FTorch +To install `FTorch` on Derecho follow the instructions in section [`FTorch` on Derecho](#ftorch-on-derecho) below. -You need to build and install _FTorch_ locally on the system following the instructions -[in the documentation](https://github.com/Cambridge-ICCS/FTorch). -Note the location of the install as this will be required later when building CAM. +> [!NOTE] +> The location of the `FTorch` install will be required later when building CAM. -For specifics on building FTorch on Derecho to be compatible with CAM see the section -[_FTorch_ on Derecho](#ftorch-on-derecho) below. 
+> [!NOTE] +> If you want to build `FTorch` on another system, please follow the general instructions in the `FTorch` +> [documentation](https://github.com/Cambridge-ICCS/FTorch). +### Obtaining `FTorch`-compatible CIME -#### Obtaining _FTorch_-compatible CIME +This fork of CAM will use an `FTorch`-compatible version of the CIME build system. This is specified under the `[cime]` section +of the `Externals.cfg` file in the main CAM directory. -This fork of CAM will use an FTorch-compatible version of the CIME buildsystem. -This is specified under the `[cime]` section of the `Externals.cfg` file in the -main CAM directory. For more information see -[_FTorch_-compatible CIME](#ftorch-compatible-cime) below. +To set up CIME to use `FTorch`, follow the instructions in section [`FTorch`-compatible CIME](#ftorch-compatible-cime) +below. ### Checkout externals -You can now run, from within the CAM root directory, +To fetch the external components, run the following command from within the CAM root directory: + ``` ./manage_externals/checkout_externals ``` -to fetch the external components. ### Creating and running a case @@ -63,43 +61,65 @@ from `CAM/cime/scripts/`. You can then navigate to the case directory at ``. -#### Building with _FTorch_ +### Build CAM with `FTorch` -To couple to FTorch modify `Tools/Makefile` line 600 to set the environment variable -FTORCH_LIB to the location of the FTorch library on your system. +Before we can run `./case.build` we first need to make some manual changes to the `Makefile` located in +`path_to_testcase_directory>/Tools/Makefile`. This will allow the CIME build system to locate `FTorch`. -On Derecho this is `/path/to/ftorch/bin/ftorch_intel` as defined below in -[_FTorch_ on Derecho](#ftorch-on-derecho). +From the test case directory, modify `Tools/Makefile` line 602 to set the environment variable `FTORCH_LIB` to the location of +the `FTorch` library on your system.
-#### Setting up case details +If you followed the instructions in section [`FTorch` on Derecho](#ftorch-on-derecho) this will be `$HOME/FTorch/bin/ftorch_intel`. -We can now run `./case.setup` from within the case directory. -Once this has been done then edit the generated `user_nl_cam` in the case directory -as required. -Add the following lines: +```make +FTORCH_LIB := $HOME/FTorch/bin/ftorch_intel +``` -#. `gw_convect_dp_ml='on'`\ - This is the switch to use our new ML convective-gw scheme instead of the default.\ - Other options are `'off'` (default - use original), `'bothoff'` (run both schemes - but use default for simulation), and `'bothon'` (run both but use ML for simulation). -#. `gw_convect_dp_ml_net=''`\ - The path to the your saved PyTorch model. +### Setting up case details -Also consider adding: -``` -fincl = 'MYVAR' +We can now run `./case.setup` from within the case directory. Once this has been done then edit the generated `user_nl_cam` in +the case directory as required. Add the following lines: + +```fortran +gw_convect_dp_ml=.false. +gw_convect_dp_ml_compare=.true. +gw_convect_dp_ml_net_path='/path/to/neural/net' +gw_convect_dp_ml_norms='/path/to/norms' ``` -to generate output diagnostics of variables as desired. + +* `gw_convect_dp_ml` (`logical`) + + Whether or not to use the ML scheme for gravity waves produced by deep convection. Default: `.false.` + +* `gw_convect_dp_ml_compare` (`logical`) + + Whether or not to run a piggybacking comparison of the ML deep convection gravity waves to the original scheme. Only one + scheme will be used to advance the simulation as dictated by `gw_convect_deep_ml`. Default: `.false.` + +* `gw_convect_dp_ml_net_path` + + Absolute filepath to the deep convection gravity wave neural net used when `gw_convect_dp_ml` is set to `.true.` (`.pt` + extension). 
+ +* `gw_convect_dp_ml_norms` + + Absolute filepath to the deep convection gravity wave normalisation weights (NetCDF) used when `gw_convect_dp_ml` is set to `.true.`. + +> [!TIP] +> Consider adding the following to write additional diagnostic variables (replacing the placeholder `MYVAR`) to the primary history file: +> +> ```fortran +> fincl1 = 'MYVAR' +> ``` We can then run `./case.build` from within the case directory to build the model. The case can be run with `./case.submit` from the case directory. -**Note:**\ -By default CESM will place output in `/glade/scratch/user/case/` -and logs/restart files in `/glade/scratch/user/archive/case/`. -To place all output with logs in `archive/case` switch 'short term archiving' on by -editing `env_run.xml` in the case directory to change `DOUT_S` from `FALSE` to `TRUE`. +> [!NOTE] +> By default CESM will place output in `$SCRATCH/case/` and logs/restart files in `$SCRATCH/archive/case/`. +> To place all output with logs in `archive/case` switch 'short term archiving' on by running `./xmlchange DOUT_S=FALSE` in the +> case directory to change `DOUT_S` from `TRUE` to `FALSE`. ## NOTE: This is **unsupported** development code and is subject to the [CESM developer's agreement](http://www.cgd.ucar.edu/cseg/development-code.html). @@ -109,37 +129,48 @@ editing `env_run.xml` in the case directory to change `DOUT_S` from `FALSE` to ` Please see the [wiki](https://github.com/ESCOMP/CAM/wiki) for complete documentation on CAM, getting started with git and how to contribute to CAM's development. -### _FTorch_ on Derecho +### `FTorch` on Derecho -The following steps can be followed to ensure FTorch is built to be consistent +The following steps can be followed to ensure `FTorch` is built to be consistent with CAM on Derecho. -On Derecho `libtorch` should be loaded using -``` -module load libtorch/2.1.2 -``` -and used to build _FTorch_.\ -Further, for compatibility with CAM we need to be specific about the environment and -compilers we load.
-The following sequence of modules are required to build FTorch compatible with the intel -build of CAM on Derecho: -``` +For compatibility with CAM we need to be specific about the environment and compilers we load. The following sequence of modules +are required to build `FTorch` compatible with the intel build of CAM on Derecho: + +```bash module purge module load ncarenv/23.06 module load intel-oneapi/2023.0.0 module load mkl module load cmake -module load libtorch/2.1.2 module load cuda/11.7.1 ``` -Note that in future builds or releases, or on different machines, the environment for -building CAM may change. -In this case the FTorch environment should be updated accordingly. -FTorch can then be built and installed from `/path/to/ftorch/src/build/` as described in the -documentation with: +> [!NOTE] +> In future builds or releases, or on different machines, the environment for building CAM may change. In this case the `FTorch` +> environment should be updated accordingly. + +> [!NOTE] +> When building `FTorch` with `cmake` in the next step, we use the absolute path to `libtorch`. Previously, we loaded the +> `libtorch` module i.e., `module load libtorch/2.1.2` but this conflicts with `ncarenv/23.06`. We do not actually need to load +> the `libtorch` module to use the absolute path, so for now we can ignore this. + +#### obtain Ftorch source + +```bash +cd $HOME +git clone git@github.com:Cambridge-ICCS/FTorch.git +cd $HOME/FTorch/src ``` + +#### build Ftorch + +`FTorch` can then be built and installed from `$HOME/FTorch/src` as described in the documentation with: + +```bash +mkdir -p build && cd build cmake .. \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_Fortran_COMPILER=ifort \ @@ -147,19 +178,19 @@ cmake .. \ -DCMAKE_CXX_COMPILER=icpc \ -DCMAKE_PREFIX_PATH=/glade/u/apps/opt/libtorch/2.1.2 \ -DCMAKE_INSTALL_PREFIX=../../bin/ftorch_intel/ - cmake --build . --target install ``` -This will build FTorch and install it to `/path/to/ftorch/bin/ftorch_intel`.
+This will build `FTorch` and install it to `$HOME/FTorch/bin/ftorch_intel`. -### _FTorch_-compatible CIME +### `FTorch`-compatible CIME We need to use a version of the CIME build system that is capable of linking -our code to FTorch when building CAM. +our code to `FTorch` when building CAM. To do this we have modified the `Externals.cfg` file in the main CAM directory to replace the CIME entry with: + ``` [cime] branch = ftorch_gw @@ -168,8 +199,9 @@ repo_url = https://github.com/Cambridge-ICCS/cime_je local_path = cime required = True ``` + which points to the [ICCS fork](https://github.com/Cambridge-ICCS/cime_je) of CIME -that allows components to be built with FTorch. +that allows components to be built with `FTorch`. Specifically it points to a branch based off of the `cime6.0.175` tag that is compatible with the latest version of CIME used with this version of CAM (this is the cime tag From 90f77d0594b772b7655a9953144739b03f9c55d7 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:43:55 +0000 Subject: [PATCH 2/8] docs: update FTorch build instructions in README.md Co-authored-by: Jack Atkinson <109271713+jatkinson1000@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8a234d08d4..9097223250 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,7 @@ This branch is built upon the `cam6_3_139` tag from the ### Obtaining `FTorch` -To use PyTorch-based neural nets in CAM, we first need to build and link `FTorch`. +To use PyTorch-based neural nets in CAM, we use [`FTorch`](https://github.com/Cambridge-ICCS/FTorch) which needs to be built on the system before we build CAM. To install `FTorch` on Derecho follow the instructions in section [`FTorch` on Derecho](#ftorch-on-derecho) below.
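Patch 1/8 asks for a manual edit of `Tools/Makefile` at a fixed line number (602, moved from 600 by that same patch). Because line numbers can drift between CIME versions, a pattern-based edit may be more robust. The following is a minimal sketch under stated assumptions: GNU `sed` is available (for `-i`), the case's `Tools/Makefile` already contains a single-line `FTORCH_LIB` assignment using `=`, `:=` or `?=`, and the install path contains no `|` or `&`. The helper name `set_ftorch_lib` is hypothetical and is not part of CIME or this repository.

```bash
# Sketch: point FTORCH_LIB in a case's Tools/Makefile at an FTorch install
# without relying on a fixed line number. set_ftorch_lib is a hypothetical helper.
# Assumes GNU sed (for -i) and an existing single-line FTORCH_LIB assignment
# using =, := or ?= ; the install path must not contain '|' or '&'.
set_ftorch_lib() {
  if [ "$#" -ne 2 ]; then
    echo "usage: set_ftorch_lib <Tools/Makefile> <ftorch_install_dir>" >&2
    return 2
  fi
  local makefile=$1
  local ftorch_dir=$2
  # Matches "FTORCH_LIB = ...", "FTORCH_LIB := ..." or "FTORCH_LIB ?= ...",
  # but not "FTORCH_LIBS = ..." or "FTORCH_LIB += ...".
  local pattern='^[[:space:]]*FTORCH_LIB[[:space:]]*[:?]\{0,1\}='

  if ! grep -q "$pattern" "$makefile"; then
    echo "no FTORCH_LIB assignment found in $makefile" >&2
    return 1
  fi

  # Keep a backup, then rewrite every matching assignment in place.
  cp "$makefile" "$makefile.bak" || return 1
  sed -i "s|${pattern}.*|FTORCH_LIB := ${ftorch_dir}|" "$makefile"
}
```

For example, after sourcing the function, run `set_ftorch_lib Tools/Makefile "$HOME/FTorch/bin/ftorch_intel"` from the case directory. Passing the path through the shell means `$HOME` is expanded before it reaches the Makefile; a literal `$HOME` inside a Makefile would instead be read by make as the variable `$H` followed by `OME`, which is why `$(HOME)` is the make spelling.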
From abaabee579c2edd8831d736489e59cd8e2f27520 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:44:40 +0000 Subject: [PATCH 3/8] docs: fix typo in README.md Co-authored-by: Jack Atkinson <109271713+jatkinson1000@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9097223250..c4315bdf09 100644 --- a/README.md +++ b/README.md @@ -64,7 +64,7 @@ You can then navigate to the case directory at ``. ### Build CAM with `FTorch` Before we can run `./case.build` we first need to make some manual changes to the `Makefile` located in -`path_to_testcase_directory>/Tools/Makefile`. This will allow the CIME build system to locate `FTorch`. +`/Tools/Makefile`. This will allow the CIME build system to locate `FTorch`. From the test case directory, modify `Tools/Makefile` line 602 to set the environment variable `FTORCH_LIB` to the location of the `FTorch` library on your system. From 9a3629e9ce9d8f524b570c59c1267be5c64b9590 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:45:50 +0000 Subject: [PATCH 4/8] docs: better explanation of `DOUT_S` xml parameter Co-authored-by: Jack Atkinson <109271713+jatkinson1000@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c4315bdf09..6e4d926807 100644 --- a/README.md +++ b/README.md @@ -117,7 +117,7 @@ We can then run `./case.build` from within the case directory to build the model The case can be run with `./case.submit` from the case directory. > [!NOTE] -> By default CESM will place output in `$SCRATCH/case/` and logs/restart files in `$SCRATCH/archive/case/`. +> By default CESM will place outputs in `$SCRATCH/case/` essential parts of which will be moved to `$SCRATCH/archive/case/` after run completion. 
> To place all output with logs in `archive/case` switch 'short term archiving' on by running `./xmlchange DOUT_S=FALSE` in the > case directory to change `DOUT_S` from `TRUE` to `FALSE`. From 967ae722aba39165f195efe8e0250d6a383f2da0 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:46:55 +0000 Subject: [PATCH 5/8] docs: correct explanation of setting `DOUT_S=FALSE` Co-authored-by: Jack Atkinson <109271713+jatkinson1000@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6e4d926807..acf5e7535d 100644 --- a/README.md +++ b/README.md @@ -118,7 +118,7 @@ The case can be run with `./case.submit` from the case directory. > [!NOTE] > By default CESM will place outputs in `$SCRATCH/case/` essential parts of which will be moved to `$SCRATCH/archive/case/` after run completion. -> To place all output with logs in `archive/case` switch 'short term archiving' on by running `./xmlchange DOUT_S=FALSE` in the +> To leave all output in `$SCRATCH/case/` switch 'short term archiving' off by running `./xmlchange DOUT_S=FALSE` in the > case directory to change `DOUT_S` from `TRUE` to `FALSE`. ## NOTE: This is **unsupported** development code and is subject to the [CESM developer's agreement](http://www.cgd.ucar.edu/cseg/development-code.html). From 1abe4c93ea87a80edf8d2bed82f503cf15af58a5 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:48:00 +0000 Subject: [PATCH 6/8] docs: better description of CAM version specifics Co-authored-by: Jack Atkinson <109271713+jatkinson1000@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index acf5e7535d..3b7f821939 100644 --- a/README.md +++ b/README.md @@ -136,7 +136,7 @@ with CAM on Derecho. #### load CAM environment -For compatibility with CAM we need to be specific about the environment and compilers we load. 
The following sequence of modules +For compatibility with the version of CAM we are using (branched from the `cam6_3_139` tag) we need to be specific about the environment and compilers we load. The following sequence of modules are required to build `FTorch` compatible with the intel build of CAM on Derecho: ```bash From 269a5d0b592b1285b415601819a42795128ade76 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:50:02 +0000 Subject: [PATCH 7/8] docs: remove unnecessary info about libtorch module --- README.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/README.md b/README.md index 3b7f821939..0e499e7e01 100644 --- a/README.md +++ b/README.md @@ -152,11 +152,6 @@ module load cuda/11.7.1 > In future builds or releases, or on different machines, the environment for building CAM may change. In this case the `FTorch` > environment should be updated accordingly. -> [!NOTE] -> When building `FTorch` with `cmake` in the next step, we use the absolute path to `libtorch`. Previously, we loaded the -> `libtorch` module i.e., `module load libtorch/2.1.2` but this conflicts with `ncarenv/23.06`. We do not actually need to load -> the `libtorch` module to use the absolute path, so for now we can ignore this. - #### obtain Ftorch source ```bash From f8c4ae9d6466d2631e998396ba86e6944f2838d7 Mon Sep 17 00:00:00 2001 From: tommelt Date: Mon, 4 Nov 2024 15:57:54 +0000 Subject: [PATCH 8/8] docs: add more description to the NN configuration --- README.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 0e499e7e01..adc5a19f18 100644 --- a/README.md +++ b/README.md @@ -78,10 +78,15 @@ FTORCH_LIB := $HOME/FTorch/bin/ftorch_intel ### Setting up case details We can now run `./case.setup` from within the case directory. Once this has been done then edit the generated `user_nl_cam` in -the case directory as required. Add the following lines: +the case directory as required. 
+> [!NOTE] +> The following settings are provided as an example. These should be tailored to your particular experiment. For more +> information, please see the descriptions below. + +To run CAM with the neural net (NN) predicting the gravity waves that advance the simulation, and the original physics-based scheme run alongside as a _piggyback_ comparison, use the following settings: ```fortran -gw_convect_dp_ml=.false. +gw_convect_dp_ml=.true. gw_convect_dp_ml_compare=.true. gw_convect_dp_ml_net_path='/path/to/neural/net' gw_convect_dp_ml_norms='/path/to/norms'
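The namelist block in patch 8/8 can also be written from the shell. This minimal sketch appends the same four settings (ML scheme active, original scheme as a piggyback comparison) to a case's `user_nl_cam`. It rejects relative paths, because the README describes both inputs as absolute filepaths, and it refuses to append a second copy if `gw_convect_dp_ml*` settings are already present. The helper name `write_gw_ml_namelist` is hypothetical, and the logical flags should be tailored to the experiment, as the patch itself notes.

```bash
# Sketch: append the ML gravity-wave settings from the patch 8/8 example
# to a case's user_nl_cam. write_gw_ml_namelist is a hypothetical helper.
write_gw_ml_namelist() {
  if [ "$#" -ne 3 ]; then
    echo "usage: write_gw_ml_namelist <user_nl_cam> <net_path> <norms_path>" >&2
    return 2
  fi
  local nl_file=$1
  local net_path=$2     # saved neural net (.pt)
  local norms_path=$3   # normalisation weights (NetCDF)
  local p

  # The README describes both inputs as absolute filepaths.
  for p in "$net_path" "$norms_path"; do
    case "$p" in
      /*) ;;
      *) echo "not an absolute path: $p" >&2; return 1 ;;
    esac
  done

  # Refuse to append a second copy of the gw_convect_dp_ml* settings.
  if [ -f "$nl_file" ] && grep -q '^[[:space:]]*gw_convect_dp_ml' "$nl_file"; then
    echo "$nl_file already sets gw_convect_dp_ml*; edit it by hand" >&2
    return 1
  fi

  # Same values as the patch example: ML scheme active, original scheme piggybacks.
  cat >> "$nl_file" <<EOF
gw_convect_dp_ml=.true.
gw_convect_dp_ml_compare=.true.
gw_convect_dp_ml_net_path='${net_path}'
gw_convect_dp_ml_norms='${norms_path}'
EOF
}
```

For example, from the case directory after `./case.setup`: `write_gw_ml_namelist user_nl_cam /path/to/neural/net /path/to/norms`.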