Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains #871

Conversation

mkavulich
Copy link
Collaborator

@mkavulich mkavulich commented Jul 28, 2023

DESCRIPTION OF CHANGES:

This represents the final round of WE2E test improvements related to Issue #587, including new tests, additional features, script improvements, and removal of unsupported domains.

Script improvements

run_WE2E_tests.py

  • Adds ability to run a group of all tests in a subdirectory of test_configs
  • Replace --use_cron_to_relaunch with a --launch argument that can take the values "python", "cron", or "none" (the last of which will create the experiments but not run them)

monitor_jobs.py

  • Adds --mode flag, which if specified as "advance" will only run rocotorun once for each experiment, then quit.

generate_FV3LAM_wflow.py and set_FV3nml_sfc_climo_filenames.py

  • Add --debug flag that will give more verbose output. No longer read "VERBOSE" variable from config.yaml for this script.
  • Make most prints "debug only"

get_crontab_contents.py

  • Overhaul script to remove global variables, add arguments as needed
  • Fixes bug where submitting jobs from cron won't work if your crontab is empty (issue A cron job can only be added if a crontab already exists. #876)
  • Change functionality so that the script will remove the specified line from crontab if "--remove" flag is provided, otherwise it prints the crontab contents

New tests

  • Several new custom domain tests are added in a new custom_grids directory. The new custom domains were chosen to span a variety of locations, terrain types, and dates. Aside from custom_ESGgrid_Great_Lakes_snow_8km, these are basic tests that don't use many non-default settings, and so are good candidates for testing new SRW capabilities in the future. See the "Documentation" section for more details.
  • A new "long forecast" test (108 hours) starting at 2023060112 that retrieves FV3GFS grib2 input data from AWS (and so can not be run on Cheyenne).

Additional changes

  • Removes unsupported domains:
    • EMC_AK
    • EMC_HI
    • EMC_PR
    • EMC_GU
    • GSL_HAFSV0.A_25km
    • GSL_HAFSV0.A_13km
    • GSL_HAFSV0.A_3km
    • GSD_HRRR_AK_50km
  • For HPSS tests, ensure they all are explicitly set to look for HPSS data only
  • Test custom_ESGgrid_Great_Lakes_snow_8km (introduced in [develop] Consolidate verification tasks using retrieve_data.py #864) moved from verification to custom_tests, with a symbolic link (test_configs/verification/config.MET_verification_winter_wx.yaml) left behind. Also the ICs/LBCs are now from RAP output, retrieved from HPSS (since the observation data requires HPSS access anyway this is fine)
  • Make new comprehensive test file for Gaea (symlink to Orion's file) since that machine also does not have HPSS access
  • Various comment fixes

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • Removes old non-supported domains.
    • Deprecates --use_cron_to_relaunch argument to run_WE2E_tests.py (gives user a verbose message about replacement flag)
  • This change requires a documentation update
    • Need to document new tests, change of script behavior, removed domains

TESTS CONDUCTED:

The new script behavior was tested for each of the new flags, ensured that they behaved as expected.

  • hera.intel
    • Ran comprehensive tests on Hera after the most recent rebase. Found two failures: GST_release_public_v1 (expected) and nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR (still investigating, but I suspect this is pre-existing)
  • orion.intel
  • cheyenne.intel
    • Ran all new WE2E tests successfully
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)
    • hera.intel

DEPENDENCIES:

This branch currently contains changes from #859 and #864, and so should not be merged until those PRs are approved. Merged!

DOCUMENTATION:

Updated command-line arguments were documented within the scripts. Additional documentation for the updates was provided as part of PR #880.

Here are some details/sample plots for the new custom domains:

New test plot1 plot2
custom_ESGgrid_Central_Asia_3km: A 6-hour forecast starting 2019070100 over central Asia, including much of the Tibetan Plateau and Tarim Basin; this area was chosen for its extreme topography and climate range, with no ocean points. Uses staged FV3GFS nemsio data. ncview pressfc f006 ncview tmp lev1 f006
custom_ESGgrid_Great_Lakes_snow_8km: A 6-hour forecast starting 2023021700 over the North American Great Lakes region. Uses RAP netCDF data retrieved from HPSS, and tests a non-standard resolution (8 km). Tests deterministic verification for snow accumulation (as well as other variables), retrieving observations from HPSS (therefore can only be run on Jet and Hera). ncview frozr f006 ncview tmpsfc f006
custom_ESGgrid_IndianOcean_6km: A 12-hour forecast starting 2019061518 over the central Indian Ocean; this area was chosen for its lack of land points and as a southern hemisphere test. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (6 km). ncview pressfc f012 ncview ugrd10m f012
custom_ESGgrid_NewZealand_3km: A 6-hour forecast starting 2019061518 over New Zealand; this area was chosen for its extreme topography and mix of land/ocean points, as well as spanning the international date line. Uses staged FV3GFS grib2 data. ncview tsnowp f006 ncview uswrf f006
custom_ESGgrid_Peru_12km: A 12-hour forecast starting 2019061500 over western South America, centered over Peru; this area was chosen for its extreme topography and mix of land/ocean/lake points, as well as spanning the equator. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (12 km). ncview spfh2m f012 ncview tmpsfc f012
custom_ESGgrid_SF_1p1km: A 6-hour forecast starting 2019061500 over central coastal California, centered over the San Francisco Bay area. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (1.1 km). ncview spfhmax_max2m f006 ncview tmin_min2m f006

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS:

Thanks to @gsketefian for providing some custom grids for new tests

@mkavulich mkavulich changed the title [develop] Improvements for WE2E tests: script features, additional tests [develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains Jul 28, 2023
…etatasks that do not exist or have been removed by other checks.
 - Remove extraneous empty metatask dependencies from
MET_verification_only_vx test now that these are handled in setup.py
 - Add ability to specify a subdirectory of test_configs to run all
tests in it
 - Move custom grid yamls to separate directory
non-config files are skipped in 'all' and directory tests
…is most expensive at ~60 core hours, rest are 30 or less.
…t can take the values "python", "cron", or "none"

 - Change names in argparse section to shorten lines
 - Update custom grids for more appropriate physics suites
 - Speed up Peru 12km case with more nodes and shorter forecast time
 - Distribute new custom domains to comprehensive and coverage tests
symlink
 - Unless running in debug mode, call "set_template" with "quiet" flag
for more verbose output
 - Also add debug flag for set_FV3nml_sfc_climo_filenames.py
 - Convert most print messages to debug-only for cleaner output
 - Do not read VERBOSE flag for generate script; this flag should only
affect experiments, not initial scripts
…that will only run rocotorun once for each experiment rather than continuously monitoring
@mkavulich mkavulich force-pushed the feature/WE2E_test_improvement_round_3_rebased branch from 20fdb32 to 2b9ac58 Compare August 16, 2023 00:18
@mkavulich mkavulich force-pushed the feature/WE2E_test_improvement_round_3_rebased branch from 567c5ca to 7567188 Compare August 16, 2023 02:25
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR
#nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Should we remove the "nco_" prefix from test names now that tests can be run in either NCO or community mode without having to specify the mode in the experiment config file?

Copy link
Collaborator Author

@mkavulich mkavulich Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think including "nco" in these test names is still necessary, since they are explicitly run in NCO mode unless "community" is specified on the command line (or if it's a a community-mode coverage test). Now, whether or not these tests need to still exist is another question; it doesn't look like any of them are unique, so we could eliminate them altogether so long as all those capabilities are tested as part of the Hera NCO coverage suite. I think it's a good idea to remove the "nco_" tests entirely, so if you agree then I'll go ahead and do that. @MichaelLueken should probably approve of that plan as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take back one minor point: there is one thing that is only tested in the "nco_" tests, and that is using pre-staged grid/orog/sfc_climo data. I can't remember, is that something that should be specific to NCO tasks, or can we roll that into another test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it this morning, I think we should defer this topic to a different PR, especially since @MichaelLueken is on leave this week. I opened an issue #895 to discuss this, feel free to chime in there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich A couple of things:

  1. I see 4 tests that start with "nco_" under ufs-srweather-app/tests/WE2E/test_configs/grids_extrn_mdls_suites_nco. I started checking if there are counterparts of these under grids_extrn_mdls_suites_community to see if the "nco_" tests can be eliminated but decided to not finish because it's a pretty time consuming task (at least a couple of hours). I made an issue at Remove any WE2E tests for NCO mode (names starting with "nco_") that test features that are already tested in community mode #896. Feel free to comment there if I didn't capture the issue correctly.
  2. Use of pre-staged grid/orog/sfc_climo data is not specific to NCO mode, but I'd say it is more important for NCO mode because (at least as of a couple of years ago) EMC (who always seem to run in NCO mode) did not want to have the make_[grid|orog|sfc_climo] tasks as part of their workflow, i.e. they always used pre-staged data. That's why NCO mode always uses pre-staged files (or at least used to; not sure now). But these are also important in community mode because once community-mode users get their grid/orog/sfc_climo files set up, they tend to want to skip those tasks in later experiments.

help='Explicitly set EXPT_BASEDIR for all experiments')
ap.add_argument('--exec_subdir', type=str,
help='Explicitly set EXEC_SUBDIR for all experiments')
ap.add_argument('--use_cron_to_relaunch', action='store_true',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Seeing the cron option reminded me of a bug that I've run into with the run_WE2E_tests.py script, which is that when the user's cron table contains a commented line (job) that is exactly identical to the line that this script would otherwise insert into the table, the script doesn't insert the line (so the experiment doesn't get (re)launched). I guess it's because the script doesn't check that the existing line is commented out and so thinks the line is already in the table. Can you run a quick test to see if that is still happening? Thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Just realized the bug (if it still exists) might be in the script get_crontab_contents.py.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsketefian Just included a fix for that bug as well, feel free to test it out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Ok, thanks. I'll trust you that it works!

@@ -0,0 +1,26 @@
metadata:
description: |-
This test checks the capability of the workflow to retrieve from NOAA
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich Please add in the description that this test is also to check tha the weather model can perform a relatively long forecast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@MichaelLueken
Copy link
Collaborator

@mkavulich - The Jenkins tests failed on August 26 due to an issue with Orion and the custom_ESGgrid_Central_Asia_3km test on Hera Intel. This test failed in the run_fcst task with the following:

FATAL from PE 126: NetCDF: Index exceeds dimension bound: netcdf_read_data_2d: file:INPUT/gfs_data.nc- variable:ps

I'm fairly certain that a rerun would allow the test to successfully run without issue. Before rerunning, I've been told that the EPIC role account now has rstprod permission. Would you be okay with uncommenting the MET_ensemble_verification_only_vx_time_lag test in coverage.hera.gnu.com and the custom_ESGgrid_Great_Lakes_snow_8km test in coverage.jet? I'd like to ensure that the EPIC account is able to pull and process the necessary files now.

@mkavulich and @gsketefian - I agree with @mkavulich, it is certainly a good idea to trim/remove the NCO WE2E tests. However, I don't feel that this should be done in this PR.

@MichaelLueken
Copy link
Collaborator

@mkavulich - It also sounds like the various RDHPCS machines are all using the EPIC role accounts. Would you be able to change the .cicd/Jenkinsfile, replacing jet-epic with jet. If you would rather, I can make the changes to your branch and either open a PR to your branch or push it directly if you make me a temporary collaborator.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken I have made those suggested changes, but they are not appearing here. According to https://www.githubstatus.com/, PRs are currently "degraded" so it seems like there's a problem on Github's end that's preventing the PR from being updated. Hopefully it's resolved soon and you can re-run the tests.

@MichaelLueken MichaelLueken added the run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile label Sep 5, 2023
@MichaelLueken
Copy link
Collaborator

@mkavulich - The changes look good and I have resubmitted the Jenkins tests. Once the retests are complete, this work should be ready to be merged. Thanks!

@MichaelLueken
Copy link
Collaborator

@mkavulich - The reruns of the Hera Intel tests are still seeing failures in the new custom_ESGgrid_Central_Asia_3km test (the last run overnight actually failed in the Functional Unit Tests stage of the pipeline, so it never made it to the actual tests). During yesterday's testing, the MET_ensemble_verification_only_vx_time_lag and custom_ESGgrid_Great_Lakes_snow_8km tests were still failing due to permission issues trying to pull the necessary files from HPSS. I've resubmitted the Jenkins tests on Jet to see if the issue has been corrected. I'll also reach out to the platform tools team on Slack and see when the role.epic account truly acquired rstprod project access.

The good news is that the tests successfully passed on Gaea and Orion. Additionally, outside of the tests noted above, all other tests have successfully passed.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken That is so strange, I've run the custom_ESGgrid_Central_Asia_3km probably a dozen times on Hera/Intel and not seen a failure. Hera is down for maintenance today, I don't suppose you know the details of the failure?

@MichaelLueken
Copy link
Collaborator

@mkavulich - Unfortunately, I'm not aware of the reason for the last failures of the test (though, it is likely due once again to the weird FATAL from PE 126: NetCDF: Index exceeds dimension bound: netcdf_read_data_2d: file:INPUT/gfs_data.nc- variable:ps failure). I've been able to run the custom_ESGgrid_Central_Asia_3km test without issue as well. However, having said that, I've only ever run the test in community mode. The automated Jenkins tests will run this test in NCO mode. The weird NetCDF failures seem to occur when several NCO tests are run simultaneously.

I've also cancelled the Jet testing, since HPSS is down for maintenance as well and the MET_ensemble_verification_only_vx_time_lag would fail due to this.

@mkavulich
Copy link
Collaborator Author

@MichaelLueken Ahh, I somehow forgot that the Intel tests are the ones run in NCO mode. So then this is another fleeting symptom of #652?

@MichaelLueken
Copy link
Collaborator

@mkavulich - Yes, I suspect that this is a continuation of what was being encountered with Issue #652. Now, however, instead of mislinking COM directories, it is now leading to these weird NetCDF failures.

@MichaelLueken
Copy link
Collaborator

I was able to successfully run the Hera Intel tests manually this morning:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              23.29
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.69
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             764.38
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               6.34
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              15.10
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.98
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               7.36
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             230.67
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             297.56
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             328.78
pregen_grid_orog_sfc_climo                                         COMPLETE               8.80
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1713.20

I had to use rocotorewind/boot on the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot test, but the rest successfully passed without needing to be rewound and booted.

I'm now waiting to see if the custom_ESGgrid_Great_Lakes_snow_8km test will successfully run on Jet. If it passes, then we can move forward with this work. If it fails, then the custom_ESGgrid_Great_Lakes_snow_8km and MET_ensemble_verification_only_vx_time_lag tests will once again need to be commented out until the Platform team can see what is happening, since the EPIC role account now has rstprod permission. @mkavulich, I will let you know as the test fails or not so that we can quickly make the necessary changes and merge.

@MichaelLueken
Copy link
Collaborator

@mkavulich -

The custom_ESGgrid_Great_Lakes_snow_8km test is continuing to fail on Jet:

202302170000          get_obs_nohrsc                    56024334                DEAD                 256         1          17.0
202302170000            get_obs_mrms                    56024335                DEAD                 256         1          28.0
202302170000            get_obs_ndas                    56024336                DEAD                 256         1          28.0

The error is:
ERROR: [FATAL] no permission to open HPSS archive file: /NCEPPROD/hpssprod/runhistory/rh2023/202302/20230217/dcom_20230217.tar

Please comment out the two verification tests until this issue can be sorted. Thanks!

@MichaelLueken
Copy link
Collaborator

@mkavulich - I have sent a request to the RDHPCS HPSS Helpdesk to ask that the EPIC role account be granted rstprod permissions on HPSS. Hopefully they are able to add this permission quickly, then I can submit the Hera GNU tests to make sure that the tests that require HPSS data are able to successfully pull the data from HPSS.

@MichaelLueken
Copy link
Collaborator

@mkavulich - Good news! Skylar has added the EPIC role account to the rstprod group on HPSS. Once the current Jet tests finish, then the Hera GNU tests will run. If the MET_ensemble_verification_ony_vx_time_lag test passes, then I should be able to merge this PR in the morning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the file name has .com extension?.. Was the intent to have *.nco file with the list of tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natalie-perlin - The .com extension is to ensure that all tests in the file run in the community environment. Within this file, there is an nco test, nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16. When this test is launched, the nco environment is overwritten so that it will run in community mode.

This is similar to the coverage.hera.intel.nco file. All of the tests in this file are normally run in community mode, but the .nco extension ensures that the tests use the nco environment instead.

There are no plans to include either a coverage.hera.intel.com or coverage.hera.gnu.nco file.

@MichaelLueken
Copy link
Collaborator

@mkavulich -

The Jenkins automated MET_ensemble_verification_only_vx_time_lag test successfully passed :

MET_ensemble_verification_only_vx_time_lag COMPLETE 7.58

Now moving forward with merging this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile
Projects
Status: Done
5 participants