NCO mode (run_envir="nco") results in random failures for WE2E tests #652
Comments
Update: I was able to replicate this error without explicitly passing the `run_envir="nco"` option. From what I can tell, the issue is due to some sort of problem that occurs when the same task runs at the same time for two different experiments.
@mkavulich Have you found a solution for this? I think I am encountering this issue in PR #647, where re-running the tests in community mode works but not in NCO mode. Any idea which PR introduced this problem?
@danielabdi-noaa This issue appears to be fixed for some tasks, but I am still seeing some task failures. Doing a bit more digging, it appears as if there are at least two different failure modes currently. The first is a failure with no helpful error message, which seems to resolve just by rewinding and re-submitting the `run_fcst` task. The second appears to be an uncaught failure in the `make_lbcs` task, where one or more files are either not created or accidentally deleted somehow. I saved the output for this one on Hera: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/expt_dirs/fundamental_nco_fix/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR_old_20230314_194720
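(For reference, the experiment workflows here are driven by Rocoto, so the "rewind and re-submit" step mentioned above can be done with the standard Rocoto utilities. The sketch below is illustrative only; the workflow XML/database file names and the cycle timestamp are assumptions that will differ for each experiment directory.)

```
# Illustrative sketch only: file names and the cycle timestamp are placeholders.
cd /path/to/expt_dirs/<experiment_name>

# Rewind the failed run_fcst task for the affected cycle (YYYYMMDDHHMM).
rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -c 201907010000 -t run_fcst

# Re-submit by running another rocotorun sweep, which picks the task up again.
rocotorun -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
```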
@mkavulich The way you described the problem, there seems to be some non-deterministic behaviour, so I won't be surprised if there are additional issues. That is why I was careful with the description of my PR that got merged; I made it clear that it is a partial fix, quoted here again:
Having said that, the fact that undoing the change regarding
@danielabdi-noaa Thank you for your partial fix; I didn't mean to imply that you were responsible for fixing this problem. The issue was automatically closed because your PR referenced this issue, so I re-opened it to clarify that the problem still partially remains.
@mkavulich @danielabdi-noaa FYI, the tests involving verification fail in NCO mode, and I'm fixing that in my PR #695. Not sure if it's directly related to your issues, though.
The NCO sample configuration and NCO WE2E tests were removed in PR #1060. Before their removal, the random failures of the NCO WE2E tests were due to two reasons:
From December 14, 2023, until their removal in PR #1060 on March 27, 2024, the NCO WE2E tests were running as expected. Closing this issue now.
Expected behavior
The `run_envir` capability was included in the `run_WE2E_tests.sh` script (and its replacement, `run_WE2E_tests.py`) in order to be able to force tests to run with either `run_envir=community` or `run_envir=nco`, regardless of what setting was included in the test config yaml file. Calling the WE2E run script with `run_envir=nco` will force all tests to run in NCO mode. Ideally this should not be a problem, as even though the `nco_dirs` directory is shared among the various tests, conflicts should be avoided by running each task in its own subdirectory.

Current behavior
Currently, running several experiments in parallel in nco mode reveals some problems with the system. Tasks seem to fail randomly -- often without a descriptive error message -- and will work upon re-running.
I confirmed that this behavior is random by running the same set of experiments twice, and seeing a completely different set of failures in each run. Running this same set of tests without the `run_envir="nco"` option, or running each task serially so that no two tests were running at the same time, resulted in all successes.

Examples of these failures can be found on Hera in /scratch2/BMC/fv3lam/kavulich/UFS/workdir/nco_tests/expt_dirs
Machines affected
All that I have tested so far (Hera and Jet). I assume this will affect all platforms.
Steps To Reproduce
Run the fundamental suite of WE2E tests in NCO mode using either the shell-based run script (`run_WE2E_tests.sh`) or its Python-based replacement (`run_WE2E_tests.py`).
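As an illustration only, invocations that force NCO mode might look roughly like the sketch below; the test-list file, machine, account, and the exact argument names are assumptions, so consult each script's usage/help output for the real interface.

```
# Illustrative sketch only: argument names and values are assumptions, not the verified interface.

# Shell-based run script (key=value style arguments):
./run_WE2E_tests.sh \
  tests_file="fundamental_tests.txt" \
  machine="hera" \
  account="an_account" \
  run_envir="nco"

# or the Python-based replacement (flag-style arguments):
./run_WE2E_tests.py \
  --tests_file fundamental_tests.txt \
  --machine hera \
  --account an_account \
  --run_envir nco
```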
Note: Due to the error described in #571, the sets of tests run by the two scripts will not be the same on Hera. The error should still occur regardless (though, due to its random/inconsistent nature, it may take a few tries to replicate).