[wip] aurora config, clean up #6916
base: master
this commit is running on sunspot
i should not have committed the changes turning off compose and mam here; will remove them once this PR is finalized.
Looks good. A couple of points:
if (Kokkos_ENABLE_SYCL)
#enable_language(SYCL)
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE)
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there?
You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)
@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "")
 set(BUILD_HOMME_PESE OFF CACHE BOOL "")
 set(BUILD_HOMME_SWIM OFF CACHE BOOL "")
 set(BUILD_HOMME_PRIM OFF CACHE BOOL "")
-set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
+#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
+set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "")
Just noting here this is what needs to be removed from this PR.
#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?
i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.
i did not understand the EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master; its kokkos points not to e3sm kokkos, but to the kokkos develop branch. this is another issue to sort out (but i would not know how).
Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383 @bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM? If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch. More generally, testing across machines will help to flush out issues, if there are any, including any with ekat.
there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks.
As an example, here is one commit I'm either confused about or is truly missing from your branch: E3SM-Project/EKAT@5804bfa. This commit was merged in late Nov 2024, which is why I'm thinking there's a fair bit missing. Edit: I think EKAT PR 347 was the last merged into your branch. PRs 349-356 are missing. (357-359 are in EKAT master but not used yet in E3SM, and 348 was closed without merging.) Thus, what I wrote above is probably the correct solution: Rebase your branch on EKAT master.
Looks good to me. I have a couple of questions/remarks, but nothing too important. I'm going to approve, modulo the fix of EKAT submodule that you are already working on.
string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O2")
string(APPEND CMAKE_Fortran_FLAGS_DEBUG " -O0 -g -fpe0")

#adding -g here leads to linker internal errors
And yet I see -g below. Is this comment outdated, and is linking fine now? If so, we should remove it to avoid confusion.
"just -g" is in debug flags only, and i haven't tried to build debug. so i only changed release flags.
i was able to compile and to run with "just -g" on sunspot, however, the folder size for the e3sm build grows from, say, 4 GB to 140 GB with full -g. last i tried "just -g" on aurora, there were linking issues. so this is all WIP. i'd say keep this as is for now and sort it out later.
What I meant is that the comment is confusing: it makes it sound like we should not use -g b/c of link errors, but we do use it across the board. Maybe the comment needs to be clarified. Something like "WARNING: the -g flag seems to have a wild behavior sometimes, with side effects ranging from VERY large binaries to link errors. If you experience errors, remove it".
thanks, done
@@ -426,6 +426,10 @@ do_remap_fwd()
const int team_size = std::min(256, std::min(128*m_num_phys_cols,32*(concurrency/this->m_num_fields+31)/32));
#endif

#ifdef KOKKOS_ENABLE_SYCL
const int team_size = 4;
I am puzzled by this. Do we really want just 4 on sycl? It's staggeringly different from other GPU platforms' default...
the change is from Nov 2023. i looked at my notes and did not see why it was needed. i will try to remove it.
This is for the P-D remapper (not the GllFvPhys one), so its relevance is somewhat limited (doesn't run at runtime, usually). So not a huge deal. But if things work without, no point in limiting team size to 4.
used hip TS calculation and it worked at least for ne4 run.
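For a sense of scale, the team-size expression visible in the diff context above can be evaluated for a few inputs and compared with the fixed value 4. This is only a minimal sketch: the function name team_size_from_diff, the parameter names, and any inputs fed to it are hypothetical. It assumes the expression is plain int arithmetic exactly as written in the diff, and it is not necessarily the calculation that ended up in the PR.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical free-function version of the expression shown in the diff context.
// The class members (m_num_phys_cols, m_num_fields) and `concurrency` are replaced
// by plain int parameters; the arithmetic is copied as written.
int team_size_from_diff(int num_phys_cols, int num_fields, int concurrency) {
  return std::min(256,
                  std::min(128 * num_phys_cols,
                           32 * (concurrency / num_fields + 31) / 32));
}

// The fixed value that the SYCL branch in the diff used.
constexpr int sycl_fixed_team_size = 4;
```

With these assumptions, a call such as team_size_from_diff(1, 10, 10000) is capped by the 128*num_phys_cols term, which is well above the fixed SYCL value discussed in the comments.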
#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
#=======
find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib)
While at it, do you mind adding ${NetCDF_C_PATH}/lib64 to the hints? Since lib64 is quite common, we should be proactive and add it to the list of hints (e.g., my laptop and workstation use lib64).
thanks, done
Looks like you need to merge ekat master into your branch one more time to get the macro name fix.
ran this commit with kokkos 4.5.01 tag
Looks great.
Note that SCREAM nightlies run on master, not next, so I always run e3sm_scream_v1_medres on Frontier, Chrysalis, and pm-gpu before merging a complex PR to avoid having to fix master later. I recommend it to others, as well.
I added Conrad to the reviewers since he's working on upgrading Kokkos for multiple reasons, including Aurora readiness.
CI is failing with a checkout on the ekat submodule.
Looking at the branch, https://github.com/E3SM-Project/EKAT/commits/oksanaguba/spot/, it's unclear where 1eae1a is coming from. @oksanaguba I suggest checking the submodule state in this PR w.r.t. your ekat branch.
@ambrad my ekat branch had to use kokkos main repo. if CI somehow does not know to add a remote (the default remote in ekat is pointing to e3sm kokkos (whatever it is)), then it would not be able to fetch my commit.
I'm sure I'm confused, but on your ekat branch I see this as the Kokkos submodule: https://github.com/E3SM-Project/kokkos/tree/e6e5c4598d16d756db62225ab7f937ee833bd660. This appears to be in E3SM-Project/kokkos.
Re: commit 1eae1a, I don't see where it's coming from. I don't see it in your EKAT spot branch, but I could easily be mistaken.
Re: Kokkos: Ok, I'm guessing what's happening is you have your local submodule pointed to the Kokkos repo with this commit, but, as you wrote, the submodule metadata points to E3SM-Project/Kokkos. Generally, we have to put some extra commits on top of a Kokkos version, which is why E3SM-Project/Kokkos exists. Thus, I suggest you start by creating a branch in E3SM-Project/Kokkos with the desired Kokkos state and then use this in the submodule. That will let testing proceed.
i'll ask Conrad as he prob already has a branch
i pointed the branch to an e3sm kokkos branch, but haven't had a chance to see if CI is happier. also, will do more testing tomorrow.
08eb2de to 7cd2802
@ambrad the CI tests fail b/c we have disabled deprecated code in kokkos. The error we get is
I saw you added those lines, so maybe you can figure out what the fix is? I am not very familiar with the atomic operations in kk... But since deprecated code will likely go away with kokkos 5 in 6 months, we should be proactive and fix this now. Edit: I can take a look too, of course, but if you already know what the right thing to do is, that may be easier. If you'd rather I look into that, that's fine too.
find_library(NETCDF_C netcdf HINTS ${NetCDF_C_PATH}/lib)

find_library(NETCDF_C netcdf HINTS ${NETCDF_C_PATH}/lib)
find_library(NETCDF_C netcdf HINTS ${NETCDF_C_PATH}/lib64)
target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share Kokkos::kokkos)
target_include_directories(scream_rrtmgp_yakl PUBLIC
${CMAKE_CURRENT_SOURCE_DIR})
target_include_directories(scream_rrtmgp_yakl SYSTEM PUBLIC
${NetCDF_C_PATH}/include
${NETCDF_C_PATH}/include
Not sure why this broke the gh/ci tests, but maybe the machine configs need to have consistent ENV VARS ..
-- Configuring done (5.4s)
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
NETCDF_C
linked by target "scream_rrtmgp_yakl" in directory /__w/E3SM/E3SM/components/eamxx/src/physics/rrtmgp
@bartgol any guess why gh/ci freaked out?
ghci-oci machine entry has:
<environment_variables mpilib="!mpi-serial">
<env name="NETCDF_PATH">/usr/local/packages</env>
<env name="PNETCDF_PATH">/usr/local/packages</env>
<env name="HDF5_ROOT">/usr/local/packages</env>
<env name="PATH">/usr/local/packages/bin:$ENV{PATH}</env>
<env name="LD_LIBRARY_PATH">/usr/local/packages/lib</env>
</environment_variables>
My recommendation would be to guard the additions needed for sycl inside an if-else for sycl; otherwise, this may lead to problems all over the place...
I know how to fix this error. What ekat branch should I make a commit to?
@bartgol If you have a branch going, the fix is this:
becomes
Great. Yes, I can take care of it.
set(NETCDF_PATH "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
set(NETCDF_DIR "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
set(NETCDF_C_PATH "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
set(NETCDF_C "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
Feels like this belongs to the machine entry, not here, e.g.,
<environment_variables>
<env name="NETCDF_PATH">/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002</env>
If you want to run test-all-scream, I think you do need this machine file. Which speaks to a broader issue: how to make test-all-scream use config_machines for this stuff?
I am talking about the path only, i.e., this change for the first line:
-set(NETCDF_PATH "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
-set(NETCDF_DIR "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
-set(NETCDF_C_PATH "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
-set(NETCDF_C "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
+set(NETCDF_PATH "$ENV{NETCDF_PATH}")
+set(NETCDF_DIR "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
+set(NETCDF_C_PATH "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
+set(NETCDF_C "/lus/flare/projects/E3SM_Dec/soft/netcdf/4.9.2c-4.6.1f/oneapi.eng.2024.07.30.002")
and then repeat for others by setting the paths in the xml and just get the env var here.
What's with NetCDF_C_PATH --> NETCDF_C_PATH though?
Actually, it's fine to use these cmake files for test-all-scream. But our CIME runs still end up using these files, which is something I'd like to stop.
@jgfouca this PR causes an error in rrtmgp testing, due to the pool allocator running out of space:
rrtmgp_tests: /home/runner/_work/E3SM/E3SM/components/eamxx/../eam/src/physics/rrtmgp/external/cpp/rrtmgp_conversion.h:655: static T* conv::MemPoolSingleton<RealT, DeviceT>::alloc_raw(int64_t) [with T = double; RealT = double; DeviceT = Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>; int64_t = long int]: Assertion `s_curr_used <= s_mem.size()' failed.
I used gdb, and indeed we do run out of mem. But this PR does not touch rrtmgp source code (except for adding a template keyword somewhere, which is correct). Do you have any idea why things may have happened here?
The CI failure on CUDA for the DBG build is not one that we love to see...
I'll re-trigger, hoping it was just a fluke. Based on previous experience, however, I'm preparing for an annoying fight... The fails for the CUDA OPT are puzzling. Besides the rrtmgp error (see comment above for Jim), the baseline_cmp are a bit of a surprise. They don't show a clear pattern (like "ah, it's rrtmgp the culprit"), so I'm a bit worried some changes (ekat? kokkos?) may be non-bfb.
Finally, the gh/ci runs fail due to a config error:
Maybe @mahf708 has something to say about this, since he maintains that workflow...
@bartgol re: rrtmgp, note this branch turns on the Kokkos version of RRTMGP. I pinged Jim in this convo some days ago about a failure on Chrysalis that might be related to what you noted. Re: CUDA, this branch built and ran on PM-GPU on Monday, so the ICE you see is likely version dependent. Did you see what source file triggers the ICE?
Re gh/ci tests, I am not convinced the cmake changes in rrtmgp are appropriate; @bartgol could you take a look at my comment here #6916 (comment)? It's likely we need to add appropriate env vars elsewhere, but my question is, why introduce new env vars here to begin with?
@bartgol ok, it must be this:
Could happen in other files, too, but we don't know since the build dies pretty early. How can we get on this particular compute resource and debug this? On pm-gpu, cudatoolkit/12.2 is used. Any idea what's used on these runners?
The ci testing uses a container image that we build. I'm not in front of my computer now (writing from phone) so I can't check, but I think I picked a version of GCC and CUDA that would be very close to what's on pm-gpu (may not be exactly the same since I had to pick what spack had). I am off tomorrow, but if there's no urgency, I can definitely hop on Blake on Monday, run the image, and try to bisect what part of that file makes CUDA unhappy.
@ambrad it is not just the cuda version, probably the target arch too (the containers target 90, but we target 80 on pm-gpu)
If you download the zip files, you can find the full cmake cache; here it is for reference. It appears it is cudatoolkit 12.1. CMakeCache.txt
@bartgol it's not urgent. I think next week I'll make a draft PR that isn't intended to be merged so that I can run the CI as much as I want. I have an idea for how I can aggressively change that file to make many small kernels, only one of which is used at runtime. I.e., I'll turn runtime options into template parameters. My guess is that will work around the problem. Edit: I just noticed the debug build has this error but not the opt build. I might be able to reproduce that on PM-GPU. Edit: No, a _D test built without a problem.
@bartgol the latest ekat update is breaking runs. I think I wasn't clear in my explanation. Let me try again. Suppose we had this:
if (ko::atomic_compare_exchange_strong(a, b, c)) ...
Transform this to:
if (b == ko::atomic_compare_exchange(a, b, c)) ...
For example,
if ( ! Kokkos::atomic_compare_exchange_strong(&_open_ws_slots(ws_idx), (flag_type) 0, (flag_type) 1)) {
becomes
if ( ! ((flag_type) 0 == Kokkos::atomic_compare_exchange(&_open_ws_slots(ws_idx), (flag_type) 0, (flag_type) 1))) {
Right now the commit shows, instead,
if ( ! Kokkos::atomic_compare_exchange(&_open_ws_slots(ws_idx), (flag_type) 0, (flag_type) 1))
which looks like it just removes the _strong suffix.
There should be a unit test for the workspace manager. Related to that, do EKAT unit tests still get run in our nightlies? I thought that at one time they did, but now I'm not finding them on the dashboard.
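The difference between the two return conventions can be sketched without Kokkos. The following is a minimal sketch that uses std::atomic as a stand-in; the helper names compare_exchange_returning_old, try_claim_slot, and try_claim_slot_buggy are hypothetical and are not EKAT or Kokkos code. It assumes, as the explanation above implies, that the non-deprecated call returns the previous value rather than a bool.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a value-returning compare-exchange: returns the value
// that was stored in `dest` before the call, whether or not the swap happened.
template <typename T>
T compare_exchange_returning_old(std::atomic<T>& dest, T compare, T value) {
  T expected = compare;
  // On success `expected` still equals `compare`, which is the old value.
  // On failure std::atomic writes the current value of `dest` into `expected`.
  dest.compare_exchange_strong(expected, value);
  return expected;
}

using flag_type = int;

// Intended form: the swap succeeded exactly when the returned old value
// equals the value we compared against (0 means the slot was free).
bool try_claim_slot(std::atomic<flag_type>& slot) {
  return flag_type(0) ==
         compare_exchange_returning_old(slot, flag_type(0), flag_type(1));
}

// Mistaken form: the returned old value is used directly as a bool, so a free
// slot (old value 0) reads as failure and a taken slot (old value 1) as success.
bool try_claim_slot_buggy(std::atomic<flag_type>& slot) {
  return compare_exchange_returning_old(slot, flag_type(0), flag_type(1));
}
```

In this sketch the buggy form reports the opposite of what happened for 0/1 flags, which is consistent with runs breaking after the suffix was dropped without adding the comparison.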
Ah yes, I completely misunderstood then. It's odd though, the tests passed when I ran in the CUDA container. Anyhow, I'll fix that too if it's still there on Monday.
Good question. I don't remember what I did with ekat testing on our new CI. You might be right that it ended up getting disabled.
4d307ca to d8a950f
@bartgol, i can't explain why this PR would cause the pool allocator to run out. I will look at why EKAT tests stopped running.
It looks like the transition from jenkins-based nightlies to github actions caused the EKAT testing to be lost. I will work to re-enable it.