The MPI we use in the distributed tests is not CUDA-aware #3897
Comments
Preferences are loaded by the
In general, you shouldn't set any preference when running on the Caltech clusters because everything is set for you by the module system.
This happened: #3838
And the CUDA runtime wasn't found in that PR: https://buildkite.com/clima/oceananigans-distributed/builds/4038#0192c76f-d6ea-4e48-a7fd-f1b22df9f89f/189-1063 so we just need to look at the PR before that... PS @Sbozzolo, we realized there was a problem with the way we ran the tests that allowed the GPU tests to pass even if they didn't run on the GPU
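A guard of the following kind is one way to make a job fail, rather than pass, when no GPU is visible. It is only a sketch: it assumes the CI machine exposes `nvidia-smi`, and the function name `require_gpu` is hypothetical, not something taken from the Oceananigans pipeline.

```shell
# Fail loudly when no NVIDIA GPU is visible, instead of letting a GPU
# test job continue and still report success.
require_gpu() {
    # nvidia-smi missing usually means no driver on this node.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: refusing to run GPU tests" >&2
        return 1
    fi
    # `nvidia-smi -L` prints one "GPU <n>: ..." line per visible device.
    if ! nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
        echo "no GPU listed by nvidia-smi: refusing to run GPU tests" >&2
        return 1
    fi
}

# Hypothetical use in a CI step:
#   require_gpu && julia -O0 --project -e 'using Pkg; Pkg.test()'
```

This only checks that a device is visible to the job; it does not prove that the tests themselves allocate on the GPU.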
Ok, I think this is the problematic PR: #3783
@Sbozzolo, am I reading this right that ClimaAtmos does not (always) use climacommon? EDIT: I suspect these are unused in favor of https://github.com/CliMA/ClimaAtmos.jl/blob/main/.buildkite/pipeline.yml
There is something a little odd that we are using
But ClimaAtmos is on
I can try reintegrating a Manifest using Julia 1.11 in #3880 to see if it makes a difference
We don't support 1.11 yet, though, so it's not a long-term solution...
Hmmm, ok, I guess we probably have to revert to a previous version of climacommon
That's a different machine.
If you don't support 1.11, you should stay on _10_08.
I think we are also hitting this problem: JuliaParallel/MPI.jl#715. julia -O0 --project -e 'using Pkg; Pkg.instantiate()' but then it loads a completely different MPI in the julia -O0 --project -e 'using Pkg; Pkg.test()' step
Nice observation
Somewhere between this commit
https://buildkite.com/clima/oceananigans-distributed/builds/3113#01917ace-fe81-401d-ba21-467037e6aead
and main, we switched from using libmpitrampoline.so in the distributed tests to libmpi.so downloaded from the artifacts.
Previously, the MPI trampoline was loading a CUDA-aware implementation of Open MPI, while the libmpi.so we use now is a non-CUDA-aware MPICH implementation:
https://buildkite.com/clima/oceananigans-distributed/builds/4227#0192f70a-b947-4d38-bd1c-c2497a964de9
This makes our GPU distributed tests fail.
I am wondering where this switch happened because I couldn't trace any changes to the code. @Sbozzolo, do you know if something changed in the LocalPreferences.toml on the Caltech cluster?
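To see which MPI binary a given LocalPreferences.toml selects, a small shell helper along these lines could be used. It is a sketch under stated assumptions: the file is expected to contain an [MPIPreferences] table with the `binary`, `abi`, and `libmpi` keys that MPIPreferences.jl writes, the helper names `mpi_pref` and `mpi_pref_summary` are made up for illustration, and the path in the example is a placeholder.

```shell
# Print one key from the [MPIPreferences] table of a LocalPreferences.toml.
# Assumes flat `key = "value"` lines; this is not a full TOML parser.
mpi_pref() {
    awk -v key="$2" '
        # Track whether the current table header is [MPIPreferences].
        /^\[/ { in_table = ($0 ~ /^\[MPIPreferences\]/); next }
        in_table {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim the key
            gsub(/^[ \t"]+|[ \t"]+$/, "", v)   # trim the value and its quotes
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Summarize the settings that decide which MPI library gets loaded.
mpi_pref_summary() {
    local key
    for key in binary abi libmpi; do
        printf '%s=%s\n' "$key" "$(mpi_pref "$1" "$key")"
    done
}

# Example (placeholder path):
#   mpi_pref_summary ./LocalPreferences.toml
```

Running the summary in both the instantiate step and the test step would show whether the two steps read the same preferences. Note that preferences can also come from other locations on the Julia load path, so an empty result for the project file does not by itself mean that no preference is set.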