Update to Parthenon with Kokkos 4.4.1 #13
…brryan/kokkos_441
LGTM!
env/bash
Outdated
  PARTITION="venado-gh"
- elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then
+ elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || "$HOSTNAME" =~ ^ve-fe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then
I don't think you can distinguish cpu/gpu from the hostname. You can be on a grace-grace frontend but submit to a grace-hopper backend. The hostname you check on the backend is still the grace-grace one. I think if you checked for hostname in [1-3] or $SLURM_GPUS_ON_NODE > 0, that would always work.
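A minimal sketch of the check suggested above, not the logic that was actually merged. It assumes that frontends numbered 1-3 are the GPU (grace-hopper) ones, that `SLURM_JOB_PARTITION` is only set inside a job, and that every GPU partition name starts with `gpu` (so `gpu_debug` is covered). The function name `venado_node_kind` and its argument order are hypothetical, chosen so the logic can be exercised without being on the cluster.

```bash
# Hypothetical helper: prints "gpu" or "cpu" for a Venado shell.
# Usage on the cluster (values come from the environment):
#   venado_node_kind "$HOSTNAME" "${SLURM_GPUS_ON_NODE:-0}" "${SLURM_JOB_PARTITION:-}"
venado_node_kind() {
    local host="$1" gpus="${2:-0}" part="${3:-}"

    if [ -n "$part" ]; then
        # Inside a Slurm job: trust the allocation, not the hostname,
        # which may still be that of the submitting frontend.
        case "$gpus" in
            ''|*[!0-9]*) gpus=0 ;;   # treat unset or non-numeric as zero
        esac
        if [ "$gpus" -gt 0 ]; then
            echo gpu
            return
        fi
        # Assumption: GPU partitions are named gpu, gpu_debug, ...
        case "$part" in
            gpu*) echo gpu ;;
            *)    echo cpu ;;
        esac
    else
        # On a frontend: assumed numbering, 1-3 are GPU and the rest CPU.
        case "$host" in
            ve-rfe[1-3]|ve-fe[1-3]) echo gpu ;;
            *)                      echo cpu ;;
        esac
    fi
}
```

Checking the Slurm variables before the hostname is what makes a GPU job submitted from a CPU frontend resolve correctly.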
Yeah, you're right, thanks for noticing this. I thought at some point in the past HOSTNAME wasn't defined on venado backends? But it definitely is now, so maybe I'm just misremembering. Yes, I can update this logic to fix this.
This also doesn't work for e.g. SLURM_JOB_PARTITION=gpu_debug (it's only been sneaking through because I always use gpu frontends for gpu backends etc.)
OK, this should be fixed now. I tested it on a cpu frontend, and on a gpu backend via either a cpu frontend or a gpu frontend. I'm not sure I actually have access to the CPU backends to test those.
Background
Runs on Chicoma and Venado are failing at runtime when using multiple GPUs with CUDA-aware MPI. Forrest found that moving to Kokkos 4.4.1 fixes this issue, at least on Venado.
Description of Changes
- env/bash script to support Venado (sort of)
- (constexpr if capture) and warning (unused var) on recent nvcc
Checklist