[redgreengpu] CRAY_ACC_ERROR: host region overlaps present region but is not contained for 'pgp3a(:,:,:,:)' #134
Hi @okkevaneck - we have seen these errors before. I have just tested redgreengpu on LUMI-G and I am able to run under my own build/run framework. So the question is, what is different about yours. I'll look into it. By the way, ecKit and FCKit are not dependencies of ecTrans so you don't need to build those. More generally, so everyone is on the same page, let me summarise the current support of AMD GPUs with ecTrans:
|
Hi @samhatfield, thank you for the quick reply! Good to know Also many thanks for the overview of the current state. |
I wasn't able to follow your build instructions completely successfully. I get the interactive node with
(is this wrong?) Then I execute
The build finishes, but when I look at src/build/ectrans.log, I see
It should be
Is there something I'm missing? |
I allocate the node slightly differently and SSH onto the compute node; maybe that's what's causing the difference. To allocate a node, I run:
#!/usr/bin/env bash
JOB_NAME="ia_gpu_dev"
GPUS_PER_NODE=8
NODES=1
NTASKS=8
PARTITION="dev-g"
ACCOUNT="project_465000454"
TIME="01:00:00"
# Allocate interactive node with the set variables above.
salloc \
--gpus-per-node=$GPUS_PER_NODE \
--exclusive \
--nodes=$NODES \
--ntasks=$NTASKS \
--partition=$PARTITION \
--account=$ACCOUNT \
--time=$TIME \
--mem=0 \
--job-name=$JOB_NAME
Then to get onto the compute node, I execute the following from a login node:
And then I execute the script without any SLURM command, as we're already on the compute node:
I forgot about the |
Will give it a go, thanks! I'm waiting quite long today to get allocated a node. |
Now I see
which is good. I still found it difficult to get an interactive session on a compute node:
Instead I ran
Now I've successfully built the binary. And I think I've found the cause of the problem. Could you try running without In my setup, I get the exact same error as you when I include For now, if you just want to benchmark ecTrans, you can leave this option off. In the mean time I'll try to find the cause of this bug. |
Hmm interesting, I wonder why the interactive node works for me.. I tried running without |
Great to hear it works. I'm figuring out how we might fix this so we can run with any NPROMA. Let's keep this issue open until we decide how to proceed. With the benchmark program the problem size in both spectral and grid point space can be set by a single parameter By default the benchmark driver will use an octahedral grid for grid point space with a cubic-accuracy representation of waves, which basically means the number of latitudes must be 2 * (truncation + 1). |
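To make the relation above concrete, here is a minimal shell sketch that computes the number of latitudes from a truncation, assuming only the stated rule for the cubic octahedral grid (latitudes = 2 * (truncation + 1)). The function name nlat_for_truncation is hypothetical and not part of ecTrans or its benchmark driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ecTrans: number of latitudes for a
# cubic octahedral grid, using latitudes = 2 * (truncation + 1).
nlat_for_truncation() {
    local truncation="$1"
    echo $(( 2 * (truncation + 1) ))
}

# Example: print the latitude count for a few truncations.
for t in 79 159 399; do
    echo "truncation=$t -> latitudes=$(nlat_for_truncation "$t")"
done
```

For example, a truncation of 79 gives 160 latitudes under this rule.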
Ah that's how it works! |
I've compiled and installed the redgreengpu branch on LUMI-G and ran the ectrans-benchmark-gpu-dp binary. This unfortunately resulted in the following error message:
I'm clueless as to what the problem may be, so I've also included my installation setup as a tar.gz for anyone to try: ectrans_dwarf.tar.gz
Simply acquire an interactive LUMI-G compute node and execute ./install_redgreengpu.sh. This will clone, build, and install all required sources.
Afterwards, go to a login node and cd into the run directory. Then sbatch the run_sbatch_lumi-g.sh script to get the error output in the err.<slurm_job_id>.0 file within the results/sbatch/ folder.