PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449
Comments
Hi Stefan! Thank you for the detailed description. I could reproduce the same issue. The same logs appear in the … I am actively working on this and will keep you updated! Thank you, |
Hi Stefan, ParallelCluster has never explicitly configured Sockets and Cores for Slurm nodes, so Slurm uses its defaults. The difference could be due to Slurm 23.11 changing the way the values for Sockets and Cores are computed. Were you able to confirm that the performance degradation is resolved after setting the expected values for Sockets and Cores in slurm.conf? I would not expect the missing Sockets/Cores configuration to change scheduling behaviour enough to justify such a big regression. Would you be able to extract some logs showing how processes are mapped to the various cores? Also, if you don't mind, could you share the cluster configuration, a potential reproducer, and the full Slurm config from both clusters? You can retrieve it with:
If the Sockets and Cores configuration turns out to be a red herring, here is another potential issue to look into:
Francesco |
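For anyone comparing the Slurm configuration of two clusters as requested above, here is a minimal sketch of one way to diff two configuration dumps. It assumes the dumps are plain `Key = Value` text (the layout that `scontrol show config` prints); the function names and the sample excerpts are hypothetical and are not taken from either cluster in this issue.

```python
def parse_config(text):
    """Parse 'Key = Value' lines into a dict, skipping lines without '='."""
    settings = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep and key.strip():
            settings[key.strip()] = value.strip()
    return settings


def diff_configs(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical excerpts for illustration only (not real dumps from this issue).
OLD_DUMP = """\
SLURM_VERSION = 23.02.7
SelectType = select/cons_tres
SchedulerParameters = bf_continue,defer
"""

NEW_DUMP = """\
SLURM_VERSION = 23.11.10
SelectType = select/cons_tres
SchedulerParameters = bf_continue,defer
"""

if __name__ == "__main__":
    changes = diff_configs(parse_config(OLD_DUMP), parse_config(NEW_DUMP))
    for key, (before, after) in changes.items():
        print(f"{key}: {before} -> {after}")
```

A key that exists on only one side shows up with `None` for the other side, which makes added or removed settings easy to spot.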
Hey @demartinofra - thanks for the reply! For my testing, I did set the following in the PCluster configuration to force the proper configuration:
Which did yield proper configuration via slurmd (HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10):
For our applications, I did note some improvement in performance and recouped a few percent of the ~40% degradation using the proper hardware configuration. So, not quite a red herring, but definitely not the solution either! Regarding the SRSO mitigation - thanks for passing this along. This is news to me and is definitely something I am going to investigate further. From what I can see, HPC6a with the PCluster 3.11 base AMI does have the patch you refer to:
Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable it upon instance startup via PCluster? The patch seemingly can't be removed during post-install procedures because that requires an instance reboot, and once you reboot, Slurm will detect the instance as "down" and will swap it out. I would rather not have to create a custom AMI if there is some other way to test this out. Thanks! |
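For readers who want to check the SRSO mitigation state on a node, here is a minimal sketch. It assumes a Linux kernel that exposes the status in the sysfs file `/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow`; the helper names and the coarse labels are hypothetical, and this is not necessarily the check used in this thread.

```python
from pathlib import Path

# Kernels with SRSO support report the status here (assumed to exist).
SRSO_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow")


def classify_srso(status):
    """Map the kernel's status string to a coarse label."""
    text = status.strip().lower()
    if text.startswith("mitigation"):
        return "mitigation active"
    if text.startswith("not affected"):
        return "not affected"
    if text.startswith("vulnerable"):
        return "vulnerable"
    return "unknown"


def read_srso_status(path=SRSO_SYSFS):
    """Return the raw status string, or None if the file cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_srso_status()
    if raw is None:
        print("SRSO status not exposed by this kernel")
    else:
        print(f"{raw.strip()} -> {classify_srso(raw)}")
```

Running this before and after a change to the kernel boot parameters is one way to confirm whether the mitigation is still active on a compute node.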
If you want to test it real quick one option is to run the following on the compute nodes:
and then reboot them through the scheduler, so that Slurm does not mark the nodes as unhealthy and the reboot is successful:
|
Hi @demartinofra - I ran the commands you suggested to disable the SRSO mitigation and rebooted via Slurm, which resulted in the mitigation being disabled:
I then ran one of our smaller-scale hybrid MPI-openMP jobs and performance was as expected, without the ~40% degradation (I also corrected the HPC6a configuration, which helped performance a little as well). So, it definitely seems like this SRSO mitigation is the culprit for our application slowdowns... and I'll doubly confirm with our larger-scale job. What do you suggest as a more formal workaround for the SRSO mitigation in the PCluster realm? Custom AMI? Something else? When we had performance issues because of the log4j patch, it was a simple |
Hi Stefan, We will work on a Wiki page to describe the mitigation in pcluster realm and let you know when it is done. Thank you Stefan and Francesco for discovering the issue! |
Also, please avoid using 3.11.0 because of the known issue https://github.com/aws/aws-parallelcluster/wiki/(3.11.0)-Job-submission-failure-caused-by-race-condition-in-Pyxis-configuration |
Hi Stefan, We've published the Wiki page "(3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors". Moreover, we've released ParallelCluster 3.11.1. Cheers, |
A follow up on this as we have been finally able to do a lot more testing with newer versions of PCluster. We have disabled SRSO following the guide for PCluster 3.11.1 AMIs, both AL2 and AL2023 OSes. For our large scale hybrid MPI-openMP application that runs on ~200 hpc6a, we still see substantial performance degradation compared to PCluster 3.8.0 even with the SRSO disabled on both OSes. Further, the PCluster 3.8.0 AL2 AMI we currently use in production does ship with the SRSO mitigation enabled; we have never disabled it. Spinning up a hpc6a with the base us-east-2 PCluster 3.8.0 AL2 AMI (ami-03e71395f1580f16e) yields:
So, something else is going on that is causing issues with large-scale applications/jobs. It's worth reiterating - disabling SRSO in PCluster 3.11.1 AMIs DID help return performance to near-normal for a small-scale (2 hpc6a) MPI job, but it wasn't the cure for our job using ~200 hpc6a. Were there any other foundational changes that could cause scaling issues in newer versions of PCluster? |
Another update here as we've continued testing with PCluster 3.12.0. We are still seeing performance degradation at scale with PCluster 3.10+, including 3.12.0 on both AL2 and AL2023. After a lot more digging, we've noticed that the network throughput (EFA traffic) is substantially less in the newer versions. Our latest tests were with the following: Cluster 1:
Cluster 2 (current production environment):
Attaching screenshots of an instance from both test clusters showing network in/out and network packets in/out using 5-min averages (top is cluster 1, bottom is cluster 2). During the main MPI job, test cluster 2 has consistent 5-min average network in/out performance of 115+ Gb, with packets exceeding 30M. In contrast, test cluster 1 has significantly lower 5-min average network in/out performance, varying between 80 and 90 Gb, with packets hovering around 24-26M. Further, the traffic is much more volatile (sawtooth pattern). This performance degradation is consistent across the other instances in the cluster, but for ease of plotting we isolated it down to 1 compute instance from each. I am not sure what could be causing this performance drop and could use some pointers on where to dig next if there are any EFA-related configurations that might have changed. Since it's a pretty large version bump in the EFA installer, I'm sure there are a lot of moving parts that could be the culprit.
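To put the figures above in relative terms, here is a small sketch of the arithmetic. It uses only the numbers quoted in this comment; because the cluster 2 values are given as lower bounds ("115+" and "exceeding 30M"), the computed drops should be read as lower bounds too. The function name is hypothetical.

```python
def relative_drop(baseline, observed):
    """Percentage by which `observed` falls short of `baseline`."""
    return (baseline - observed) / baseline * 100.0


# 5-minute averages quoted above (cluster 2 is the baseline).
THROUGHPUT_BASELINE = 115.0          # Gb, cluster 2 ("115+")
THROUGHPUT_DEGRADED = (80.0, 90.0)   # Gb, cluster 1 range
PACKETS_BASELINE = 30.0              # millions, cluster 2 ("exceeding 30M")
PACKETS_DEGRADED = (24.0, 26.0)      # millions, cluster 1 range

if __name__ == "__main__":
    for observed in THROUGHPUT_DEGRADED:
        drop = relative_drop(THROUGHPUT_BASELINE, observed)
        print(f"throughput {observed:.0f} Gb: {drop:.1f}% below baseline")
    for observed in PACKETS_DEGRADED:
        drop = relative_drop(PACKETS_BASELINE, observed)
        print(f"packets {observed:.0f}M: {drop:.1f}% below baseline")
```

With these inputs the throughput shortfall is roughly 22% to 30% and the packet shortfall roughly 13% to 20%.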
|
Hello,
We have been testing an upgrade from PCluster 3.8.0 to 3.11.0 and, after extensive testing of our applications, noticed some differences that impact performance. We run hybrid MPI-openMP applications on HPC6a.48xlarge instances, and with PCluster 3.10.1 or 3.11.0 all of our applications run ~40% slower than with 3.8.0, using the out-of-the-box PCluster AMIs associated with each version. We tried to narrow down the issue by changing versions of performance-impacting software (such as downgrading the EFA installer to v1.32.0 or v1.33.0), switching how the job is submitted/run in Slurm (Hydra bootstrap and mpiexec vs PMIv2 and srun), and making some other changes, none of which improved the degraded performance.
Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd output from different versions of PCluster follow:
HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when considering NUMA node as socket):
HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:
HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:
lscpu from a HPC6a.48xlarge instance:
Is there some fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process/script that was run in 3.8.0 (e.g. line:
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...
) is either not being run or not running properly. We'd prefer not to hard-code the proper node configuration in the PCluster compute resource YAML, as we dynamically spin up/down clusters and could use different instance types in a given compute resource depending on resource availability. Thanks for any help you can provide!
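As an illustration of the "NUMA node as socket" layout mentioned above, here is a minimal sketch that derives Slurm node parameters from `lscpu` output. The sample `lscpu` excerpt is an assumed, illustrative layout (not the output originally attached to this issue), and the function names are hypothetical. `Sockets`, `CoresPerSocket`, and `ThreadsPerCore` are the standard slurm.conf node parameters.

```python
def parse_lscpu(text):
    """Parse 'Field: value' lines from lscpu output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def slurm_topology(fields, numa_as_socket=True):
    """Derive Sockets/CoresPerSocket/ThreadsPerCore from parsed lscpu fields.

    With numa_as_socket=True each NUMA node is reported as one Slurm socket.
    """
    cpus = int(fields["CPU(s)"])
    threads = int(fields["Thread(s) per core"])
    socket_field = "NUMA node(s)" if numa_as_socket else "Socket(s)"
    sockets = int(fields[socket_field])
    cores_per_socket = cpus // (sockets * threads)
    return {
        "Sockets": sockets,
        "CoresPerSocket": cores_per_socket,
        "ThreadsPerCore": threads,
    }


def as_node_params(topology):
    """Format the topology as a slurm.conf-style fragment."""
    return " ".join(f"{key}={value}" for key, value in topology.items())


# Assumed, illustrative lscpu excerpt for a 96-core, 4-NUMA-node instance.
SAMPLE_LSCPU = """\
CPU(s):              96
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
"""

if __name__ == "__main__":
    fields = parse_lscpu(SAMPLE_LSCPU)
    print(as_node_params(slurm_topology(fields)))
    print(as_node_params(slurm_topology(fields, numa_as_socket=False)))
```

Deriving the values from `lscpu` at node start, rather than hard-coding them per instance type, is one possible way to handle compute resources that mix instance types, though whether that fits a given ParallelCluster setup is untested here.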