3.11.0 start up time longer than 3.9.1 #6479
Comments
Hi Guntram, To help us reproduce the issue, can you provide your cluster configuration file with any sensitive information removed? Performance work is being done, but we are not aware of any scaling speed difference between 3.9.1 and 3.11.0. Thank you,
Hello Hanwen,
Hi Guntram, I am not able to reproduce a scaling time difference between 3.11.0 and 3.9.1, so a cluster configuration file would help us reproduce the issue. Thank you,
Hi Hanwen,
I've got nothing definitive. I ran some tests by submitting jobs with srun. I watched the output of squeue and noted the time at which each job went from CONFIGURING to RUNNING. I even had some outliers that confuse me more. Here are the startup times for various instance types:
I even had the m7a.large in 3.11.0 take 7:09 in one attempt. Go figure. I wish things were consistent; I don't understand why there should be such strange variation. If you have any articles, wiki pages, or instructions on how to ensure I get the fastest startup times, I'd appreciate a link.
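For anyone wanting to repeat this kind of measurement without watching squeue by hand, here is a minimal sketch in Python. It assumes `squeue` is on the PATH and polls `squeue -h -j <jobid> -o %T` for the extended job state. The function names, the polling interval, and the job ID in the usage note are illustrative choices, not anything taken from ParallelCluster, and the result is only as precise as the polling interval.

```python
import subprocess
import time
from typing import Iterable, Iterator, Optional, Tuple


def job_state(job_id: str) -> str:
    """Return the extended job state from squeue, or "" once the job has left the queue."""
    out = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True, text=True, check=False,
    ).stdout.strip()
    return out.splitlines()[0] if out else ""


def poll(job_id: str, interval: float = 5.0) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, state) samples until the job is RUNNING or gone."""
    while True:
        state = job_state(job_id)
        if not state:
            return
        yield time.monotonic(), state
        if state == "RUNNING":
            return
        time.sleep(interval)


def configuring_to_running(samples: Iterable[Tuple[float, str]]) -> Optional[float]:
    """Seconds from the first CONFIGURING sample to the first RUNNING sample after it.

    Returns None if the job was never seen in CONFIGURING, or never reached RUNNING.
    """
    start = None
    for ts, state in samples:
        if state == "CONFIGURING" and start is None:
            start = ts
        elif state == "RUNNING" and start is not None:
            return ts - start
    return None


def mmss(seconds: float) -> str:
    """Format a duration in seconds as m:ss, matching the times quoted in this thread."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"
```

Usage would look like `elapsed = configuring_to_running(poll("12345"))`, where `12345` is a placeholder job ID, followed by `mmss(elapsed)` if `elapsed` is not `None`. Keeping the timing logic separate from the squeue call makes it possible to check it against recorded samples.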
@gwolski Thanks for collecting this data. This is very useful. @hanwen-pcluste It would be nice if AWS could somehow break down those times into their constituent parts to diagnose where the differences are.
@gwolski thank you for collecting the startup times.
Cluster config and scontrol show info attached. I went back and reviewed my data. Most of the startup times I show above are from spot machine launches. I have one OnDemand launch of an r7a.medium in 3.11.1 that took 5:52. I have been reviewing the slurmctld.log file, and I think I can parse launch-to-start times out of that file. That is on my list to do once I resolve #6529. I also started a cluster yesterday using just your supported rhel8 x86_64 AMI. The first machine I started up with srun was a spot-based r7a.medium. It took 7:08 to go from CF to RUNNING. (If you find any private data in there that I failed to redact, please let me know so we can delete this attachment and reshare.)
I forgot to mention: my cluster config file is generated from config files I write for https://github.com/aws-samples/aws-eda-slurm-cluster, and the cluster is created by the same. (More anecdotal info: I just had 11 jobs that all requested spot r7a.medium take about 3:58 to boot. Nice.)
With 3.9.1, starting a compute node based on my custom AMI takes 4:11 (four minutes, 11 seconds).
After moving to 3.11.0, the same custom AMI configured with 3.11.0 now takes 4:47 (four minutes, 47 seconds).
These start up times come from starting an m7a.medium.
My users are already complaining.
Is there any "performance work" being done to improve these start up times?
The actual machine is up at around the 3-minute mark IIRC, so it would be nice if we could get under four minutes before the job starts.
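To put the two figures above in perspective, here is a small Python sketch that converts the quoted m:ss times to seconds and computes the difference. The helper name `to_seconds` is just an illustrative choice.

```python
def to_seconds(mmss: str) -> int:
    """Convert an "m:ss" string, as quoted above, to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


old = to_seconds("4:11")   # 3.9.1
new = to_seconds("4:47")   # 3.11.0
delta = new - old
pct = 100 * delta / old
print(f"{delta} s slower ({pct:.1f}% increase)")  # prints "36 s slower (14.3% increase)"
```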