3.11.0 start up time longer than 3.9.1 #6479

gwolski · 2024-10-17T21:55:42Z

With 3.9.1, starting a compute node from my custom AMI takes 4:11 (four minutes, 11 seconds).
After moving to 3.11.0, the same custom AMI configured for 3.11.0 takes 4:47 (four minutes, 47 seconds).
Both times were measured starting an m7a.medium.

My users are already complaining.

Is there any "performance work" being done to improve these start up times?
The actual machine is up at around 3 minutes IIRC, so it would be nice if we could get under four minutes before the job starts.

hanwen-pcluste · 2024-10-18T18:15:23Z

Hi Guntram,

To help us reproduce the issue, can you provide your cluster configuration file without sensitive information?

Performance work is being done, but we were not aware of any scaling speed difference between 3.9.1 and 3.11.0.

Thank you,
Hanwen

gwolski · 2024-10-18T19:02:53Z

Hello Hanwen,
I would be happy to provide it, but before I do and you dig into my config file, could you run a comparison with the setup you have? I based my times on how long the machine is in the CF state until the job goes to RUNNING, as displayed by squeue. I'd hate for you to dig through my config file without first confirming at your end with your vanilla setups. Let me know what you see.
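The measurement described above (time from CF/CONFIGURING until RUNNING, as shown by squeue) could be automated roughly as follows. This is only a sketch, not the method actually used in this thread: the function names `squeue_state` and `time_configuring`, the one-second polling interval, and the decision to stop when the job leaves the queue are all assumptions made for illustration. It relies on `squeue -h -j <jobid> -o %T`, which prints the job state in extended form.

```python
import subprocess
import time
from typing import Callable, Optional


def squeue_state(job_id: str) -> str:
    """Return the job state in extended form (e.g. CONFIGURING), or '' if squeue lists nothing."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def time_configuring(
    job_id: str,
    get_state: Callable[[str], str] = squeue_state,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 1.0,  # hypothetical polling interval in seconds
) -> Optional[float]:
    """Seconds between first observing CONFIGURING and first observing RUNNING.

    Returns None if the job reaches RUNNING without CONFIGURING ever being
    observed (node already up) or if the job disappears from the queue.
    """
    configuring_since: Optional[float] = None
    while True:
        state = get_state(job_id)
        now = clock()
        if state == "CONFIGURING" and configuring_since is None:
            configuring_since = now
        elif state == "RUNNING":
            return None if configuring_since is None else now - configuring_since
        elif state == "":
            return None
        sleep(interval)
```

Calling `time_configuring("<jobid>")` right after submitting a job would return the elapsed seconds, accurate to roughly the polling interval. The state lookup, clock, and sleep are injectable so the timing logic can be checked without a running cluster.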

hanwen-pcluste · 2024-10-21T18:40:05Z

Hi Guntram,

I am not able to reproduce a scaling time difference between 3.11.0 and 3.9.1, so a cluster configuration file would help us reproduce the issue.

Thank you,
Hanwen

gwolski · 2024-10-22T02:02:45Z

Hi Hanwen,
I will do some more benchmarking on Tuesday and get back to you with the results and the cluster configuration file.
--G

gwolski · 2024-10-24T02:16:33Z

I've got nothing definitive. I ran some tests by submitting jobs with srun. I watched the output of squeue and noted the time at which each job went from CONFIGURING to RUNNING. I even had some outliers that confuse me more. Here are the startup times for various instance types:

| Instance | 3.9.1 (m:ss) | 3.11.0 (m:ss) |
|---|---|---|
| r7i.large | 4:49 | 5:03 |
| m7a.medium | 4:08 | 4:16 |
| m7a.large | 3:17 | 5:03 |
| r7a.xlarge | 3:24 | 4:25 |
| m7a.4xlarge | 6:28 | 5:04 |

I even had the m7a.large in 3.11.0 take 7:09 in one attempt. Go figure. I wish things were consistent; I don't understand why there should be such strange variation.
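To make the spread easier to see, the times above can be converted to seconds and differenced. The sketch below only restates the numbers already posted; the helper names `to_seconds` and `deltas` are made up for illustration, and since each value is a single run, the differences should not be read as a benchmark.

```python
def to_seconds(mss: str) -> int:
    """Convert an 'm:ss' string such as '4:49' to whole seconds."""
    minutes, seconds = mss.split(":")
    return int(minutes) * 60 + int(seconds)


# Startup times as reported above: (3.9.1, 3.11.0), one run each.
TIMES = {
    "r7i.large": ("4:49", "5:03"),
    "m7a.medium": ("4:08", "4:16"),
    "m7a.large": ("3:17", "5:03"),
    "r7a.xlarge": ("3:24", "4:25"),
    "m7a.4xlarge": ("6:28", "5:04"),
}


def deltas(times: dict) -> dict:
    """Seconds added (positive) or saved (negative) going from 3.9.1 to 3.11.0."""
    return {
        name: to_seconds(new) - to_seconds(old)
        for name, (old, new) in times.items()
    }


for name, delta in deltas(TIMES).items():
    print(f"{name:12s} {delta:+4d} s")
```

The differences range from +106 s (m7a.large) to -84 s (m7a.4xlarge), which is consistent with the observation that nothing here is definitive.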

If you have any articles/wiki/instructions on how to ensure I have the fastest startup times, I'd appreciate a link.
Someday, I hope we'll be able to hibernate systems and then revive them so start up times are on the order of seconds (under a minute).

joehellmersNOAA · 2024-10-24T18:13:24Z

@gwolski Thanks for collecting this data. This is very useful. @hanwen-pcluste It would be nice if AWS could somehow break down those times into the constituent parts to diagnose what the differences are.

gmarciani · 2024-11-13T15:55:07Z

@gwolski thank you for collecting the startup times.
Can you please share the cluster config file with private data redacted?

gwolski · 2024-11-13T16:39:50Z

cluster config and scontrol show info attached.

I went back and reviewed my data. Most of the start up times I show above are from spot machine launches. I have one OnDemand launch of an r7a.medium on 3.11.1 that took 5:52.

I have been reviewing the slurmctld.log file and I think I can parse out launch-to-start times from that file. It's on my list to do once I resolve #6529.

I also started a cluster using just your supported rhel8 x86_64 AMI yesterday. The first machine I started up with srun was a spot-based r7a.medium. It took 7:08 to go from CF to RUNNING.
tsi4_config_files.tar.gz

(If you find any private data in there that I failed to redact, please let me know so we can delete this attachment and reshare).

gwolski · 2024-11-13T18:26:41Z

I forgot to mention: my cluster config file is generated from config files I write for https://github.com/aws-samples/aws-eda-slurm-cluster, and the cluster is created by the same tool.

(More anecdotal info: I just had 11 jobs that all requested spot r7a.medium take about 3:58 to boot. Nice.)

gwolski added the 3.x label Oct 17, 2024

gwolski closed this as completed Oct 24, 2024

QuintenSchrevens mentioned this issue Nov 6, 2024

p4d instance not able to run job with pcluster 3.11.1 #6549

Closed

gwolski mentioned this issue Nov 13, 2024

3.11.1 slurmctld core dumps with error message: double free or corruption (!prev) #6529

Open

gwolski reopened this Nov 13, 2024
