spot jobs not restarted after spot machine preempted. #6641

Open
gwolski opened this issue Jan 13, 2025 · 0 comments

gwolski commented Jan 13, 2025

This is somewhat a clone of #5731, except the workaround there does not work for me.

ParallelCluster 3.11.1. It also happens with 3.9.1, but my debugging work is on 3.11.1.

It is easy to reproduce: create a cluster that uses spot machines, submit multiple jobs to those spot machines with "sleep 360000", and wait for a machine to be preempted. The job ends up with (BadConstraints) and never restarts. A minimal submission script is sketched below.
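
For concreteness, a submission script along these lines is enough to trigger it. This is a minimal sketch: the file name and options are illustrative, and the partition name is taken from the logs below.

$ cat spotrestart_sbatch.sh
#!/bin/bash
#SBATCH --job-name=spotrestart
#SBATCH --partition=sp-8-gb-1-cores   # spot partition; a matching od- partition exists for on-demand
#SBATCH --ntasks=1
#SBATCH --requeue                     # let Slurm requeue the job when the node is lost

# Long-running placeholder workload so the node gets preempted mid-job
sleep 360000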

Here are the interesting sections of slurmctld.log; the complete slurmctld.log is attached along with all the other normally requested logs.

The job starts up fine:

[2024-12-28T12:22:21.369] _slurm_rpc_submit_batch_job: JobId=11 InitPrio=1 usec=378
[2024-12-28T12:22:22.001] sched: Allocate JobId=11 NodeList=sp-r7a-m-dy-sp-8-gb-1-cores-1 #CPUs=1 Partition=sp-8-gb-1-cores
[2024-12-28T12:22:27.000] POWER: no more nodes to resume for job JobId=11
[2024-12-28T12:22:27.001] POWER: power_save: waking nodes sp-r7a-m-dy-sp-8-gb-1-cores-1

[2024-12-28T12:25:14.422] Node sp-r7a-m-dy-sp-8-gb-1-cores-1 rebooted 136 secs ago
[2024-12-28T12:25:14.422] Node sp-r7a-m-dy-sp-8-gb-1-cores-1 now responding
[2024-12-28T12:25:14.422] POWER: Node sp-r7a-m-dy-sp-8-gb-1-cores-1/sp-r7a-m-dy-sp-8-gb-1-cores-1/10.6.6.31 powered up with instance_id=, instance_type=
[2024-12-28T12:25:30.000] job_time_limit: Configuration for JobId=11 complete
[2024-12-28T12:25:30.000] Resetting JobId=11 start time for node power up

Then the machine is taken away and the job never restarts, reporting "Requested node configuration is not available" - even though there are many such nodes and other jobs start just fine:

[2024-12-29T04:33:44.239] error: slurm_receive_msg [10.6.6.31:53650]: Zero Bytes were transmitted or received
[2024-12-29T04:33:44.241] error: slurm_receive_msg [10.6.6.31:56884]: Zero Bytes were transmitted or received
[2024-12-29T04:34:01.243] error: slurm_receive_msg [10.6.6.31:46926]: Zero Bytes were transmitted or received
[2024-12-29T04:34:18.246] error: slurm_receive_msg [10.6.6.31:45492]: Zero Bytes were transmitted or received
[2024-12-29T04:34:22.085] update_node: node sp-r7a-m-dy-sp-8-gb-1-cores-1 reason set to: Scheduler health check failed
[2024-12-29T04:34:22.085] requeue job JobId=11 due to failure of node sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:34:22.103] powering down node sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:34:22.139] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:34:22.139] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:34:22.139] error: unable to split forward hostlist
[2024-12-29T04:34:22.139] error: _thread_per_group_rpc: no ret_list given
[2024-12-29T04:34:35.248] error: slurm_receive_msg [10.6.6.31:50098]: Zero Bytes were transmitted or received
[2024-12-29T04:34:46.242] error: slurm_receive_msg [10.6.6.31:60004]: Zero Bytes were transmitted or received
[2024-12-29T04:34:52.251] error: slurm_receive_msg [10.6.6.31:60332]: Zero Bytes were transmitted or received
[2024-12-29T04:35:09.253] error: slurm_receive_msg [10.6.6.31:53798]: Zero Bytes were transmitted or received
[2024-12-29T04:35:16.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:35:16.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:36:17.004] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:36:17.004] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.002] Resending TERMINATE_JOB request JobId=11 Nodelist=sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:37:16.006] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:37:16.006] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:37:16.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.007] error: unable to split forward hostlist
[2024-12-29T04:37:16.007] error: _thread_per_group_rpc: no ret_list given
[2024-12-29T04:38:16.001] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-1 not responding, setting DOWN
[2024-12-29T04:38:16.002] _pick_best_nodes: JobId=11 never runnable in partition sp-8-gb-1-cores
[2024-12-29T04:38:16.002] sched: schedule: JobId=11 non-runnable: Requested node configuration is not available
[2024-12-29T04:38:27.000] POWER: power_save: suspending nodes sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:41:17.000] _pick_best_nodes: JobId=11 never runnable in partition sp-8-gb-1-cores
[2024-12-29T04:41:17.000] sched: schedule: JobId=11 non-runnable: Requested node configuration is not available

I'm attaching all the logs from the head node: [headnode.tar.gz](https://github.com/user-attachments/files/18391720/headnode.tar.gz). I do not understand why parallelcluster/slurm_resume.log stops at 12/28 - maybe because I didn't submit any more jobs on this test cluster. The cluster is still up, and the bad job, JobId=11, is still stuck with BadConstraints:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11 sp-8-gb-1 spotrest  gwolski PD       0:00      1 (BadConstraints)
$ scontrol show job 11
JobId=11 JobName=spotrestart4
   UserId=gwolski(101001) GroupId=tsiusers(100003) MCS_label=N/A
   Priority=0 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=BadConstraints FailedNode=sp-r7a-m-dy-sp-8-gb-1-cores-1 Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2024-12-28T12:22:21 EligibleTime=2024-12-28T12:22:21
   AccrueTime=2024-12-28T12:22:21
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-30T10:47:44 Scheduler=Main
   Partition=sp-8-gb-1-cores AllocNode:Sid=ip-10-6-9-65:1595065
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=sp-r7a-m-dy-sp-8-gb-1-cores-1
   BatchHost=sp-r7a-m-dy-sp-8-gb-1-cores-1
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=7782M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/gwolski/spotrestart4_sbatch.sh
   WorkDir=/users/gwolski
   StdErr=/users/gwolski/spotrestart4-11.err
   StdIn=/dev/null
   StdOut=/users/gwolski/spotrestart4-11.out
   Power=
   TresPerTask=cpu:1

I presently have a workaround script that checks for BadConstraints jobs every minute and requeues them on the on-demand partition, but that is definitely a kludge (the script and a sample cron entry are below):

$ cat requeue_badconstraint_jobs2od.sh
#!/bin/bash

# Script to find all jobs that have BadConstraints as their reason and resubmit them from spot to on-demand.
# Works around a bug where jobs on preempted spot machines are not restarted properly.
# Another workaround is to set --nodes=1, but I want to guarantee the job will finish, so move it to on-demand
# so we're not caught in a spot failure loop.

if [ ! -d /opt/slurm/bin ]; then
  echo "$0: need to run on HeadNode to find /opt/slurm/bin"
  exit 1
fi

# Get the output from squeue and filter jobs with "BadConstraints" in the Reason column
/opt/slurm/bin/squeue --Format=Cluster,Partition,JobID,State,UserName,NumCPUs,MinMemory,Feature,Dependency,Licenses,NodeList:40,Reason -h | \
grep "BadConstraints" | \
while read -r line; do
    # Extract the JobID and Partition from the line
    jobid=$(echo "$line" | awk '{print $3}')
    partition=$(echo "$line" | awk '{print $2}')
    
    # Replace 'sp-' with 'od-' in the partition name
    new_partition=$(echo "$partition" | sed 's/^sp-/od-/')
    
    # Notify root that the job is being moved (empty mail body)
    Mail -s "Requeuing jobid=${jobid} to partition=${new_partition}" root < /dev/null

    # Requeue the job with the updated partition
    /opt/slurm/bin/scontrol update jobid=${jobid} partition=${new_partition}
done
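
I run it from root's crontab on the head node; the path and log file below are illustrative:

$ sudo crontab -l
# Requeue any BadConstraints spot jobs onto on-demand every minute (illustrative paths)
* * * * * /root/requeue_badconstraint_jobs2od.sh >> /var/log/requeue_badconstraint_jobs2od.log 2>&1
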
gwolski added the 3.x label Jan 13, 2025