This is somewhat a clone of #5731, except the workaround there does not work for me.
ParallelCluster 3.11.1. It also happens with 3.9.1, but my debugging was done on 3.11.1.
Easy to reproduce: create a cluster that uses spot instances, submit multiple jobs to those spot instances with "sleep 360000", and wait for a machine to be preempted. The job ends up with (BadConstraints) and never restarts.
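Something along these lines reproduces it (a sketch; the partition name is the spot queue from the logs below, substitute your own):
# flood the spot partition with long-sleeping jobs
# (partition name assumed from the logs below; adjust for your cluster)
for i in $(seq 1 4); do
    sbatch --partition=sp-8-gb-1-cores --ntasks=1 --wrap "sleep 360000"
done
# then wait for EC2 to reclaim one of the spot instances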
Here are the interesting sections of slurmctld.log; the complete slurmctld.log is attached along with all the other normally requested logs.
The job starts up fine:
[2024-12-28T12:22:21.369] _slurm_rpc_submit_batch_job: JobId=11 InitPrio=1 usec=378
[2024-12-28T12:22:22.001] sched: Allocate JobId=11 NodeList=sp-r7a-m-dy-sp-8-gb-1-cores-1 #CPUs=1 Partition=sp-8-gb-1-cores
[2024-12-28T12:22:27.000] POWER: no more nodes to resume for job JobId=11
[2024-12-28T12:22:27.001] POWER: power_save: waking nodes sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-28T12:25:14.422] Node sp-r7a-m-dy-sp-8-gb-1-cores-1 rebooted 136 secs ago
[2024-12-28T12:25:14.422] Node sp-r7a-m-dy-sp-8-gb-1-cores-1 now responding
[2024-12-28T12:25:14.422] POWER: Node sp-r7a-m-dy-sp-8-gb-1-cores-1/sp-r7a-m-dy-sp-8-gb-1-cores-1/10.6.6.31 powered up with instance_id=, instance_type=
[2024-12-28T12:25:30.000] job_time_limit: Configuration for JobId=11 complete
[2024-12-28T12:25:30.000] Resetting JobId=11 start time for node power up
Then the machine is taken away and the job never restarts, failing with "Requested node configuration is not available" - even though there are many such nodes and other jobs start just fine:
[2024-12-29T04:33:44.239] error: slurm_receive_msg [10.6.6.31:53650]: Zero Bytes were transmitted or received
[2024-12-29T04:33:44.241] error: slurm_receive_msg [10.6.6.31:56884]: Zero Bytes were transmitted or received
[2024-12-29T04:34:01.243] error: slurm_receive_msg [10.6.6.31:46926]: Zero Bytes were transmitted or received
[2024-12-29T04:34:18.246] error: slurm_receive_msg [10.6.6.31:45492]: Zero Bytes were transmitted or received
[2024-12-29T04:34:22.085] update_node: node sp-r7a-m-dy-sp-8-gb-1-cores-1 reason set to: Scheduler health check failed
[2024-12-29T04:34:22.085] requeue job JobId=11 due to failure of node sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:34:22.103] powering down node sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:34:22.139] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:34:22.139] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:34:22.139] error: unable to split forward hostlist
[2024-12-29T04:34:22.139] error: _thread_per_group_rpc: no ret_list given
[2024-12-29T04:34:35.248] error: slurm_receive_msg [10.6.6.31:50098]: Zero Bytes were transmitted or received
[2024-12-29T04:34:46.242] error: slurm_receive_msg [10.6.6.31:60004]: Zero Bytes were transmitted or received
[2024-12-29T04:34:52.251] error: slurm_receive_msg [10.6.6.31:60332]: Zero Bytes were transmitted or received
[2024-12-29T04:35:09.253] error: slurm_receive_msg [10.6.6.31:53798]: Zero Bytes were transmitted or received
[2024-12-29T04:35:16.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:35:16.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:36:17.004] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:36:17.004] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.002] Resending TERMINATE_JOB request JobId=11 Nodelist=sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:37:16.006] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:37:16.006] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-1:6818) failed: Name or service not known
[2024-12-29T04:37:16.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-1"
[2024-12-29T04:37:16.007] error: unable to split forward hostlist
[2024-12-29T04:37:16.007] error: _thread_per_group_rpc: no ret_list given
[2024-12-29T04:38:16.001] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-1 not responding, setting DOWN
[2024-12-29T04:38:16.002] _pick_best_nodes: JobId=11 never runnable in partition sp-8-gb-1-cores
[2024-12-29T04:38:16.002] sched: schedule: JobId=11 non-runnable: Requested node configuration is not available
[2024-12-29T04:38:27.000] POWER: power_save: suspending nodes sp-r7a-m-dy-sp-8-gb-1-cores-1
[2024-12-29T04:41:17.000] _pick_best_nodes: JobId=11 never runnable in partition sp-8-gb-1-cores
[2024-12-29T04:41:17.000] sched: schedule: JobId=11 non-runnable: Requested node configuration is not available
I'm attaching all the logs from the HeadNode: [headnode.tar.gz](https://github.com/user-attachments/files/18391720/headnode.tar.gz). I do not understand why parallelcluster/slurm_resume.log stops at 12/28 - maybe because I didn't submit any more jobs on this test cluster. The cluster is still up and the bad job, JobId=11, is still stuck with BadConstraints:
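(The stuck state can be confirmed from the HeadNode with the standard Slurm commands, for example:)
# show the stuck job and its reason
/opt/slurm/bin/squeue -j 11 --Format=JobID,Partition,State,Reason
# full job record, including restart count and the features Slurm thinks it needs
/opt/slurm/bin/scontrol show job 11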
I presently have a script that works around this: it checks for BadConstraints jobs every minute and requeues them on-demand, but that is definitely a kludge:
$ cat requeue_badconstraint_jobs2od.sh
#!/bin/bash
# Script to find all jobs that have BadConstraints as their reason and resubmit them from spot to on-demand.
# Works around a bug where jobs on preempted spot machines are not restarted properly.
# Another workaround is to set --Node=1, but I want to guarantee the job will finish, so I move it to
# on-demand so we're not caught in a spot failure loop.
if [ ! -d /opt/slurm/bin ]; then
    echo "$0: need to run on HeadNode to find /opt/slurm/bin"
    exit 1
fi
# Get the output from squeue and filter jobs with "BadConstraints" in the Reason column
/opt/slurm/bin/squeue --Format=Cluster,Partition,JobID,State,UserName,NumCPUs,MinMemory,Feature,Dependency,Licenses,NodeList:40,Reason -h | \
grep "BadConstraints" | \
while read -r line; do
    # Extract the JobID and Partition from the line
    jobid=$(echo "$line" | awk '{print $3}')
    partition=$(echo "$line" | awk '{print $2}')
    # Replace 'sp-' with 'od-' in the partition name
    new_partition=$(echo "$partition" | sed 's/^sp-/od-/')
    # Notify root
    Mail -s "Requeuing jobid=${jobid} to partition=${new_partition}" root < /dev/null
    # Requeue the job with the updated partition
    /opt/slurm/bin/scontrol update jobid=${jobid} partition=${new_partition}
done
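The script runs every minute; a root crontab entry on the HeadNode along these lines does it (the install path and log file are just examples):
# check for BadConstraints jobs once a minute and move them to on-demand
* * * * * /usr/local/sbin/requeue_badconstraint_jobs2od.sh >> /var/log/requeue_badconstraints.log 2>&1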