# (3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute the update procedure
With ParallelCluster 3.9.0-3.10.1, `cfn-hup` runs on all cluster nodes (head node, compute nodes, and login nodes). `cfn-hup` is a daemon that detects changes in the CloudFormation stack and triggers the update procedure. During a cluster update, the head node waits for all nodes to complete their update before signaling a successful cluster update. Due to an issue where `cfn-hup` hangs on some cluster nodes, the update procedure on those nodes is never started, and therefore never completed, causing the cluster update to fail. Although cluster update failures can have other causes, this document discusses failures caused by `cfn-hup` hanging.
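On each node, `cfn-hup` polls the stack's metadata and runs hooks when it changes. If you want to inspect the hook configuration on a node, the standard `cfn-hup` locations are shown below; treat the exact paths on ParallelCluster AMIs as an assumption:

```bash
# Standard cfn-hup configuration locations (exact paths on ParallelCluster
# AMIs are an assumption): the main config and the update hooks.
cat /etc/cfn/cfn-hup.conf
cat /etc/cfn/hooks.d/*.conf
```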
You can see the cluster status via `pcluster list-clusters`:
```
$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
      "clusterStatus": "UPDATE_FAILED",
      ...
    }
  ]
}
```
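For more detail on the failed update, you can also describe the cluster. A hedged example follows; the cluster name and region are placeholders, not values from your cluster:

```bash
# Hedged example: show full cluster details for the failed update.
# Cluster name and region are placeholders; substitute your own values.
pcluster describe-cluster --cluster-name test --region us-west-2
```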
`/var/log/chef-client.log` on the head node contains an error with the following information:
```
================================================================================
Error executing action `run` on resource 'execute[Check cluster readiness]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of /opt/parallelcluster/pyenv/versions/3.9.19/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py --cluster-name demo-cluster --table-name parallelcluster-demo-cluster --config-version 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy --region us-west-2 ----
STDOUT:
STDERR: INFO:__main__:Checking cluster readiness with arguments: cluster_name=demo-cluster, table_name=parallelcluster-demo-cluster, config_version=78dPxxb06z0XXX00hMwGxxxfzwxxPlYy, region=us-west-2
INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster demo-cluster is 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy
INFO:botocore.credentials:Found credentials from IAM Role: demo-cluster-RoleHeadNode-xxxx
INFO:__main__:Found batch of 4 cluster node(s): ['i-xxxxxxxxxxxxxxxxx', 'i-yyyyyyyyyyyyyyyyy', 'i-aaaaaaaaaaaaaaaaa', 'i-bbbbbbbbbbbbbbbbb']
INFO:__main__:Retrieved 4 DDB item(s):
{'Id': {'S': 'CLUSTER_CONFIG.i-xxxxxxxxxxxxxxxxx'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:50 UTC'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:37 UTC'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-aaaaaaaaaaaaaaaaa'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:38 UTC'}}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-bbbbbbbbbbbbbbbbb'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:33 UTC'}}}}
ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]
```
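You can reproduce this check by querying the cluster's DynamoDB table directly. Below is a hedged sketch using the table name and region from the log above; the key layout (`Id` of the form `CLUSTER_CONFIG.<instance-id>`) is taken from the DDB items printed in the error:

```bash
# Hedged sketch: list the deployed config version per compute node from the
# cluster's DynamoDB table (table name and region taken from the log above).
aws dynamodb scan \
  --region us-west-2 \
  --table-name parallelcluster-demo-cluster \
  --filter-expression 'begins_with(Id, :prefix)' \
  --expression-attribute-values '{":prefix": {"S": "CLUSTER_CONFIG."}}' \
  --query 'Items[].{node: Id.S, version: Data.M.cluster_config_version.S}' \
  --output table
```

Nodes whose version differs from the one passed to `check_cluster_ready.py` are the ones blocking the update.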
The error means that EC2 instances `i-yyyyyyyyyyyyyyyyy` and `i-bbbbbbbbbbbbbbbbb` were not updated with the latest cluster configuration. This may happen because `cfn-hup` hangs.
You can verify that `cfn-hup` is hanging by running `ps aux | grep cfn-hup` on the affected instances. Below are examples of normal and irregular output:
```
# Normal output contains only two lines.
root     ... ?     S  03:24 0:00 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
ec2-user ... pts/0 S+ 03:30 0:00 grep --color=auto cfn-hup

# Irregular output contains three lines, where the additional line is the hanging script.
ec2-user ... pts/1 S+ 00:23 0:00 grep --color=auto cfn-hup
root     ... ?     S  Aug02 0:01 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
root     ... ?     S  Aug04 0:00 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/python3.9 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-hup --no-daemon --verbose
```
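Since `cfn-hup` runs under supervisord on cluster nodes (the restart command in Option 2 below goes through `supervisorctl`), you can also check its state that way. A hedged alternative check:

```bash
# Hedged alternative check: query the cfn-hup process state via supervisorctl,
# using the same virtualenv path as the restart command in Option 2 below.
sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl status cfn-hup
```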
## Affected versions (OSes, schedulers)

- ParallelCluster 3.9.0-3.10.1
- Slurm scheduler
- All operating systems
To recover from this situation and restore update functionality, apply one of the following two procedures: stop the compute fleet, or restart `cfn-hup` on the affected instances. Stopping the compute fleet is easy to execute but pauses all running jobs; restarting `cfn-hup` requires manual effort but does not affect running jobs.
## Option 1: Stopping the compute fleet
- Stop the compute fleet:
  ```bash
  pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status STOP_REQUESTED
  ```
- Update the cluster again:
  ```bash
  pcluster update-cluster ...
  ```
- After the update is successful, start the compute fleet (see the worked example after this list):
  ```bash
  pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status START_REQUESTED
  ```
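A hedged end-to-end example of Option 1, with placeholder values (`demo-cluster`, `us-west-2`, and `cluster-config.yaml` are assumptions, not values from your cluster):

```bash
# Hedged end-to-end example of Option 1. Cluster name, region, and config
# file are placeholders; substitute your own values.
pcluster update-compute-fleet --region us-west-2 --cluster-name demo-cluster --status STOP_REQUESTED

# Wait until the fleet reports STOPPED before updating.
pcluster describe-compute-fleet --region us-west-2 --cluster-name demo-cluster

# Retry the update with your configuration file.
pcluster update-cluster --region us-west-2 --cluster-name demo-cluster --cluster-configuration cluster-config.yaml

# After the update succeeds, start the fleet again.
pcluster update-compute-fleet --region us-west-2 --cluster-name demo-cluster --status START_REQUESTED
```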
## Option 2: Restarting `cfn-hup` on the affected instances
- Find the list of affected instances from the error in `/var/log/chef-client.log` on the head node. For example, the affected instances are `i-yyyyyyyyyyyyyyyyy` and `i-bbbbbbbbbbbbbbbbb` in the example error (a helper for extracting the IDs is sketched after these steps):
  ```
  wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]
  ```
- Run the following command on every instance on the list:
  ```bash
  sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup
  ```
  The output should be:
  ```
  cfn-hup: stopped
  cfn-hup: started
  ```
  (To connect to an instance, you can use SSH through the head node, or Session Manager.)
- Update the cluster again:
  ```bash
  pcluster update-cluster ...
  ```
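If the affected nodes are reachable through Session Manager, the first two steps above can be combined. The following is a hedged sketch, not a confirmed procedure: the `grep` pattern assumes real instance IDs (17 lowercase hex characters, unlike the `i-yyy.../i-bbb...` placeholders above), and it assumes the role you run it with is allowed to send SSM commands to the nodes.

```bash
# Hedged sketch, run on the head node: extract the affected instance IDs from
# the "wrong records" line of the chef-client log, then restart cfn-hup on
# each via SSM Run Command. Assumes real (hex) instance IDs and SSM access.
for id in $(sudo grep 'wrong records' /var/log/chef-client.log | grep -oE 'i-[0-9a-f]{17}' | sort -u); do
  aws ssm send-command \
    --region us-west-2 \
    --instance-ids "$id" \
    --document-name "AWS-RunShellScript" \
    --parameters 'commands=["/opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup"]'
done
```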