(3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure

The issue

With ParallelCluster 3.9.0-3.10.1, cfn-hup runs on all cluster nodes (head node, compute nodes, login nodes). cfn-hup is a daemon that detects changes in CloudFormation stacks and triggers the update procedure. During a cluster update, the head node waits for all nodes to complete their update before signaling successful cluster update. Due to an issue where cfn-hup hangs on some cluster nodes, the update procedure on those nodes is never started and therefore never completes, causing the cluster update to fail. Although cluster update failures can have other causes, this document discusses the cluster update failure caused by cfn-hup hanging.

Error details

You can see the cluster status via pcluster list-clusters:

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
      "clusterStatus": "UPDATE_FAILED",
      ...
    }
  ]
}

/var/log/chef-client.log on the head node contains an error with the following information:

    ================================================================================
    Error executing action `run` on resource 'execute[Check cluster readiness]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /opt/parallelcluster/pyenv/versions/3.9.19/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py --cluster-name demo-cluster --table-name parallelcluster-demo-cluster --config-version 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy --region us-west-2 ----
    STDOUT: 
    STDERR: INFO:__main__:Checking cluster readiness with arguments: cluster_name=demo-cluster, table_name=parallelcluster-demo-cluster, config_version=78dPxxb06z0XXX00hMwGxxxfzwxxPlYy, region=us-west-2
    INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster demo-cluster is 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy
    INFO:botocore.credentials:Found credentials from IAM Role: demo-cluster-RoleHeadNode-xxxx
    INFO:__main__:Found batch of 4 cluster node(s): ['i-xxxxxxxxxxxxxxxxx', 'i-yyyyyyyyyyyyyyyyy', 'i-aaaaaaaaaaaaaaaaa', 'i-bbbbbbbbbbbbbbbbb']
    INFO:__main__:Retrieved 4 DDB item(s):
        {'Id': {'S': 'CLUSTER_CONFIG.i-xxxxxxxxxxxxxxxxx'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:50 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:37 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-aaaaaaaaaaaaaaaaa'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:38 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-bbbbbbbbbbbbbbbbb'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:33 UTC'}}}}
    ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
      * missing records (0): []
      * incomplete records (0): []
      * wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]

The error means that EC2 instances i-yyyyyyyyyyyyyyyyy and i-bbbbbbbbbbbbbbbbb were not updated with the latest cluster configuration. This may happen because cfn-hup hangs on those instances.
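
To double-check which configuration version a given instance reported, you can query the cluster's DynamoDB table directly. This is a minimal sketch assuming the table name, key layout, and region shown in the log above; substitute your own values:

aws dynamodb get-item \
  --region us-west-2 \
  --table-name parallelcluster-demo-cluster \
  --key '{"Id": {"S": "CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy"}}' \
  --query 'Item.Data.M.cluster_config_version.S' \
  --output text

If the printed version differs from the expected config version (78dPxxb06z0XXX00hMwGxxxfzwxxPlYy in this example), the instance has not applied the latest configuration.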

You can verify that cfn-hup is hanging by running ps aux | grep cfn-hup on the affected instances. Below are the normal and irregular outputs:

# Normal output only contains two lines
root        ... ?        S    03:24   0:00 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
ec2-user    ... pts/0    S+   03:30   0:00 grep --color=auto cfn-hup

# Irregular output contains three lines, where the additional line is the hanging script.
ec2-user    ... pts/1    S+   00:23   0:00 grep --color=auto cfn-hup
root        ... ?        S    Aug02   0:01 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
root        ... ?        S    Aug04   0:00 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/python3.9 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-hup --no-daemon --verbose
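
If you need to run this check on several instances without logging in to each one, you can use SSM Run Command. This is a sketch assuming the instances are managed by AWS Systems Manager; the instance IDs and region are placeholders taken from the example above:

# Run the process check remotely; the [c]fn-hup pattern excludes the grep process itself
aws ssm send-command \
  --region us-west-2 \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-yyyyyyyyyyyyyyyyy i-bbbbbbbbbbbbbbbbb \
  --parameters 'commands=["ps aux | grep [c]fn-hup"]'

You can then read each instance's output with aws ssm get-command-invocation.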

Affected versions (OSes, schedulers)

  • ParallelCluster 3.9.0-3.10.1
  • Slurm scheduler
  • All operating systems

Mitigation

To recover from this situation and restore update functionality, apply one of the following two procedures: stopping the compute fleet, or restarting cfn-hup on the affected instances. Stopping the compute fleet is easy to execute, but pauses all running jobs. Restarting cfn-hup requires manual effort, but does not affect running jobs.

Option 1: Stopping the compute fleet

  1. Stop the compute fleet (to confirm it has fully stopped before updating, see the polling sketch after this list):
pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status STOP_REQUESTED
  2. Update the cluster again:
pcluster update-cluster ...
  3. After the update is successful, start the compute fleet:
pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status START_REQUESTED
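
The compute fleet may take some time to stop. Before rerunning the update, you can poll until the fleet reports STOPPED. A minimal sketch assuming jq is installed; REGION and CLUSTER_NAME are placeholders:

# Poll the compute fleet status until it reaches STOPPED
until pcluster describe-compute-fleet --region REGION --cluster-name CLUSTER_NAME | jq -e '.status == "STOPPED"' >/dev/null; do
  sleep 10
done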

Option 2: Restarting cfn-hup on the affected instances

  1. Find the list of affected instances from the error in /var/log/chef-client.log on the head node. For example, the affected instances are i-yyyyyyyyyyyyyyyyy and i-bbbbbbbbbbbbbbbbb in the example error:
wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]
  2. Run the following command on every instance on the list:
sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup

The output should be:

cfn-hup: stopped 
cfn-hup: started

(To connect to the instance, you can use SSH going through the head node or Session Manager.)
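
Alternatively, if the instances are managed by AWS Systems Manager, you can restart cfn-hup on all affected instances at once with SSM Run Command. This is a sketch; the instance IDs and region are placeholders taken from the example above:

# Restart cfn-hup via supervisorctl on each affected instance
aws ssm send-command \
  --region us-west-2 \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-yyyyyyyyyyyyyyyyy i-bbbbbbbbbbbbbbbbb \
  --parameters 'commands=["sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup"]'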

  3. Update the cluster again:
pcluster update-cluster ...
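
After the update completes, you can verify that the cluster reached a healthy state. A minimal check assuming jq is installed; REGION and CLUSTER_NAME are placeholders:

pcluster describe-cluster --region REGION --cluster-name CLUSTER_NAME | jq -r .clusterStatus
# Expected output: UPDATE_COMPLETE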