(3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure

The issue

With ParallelCluster 3.9.0-3.10.1, cfn-hup runs on all cluster nodes (head node, compute nodes, login nodes). cfn-hup is a daemon that detects changes in CloudFormation stacks and triggers the update procedure. During a cluster update, the head node waits for all nodes to complete their update before signaling successful cluster update. Due to an issue where cfn-hup hangs on some cluster nodes, the update procedure on those nodes is never started and therefore never completes, causing the cluster update to fail. Although cluster update failures can have other causes, this document discusses the cluster update failure caused by cfn-hup hanging.

Error details

You can see the cluster status via pcluster list-clusters:

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
      "clusterStatus": "UPDATE_FAILED",
      ...
    }
  ]
}

/var/log/chef-client.log on the head node contains an error with the following information:

    ================================================================================
    Error executing action `run` on resource 'execute[Check cluster readiness]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /opt/parallelcluster/pyenv/versions/3.9.19/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py --cluster-name demo-cluster --table-name parallelcluster-demo-cluster --config-version 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy --region us-west-2 ----
    STDOUT: 
    STDERR: INFO:__main__:Checking cluster readiness with arguments: cluster_name=demo-cluster, table_name=parallelcluster-demo-cluster, config_version=78dPxxb06z0XXX00hMwGxxxfzwxxPlYy, region=us-west-2
    INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster demo-cluster is 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy
    INFO:botocore.credentials:Found credentials from IAM Role: demo-cluster-RoleHeadNode-xxxx
    INFO:__main__:Found batch of 4 cluster node(s): ['i-xxxxxxxxxxxxxxxxx', 'i-yyyyyyyyyyyyyyyyy', 'i-aaaaaaaaaaaaaaaaa', 'i-bbbbbbbbbbbbbbbbb']
    INFO:__main__:Retrieved 4 DDB item(s):
        {'Id': {'S': 'CLUSTER_CONFIG.i-xxxxxxxxxxxxxxxxx'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:50 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:37 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-aaaaaaaaaaaaaaaaa'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:38 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-bbbbbbbbbbbbbbbbb'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:33 UTC'}}}}
    ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
      * missing records (0): []
      * incomplete records (0): []
      * wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]

The error means that EC2 instances i-yyyyyyyyyyyyyyyyy and i-bbbbbbbbbbbbbbbbb were not updated with the latest cluster configuration. This may happen because cfn-hup hangs on those instances.
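
To double-check which configuration version a given instance reported, you can query the cluster's DynamoDB table directly. This is a minimal sketch assuming the table name, key layout, and region shown in the log above; substitute your own values:

aws dynamodb get-item \
  --region us-west-2 \
  --table-name parallelcluster-demo-cluster \
  --key '{"Id": {"S": "CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy"}}' \
  --query 'Item.Data.M.cluster_config_version.S' \
  --output text

If the printed version differs from the expected config version (78dPxxb06z0XXX00hMwGxxxfzwxxPlYy in this example), the instance has not applied the latest configuration.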

You can verify that cfn-hup is hanging by running ps aux | grep cfn-hup on the affected instances. Below are the normal and irregular outputs:

# Normal output only contains two lines
root        ... ?        S    03:24   0:00 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
ec2-user    ... pts/0    S+   03:30   0:00 grep --color=auto cfn-hup

# Irregular output contains three lines, where the additional line is the hanging script.
ec2-user    ... pts/1    S+   00:23   0:00 grep --color=auto cfn-hup
root        ... ?        S    Aug02   0:01 /bin/bash /opt/parallelcluster/scripts/cfn-hup-runner.sh
root        ... ?        S    Aug04   0:00 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/python3.9 /opt/parallelcluster/pyenv/versions/3.9.19/envs/cfn_bootstrap_virtualenv/bin/cfn-hup --no-daemon --verbose
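
If you need to run this check on several instances without logging in to each one, you can use SSM Run Command. This is a sketch assuming the instances are managed by AWS Systems Manager; the instance IDs and region are placeholders taken from the example above:

# Run the process check remotely; the [c]fn-hup pattern excludes the grep process itself
aws ssm send-command \
  --region us-west-2 \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-yyyyyyyyyyyyyyyyy i-bbbbbbbbbbbbbbbbb \
  --parameters 'commands=["ps aux | grep [c]fn-hup"]'

You can then read each instance's output with aws ssm get-command-invocation.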

Affected versions (OSes, schedulers)

  • ParallelCluster 3.9.0-3.10.1
  • Slurm scheduler
  • All operating systems

Mitigation

To recover from this situation and restore update functionality, apply one of the following two procedures: stopping the compute fleet, or restarting cfn-hup on the affected instances. Stopping the compute fleet is easy to execute, but pauses all running jobs. Restarting cfn-hup requires manual effort, but does not affect running jobs.

Option 1: Stopping the compute fleet

  1. Stop the compute fleet (to confirm it has fully stopped before updating, see the polling sketch after this list):
pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status STOP_REQUESTED
  2. Update the cluster again:
pcluster update-cluster ...
  3. After the update is successful, start the compute fleet:
pcluster update-compute-fleet --region REGION --cluster-name CLUSTER_NAME --status START_REQUESTED
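
The compute fleet may take some time to stop. Before rerunning the update, you can poll until the fleet reports STOPPED. A minimal sketch assuming jq is installed; REGION and CLUSTER_NAME are placeholders:

# Poll the compute fleet status until it reaches STOPPED
until pcluster describe-compute-fleet --region REGION --cluster-name CLUSTER_NAME | jq -e '.status == "STOPPED"' >/dev/null; do
  sleep 10
done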

Option 2: Restarting cfn-hup on the affected instances

  1. Find the list of affected instances from the error in /var/log/chef-client.log on the head node. For example, the affected instances are i-yyyyyyyyyyyyyyyyy and i-bbbbbbbbbbbbbbbbb in the example error:
wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]
  2. Run the following command on every instance on the list:
sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup

The output should be:

cfn-hup: stopped 
cfn-hup: started

(To connect to the instance, you can use SSH going through the head node or Session Manager.)
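
Alternatively, if the instances are managed by AWS Systems Manager, you can restart cfn-hup on all affected instances at once with SSM Run Command. This is a sketch; the instance IDs and region are placeholders taken from the example above:

# Restart cfn-hup via supervisorctl on each affected instance
aws ssm send-command \
  --region us-west-2 \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-yyyyyyyyyyyyyyyyy i-bbbbbbbbbbbbbbbbb \
  --parameters 'commands=["sudo /opt/parallelcluster/pyenv/versions/*/envs/cookbook_virtualenv/bin/supervisorctl restart cfn-hup"]'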

  3. Update the cluster again:
pcluster update-cluster ...
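
After the update completes, you can verify that the cluster reached a healthy state. A minimal check assuming jq is installed; REGION and CLUSTER_NAME are placeholders:

pcluster describe-cluster --region REGION --cluster-name CLUSTER_NAME | jq -r .clusterStatus
# Expected output: UPDATE_COMPLETE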