Best Practices for Upgrading a Cluster
While some configuration parameters can be updated while a cluster is running (see: pcluster update policy), other changes can only be applied by creating a new cluster. Changes that can be made to a running cluster are referred to as updates and are applied with the pcluster update command. Examples of settings that can be changed on a running cluster include compute instance types, job submission queues, and security groups. Examples of settings that cannot be changed with the pcluster update command include adding or removing EBS volumes, adding, removing, or modifying pre- or post-install scripts, and adding or removing EFS or FSx file systems. Furthermore, the pcluster update command will not add support for features introduced in later versions of ParallelCluster (e.g. a cluster created with version 2.8.1 cannot add support for the heterogeneous instance types feature released in version 2.9.0 without first upgrading the installed version of ParallelCluster and recreating the cluster). Incorporating any of these changes requires you to recreate your cluster. Changes for which you first install a later version of ParallelCluster and then recreate the cluster are referred to as upgrades.
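The version rule in the example above can be sketched as a small shell check. This is only an illustration: the function names version_ge and update_or_upgrade are hypothetical helpers, not part of ParallelCluster, and the comparison assumes a sort that supports -V (version sort), as in GNU coreutils. A result of "update" only means the cluster's version already includes the feature; whether a given setting can actually be changed on a running cluster still depends on the update policy.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the pcluster CLI.

# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Arguments: <version the cluster was created with> <version that introduced the feature>
# Prints "upgrade" when the cluster predates the feature and must be recreated
# with a later version; prints "update" otherwise.
update_or_upgrade() {
  if version_ge "$1" "$2"; then
    echo "update"
  else
    echo "upgrade"
  fi
}

# The example from the text: a 2.8.1 cluster and a feature released in 2.9.0.
update_or_upgrade 2.8.1 2.9.0   # prints: upgrade
```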
To assist customers with the upgrade process, we have created a non-exhaustive list of recommendations and best practices to consider when performing an upgrade action that requires the cluster to be recreated.
-
If you are using EBS volumes in your existing cluster that you wish to reuse, you can snapshot those volumes so that you can attach a duplicate of these volumes to your new cluster.
-
If you are using an FSx for Lustre file system, you may wish to synchronize all of your data back to S3. This ensures that, whether you create a new file system or reuse the existing one, you have access to all of the data on your existing cluster. For more details on how to share resources between clusters or how to reuse an existing Elastic File System (EFS), you can review the information here.
-
Check the version of ParallelCluster by using the pcluster version command and verify that this matches the version you wish to use. If you want to ensure that you are using the latest version of ParallelCluster, you can run pip3 install --upgrade aws-parallelcluster. Note that upgrading ParallelCluster through pip will not upgrade any currently running clusters; running clusters continue to operate with the version of ParallelCluster with which they were created. If you wish to run clusters using multiple versions of ParallelCluster from a single account, you may wish to use a separate Python virtual environment for each version you need to run. You can verify the version of ParallelCluster used by your clusters with the pcluster list command (with an optional -r argument to specify the region for which you want this information).
-
If you are using a custom AMI, you will need to recreate your AMI by following the instructions in our documentation (AMI Customization) to ensure it is compatible with the version of ParallelCluster you wish to use.
-
In some cases, you may need to update the configuration file that was used to create your cluster (by default the filepath for this is ~/.parallelcluster/config). For example, versions of ParallelCluster >= 2.7.0 require you to specify the job scheduler you wish to use, whereas earlier versions assumed a default value of sge when this parameter was omitted. We recommend saving the updated configuration as a new file in case you wish to refer back later to the configuration used to create your first cluster. When upgrading to a version >= 2.9.0, you may also benefit from the pcluster-config convert utility (see: pcluster-config), which can be used to update a ParallelCluster configuration file to be compatible with the functionality available in the latest version of ParallelCluster.
-
Once you have prepared your new cluster for creation, you can create it using the pcluster create -c command, supplying the filepath to your new configuration file.
-
As a best practice, we recommend logging in to your new cluster and submitting sample jobs representative of your workload before deleting your previous cluster. Note that some components or packages are likely to have changed between versions (such as the version of Slurm, Open MPI, or CUDA packaged inside the ParallelCluster AMI).
-
Once you’ve been able to verify that your new cluster is working as intended, you may wish to stop or delete your old cluster using the pcluster stop or pcluster delete commands, respectively. Be advised that the pcluster stop command will leave the cluster’s head node and any attached file systems running and that the pcluster delete command will permanently delete all CloudFormation resources associated with your cluster. You will not be able to restore this cluster once it is deleted, so we advise doing this only after you are certain that you are ready to transition all of your workloads to your new cluster.
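The recommendations above can be strung together roughly as in the following sketch. It is an outline under stated assumptions, not a definitive procedure: the cluster names, region, volume ID, virtual environment directory, and new configuration path are hypothetical placeholders, and the run helper with its DRY_RUN variable is an illustrative wrapper that only prints each command unless DRY_RUN=0 is set. The aws ec2 create-snapshot call comes from the AWS CLI, which this page does not mention, and is shown as one possible way to snapshot a volume. The old cluster is assumed to be managed with the pcluster found on your PATH.

```shell
#!/usr/bin/env bash
# Sketch of the upgrade workflow. All names below are hypothetical placeholders.
OLD_CLUSTER="old-cluster"
NEW_CLUSTER="new-cluster"
REGION="us-east-1"
VOLUME_ID="vol-0123456789abcdef0"
VENV_DIR="$HOME/pcluster-new-venv"
OLD_CONFIG="$HOME/.parallelcluster/config"      # default path named in the text
NEW_CONFIG="$HOME/.parallelcluster/config-new"  # assumed name for the new file

# Illustrative helper: prints the command by default (dry run);
# set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prepare_new_cluster() {
  # Snapshot an EBS volume you want to reuse (AWS CLI, an assumption here).
  run aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
      --description "Snapshot before ParallelCluster upgrade"

  # Install the newer ParallelCluster in its own virtual environment.
  run python3 -m venv "$VENV_DIR"
  run "$VENV_DIR/bin/pip3" install --upgrade aws-parallelcluster
  run "$VENV_DIR/bin/pcluster" version

  # Check which version each existing cluster in the region was created with.
  run "$VENV_DIR/bin/pcluster" list -r "$REGION"

  # Keep the original configuration; edit the copy for the new version.
  run cp "$OLD_CONFIG" "$NEW_CONFIG"

  # Create the new cluster from the new configuration file.
  run "$VENV_DIR/bin/pcluster" create -c "$NEW_CONFIG" "$NEW_CLUSTER"
}

# Only after sample jobs have been verified on the new cluster.
stop_old_cluster() {
  run pcluster stop "$OLD_CLUSTER"
}

# Irreversible: deletes all CloudFormation resources of the old cluster.
delete_old_cluster() {
  run pcluster delete "$OLD_CLUSTER"
}

prepare_new_cluster
```

Keeping the dry run as the default means the irreversible pcluster delete step cannot run by accident; stop_old_cluster and delete_old_cluster are deliberately left uncalled so that they are invoked by hand once the new cluster has been validated.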