(2.2.1 - 3.3.0) Risk of deletion of managed FSx for Lustre file system when updating a cluster
Starting from version 2.2.1, AWS customers using ParallelCluster can use the "managed storage" feature to let ParallelCluster create and manage the shared file system associated with the cluster. Managed storage refers to a volume or file system that is fully handled by ParallelCluster and whose lifecycle is tied to the associated cluster. This means that ParallelCluster can create and delete the file system.
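For reference, here is a minimal sketch of a 3.x managed storage configuration; all values are placeholders. Because no FileSystemId is specified, ParallelCluster creates the file system and ties its lifecycle to the cluster:

```yaml
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      # No FileSystemId: ParallelCluster creates and manages this file system.
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
```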
We have identified an issue: when you update a subnet or security group in the ParallelCluster configuration using an update-cluster API or CLI operation and a managed FSx for Lustre file system is associated with the cluster, ParallelCluster creates a new file system with the new subnet or security group and deletes the old file system. If you have not enabled storage backups, this sequence of operations results in data loss, because the data is not automatically copied to the newly created file system before the old file system is deleted.
This issue impacts all ParallelCluster versions 2.2.1 or greater. In particular, the following version-specific configuration options trigger the replacement of the FSx for Lustre file system when changed (see the configuration sketch after this list for where these settings live in a 3.x configuration):
- For ParallelCluster 3.3.0: Scheduling/SlurmQueues/Networking/SubnetIds
- For ParallelCluster versions from 3.0.0 to 3.2.1: Scheduling/SlurmQueues/Networking/SubnetIds and/or Scheduling/SlurmQueues/Networking/SecurityGroups
- For ParallelCluster versions from 2.2.1 to 2.11.8: vpc-security-group-id
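As a sketch of where these settings live in a ParallelCluster 3.x configuration (the queue name, subnet ID, and security group ID are placeholders):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        # In affected versions, changing either of these settings on a cluster
        # with a managed FSx for Lustre file system triggers its replacement.
        SubnetIds:
          - subnet-0123456789abcdef0
        SecurityGroups:
          - sg-0123456789abcdef0
```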
For ParallelCluster 3.x versions, managed file systems include a resource tag that holds the name of the linked cluster. To identify the FSx for Lustre file system attached to a cluster, use either the `aws fsx describe-file-systems` CLI command or the AWS console and look for a file system whose `parallelcluster:cluster-name` tag points to the cluster name.
For ParallelCluster 2.x versions, use the `aws:cloudformation:stack-name` tag instead. Its value reflects the name of the cluster it is attached to, prefixed by the `parallelcluster-` keyword (`parallelcluster-<cluster-name>`).
If you want to update the cluster subnets or security groups settings, you will have to back up your data and create a new cluster with the updated settings. To retain the data in the existing managed FSx for Lustre file system, back up the data first and restore it into the new file system on the newly created cluster. For more information, see Working with backups in the FSx for Lustre User Guide. Note that scratch-based FSx for Lustre file systems don't support backups; the method described here applies only to FSx for Lustre file systems with a PERSISTENT deployment type.
If you want to use a managed FSx for Lustre file system on the new cluster, you can use the `BackupId` setting to point to the backup snapshot, and ParallelCluster will restore the backup into the new FSx for Lustre file system it creates. The example below shows an FSx for Lustre configuration that uses the `BackupId` setting to specify the backup snapshot to restore into the managed FSx for Lustre file system created along with the new cluster.
```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    BackupId: backup-0c816925eab6d91d6
```
If you would like to create and manage the file system externally and have the backup restored to it, first create the new file system and restore the backup into it (for example, with the `aws fsx create-file-system-from-backup` CLI command). You can then create a new cluster with a storage configuration that points to the `FileSystemId` of the externally created file system. The example below shows a ParallelCluster storage configuration that uses `FileSystemId` to point to the externally created file system; ParallelCluster will create the new cluster and mount the externally managed file system as part of the cluster creation process. For more information on restoring a backup snapshot into a file system, see the Working with backups section in the FSx for Lustre User Guide.
```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    FileSystemId: fs-02e5b4b4abd62d51c
```
If you have a managed FSx for Lustre file system that is scratch-based and you need to update the cluster subnets or security groups settings, you will need to create a new cluster with the updated settings. You can then delete the earlier cluster after copying the needed data from the managed scratch file system of the earlier cluster to the managed scratch file system of the new one, for example by staging it through Amazon S3. Note that scratch-based FSx for Lustre file systems do not support backups.
Before proceeding with any of the options below, we highly recommend that you back up your data. For more information, see Working with backups in the FSx for Lustre User Guide.
With ParallelCluster 3.3.0 we introduced a new `DeletionPolicy` parameter in the SharedStorage configuration that allows you to take full ownership of a managed file system and convert it into an unmanaged one. There are no specific requirements for taking a backup of your FSx for Lustre file system in this case, although we recommend backing up your file system as a best practice.
To convert a managed file system into an unmanaged one, first set the `DeletionPolicy` parameter to `Retain` on the managed file system, as shown in the example below, and update the cluster.
```yaml
- MountDir: /fsx
  Name: fsx-2
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    AutomaticBackupRetentionDays: 1
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 50
    DeletionPolicy: Retain
```
Once the cluster has been updated, identify the file system ID using the method described earlier in the "Identifying managed FSx file systems configured for a cluster" section.
Next, delete the existing storage configuration from the configuration file, add a new configuration that specifies the file system ID, as shown in the example below, and update the cluster once more.
```yaml
- MountDir: /fsx
  Name: fsx-new
  StorageType: FsxLustre
  FsxLustreSettings:
    FileSystemId: fs-02e5b4b4abd62d51c
```
After the update, the cluster mounts the same file system, but its lifecycle is no longer controlled by ParallelCluster.
Once you have backed up your FSx for Lustre file system, you can create a new managed file system and have the backup restored when you update the cluster. You can use the `BackupId` ParallelCluster setting or follow the standard procedure documented by FSx. Note that scratch-based FSx for Lustre file systems don't support backups; the method described here applies only to FSx for Lustre file systems with a PERSISTENT deployment type.
After taking the backup, modify the existing managed file system configuration as shown in the sketch below: changing the file system `Name` creates a new file system and automatically deletes the old one, and ParallelCluster restores the backup specified by `BackupId` into the newly created managed file system.
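A sketch of the change, based on the persistent configuration shown earlier (names and the backup ID are placeholders; as in the `BackupId` example above, the creation parameters are taken from the backup). Before the update:

```yaml
- MountDir: /fsx
  Name: fsx-2
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    AutomaticBackupRetentionDays: 1
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 50
```

After the update:

```yaml
- MountDir: /fsx
  # New Name: ParallelCluster creates a new managed file system, restores the
  # backup into it, and deletes the old file system.
  Name: fsx-3
  StorageType: FsxLustre
  FsxLustreSettings:
    BackupId: backup-0c816925eab6d91d6
```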
If you have a managed FSx for Lustre file system that is scratch-based, you can simply assign a new `Name` to the existing file system configuration; upon update, ParallelCluster will create a new managed FSx for Lustre file system and delete the previously configured one. Note that scratch-based managed FSx for Lustre file systems do not support backups, so if you need to retain your data, copy the required data to a temporary location (such as an S3 folder) before updating your cluster, and copy it back to the newly created managed FSx for Lustre file system afterwards. The sketch below shows how to change the `Name` setting of an existing configuration before running the cluster update commands.
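A sketch of the rename, assuming a scratch deployment type (names are placeholders). Before the update:

```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    DeploymentType: SCRATCH_2
```

After the update:

```yaml
- MountDir: /fsx
  # New Name: ParallelCluster creates a new managed scratch file system and
  # deletes the old one; data must be copied over manually (no backups).
  Name: fsx-renamed
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    DeploymentType: SCRATCH_2
```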
Note: You can upgrade to ParallelCluster 3.3.1, which prevents managed FSx for Lustre file systems from being replaced during a cluster update by no longer supporting changes to the compute fleet subnet ID.