(2.2.1 - 3.3.0) Risk of deletion of managed FSx for Lustre file system when updating a cluster
Starting from version 2.2.1, AWS customers using ParallelCluster can use the "managed storage" feature to let ParallelCluster create and manage the shared file system associated with the cluster. Managed storage refers to a volume or file system that is fully handled by ParallelCluster and whose lifecycle is tied to the associated cluster. This means that ParallelCluster can create and delete the file system.
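For reference, here is a minimal sketch of a 3.x managed storage configuration; all values are placeholders. Because no FileSystemId is specified, ParallelCluster creates the file system and ties its lifecycle to the cluster:

```yaml
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      # No FileSystemId: ParallelCluster creates and manages this file system.
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
```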
We have identified an issue: when you update a subnet or security group in the ParallelCluster configuration using an update-cluster API or CLI operation and a managed FSx for Lustre file system is associated with the cluster, ParallelCluster creates a new file system with the new subnet or security group and deletes the old file system. If you have not enabled storage backups, this sequence of operations results in data loss, because the data is not automatically copied to the newly created file system before the old file system is deleted.
This issue impacts all ParallelCluster versions 2.2.1 or greater. In particular, the following version-specific configuration options trigger the replacement of the FSx for Lustre file system when changed (see the configuration sketch after this list for where these settings live in a 3.x configuration):
- For ParallelCluster 3.3.0: Scheduling/SlurmQueues/Networking/SubnetIds
- For ParallelCluster versions from 3.0.0 to 3.2.1: Scheduling/SlurmQueues/Networking/SubnetIds and/or Scheduling/SlurmQueues/Networking/SecurityGroups
- For ParallelCluster versions from 2.2.1 to 2.11.8: vpc-security-group-id
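As a sketch of where these settings live in a ParallelCluster 3.x configuration (the queue name, subnet ID, and security group ID are placeholders):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        # In affected versions, changing either of these settings on a cluster
        # with a managed FSx for Lustre file system triggers its replacement.
        SubnetIds:
          - subnet-0123456789abcdef0
        SecurityGroups:
          - sg-0123456789abcdef0
```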
For ParallelCluster 3.x versions, managed file systems include a resource tag that holds the name of the linked cluster. To identify the FSx for Lustre file system attached to a cluster, use either the `aws fsx describe-file-systems` CLI command or the AWS console and look for a file system whose `parallelcluster:cluster-name` tag points to the cluster name.
For ParallelCluster 2.x versions, use the `aws:cloudformation:stack-name` tag instead. Its value reflects the name of the cluster it is attached to, prefixed by the `parallelcluster-` keyword (`parallelcluster-<cluster-name>`).
If you want to update the cluster subnets or security groups settings, you will have to back up your data and create a new cluster with the updated settings. To retain the data in the existing managed FSx for Lustre file system, back up the data first and restore it into the new file system on the newly created cluster. For more information, see Working with backups in the FSx for Lustre User Guide. Note that scratch-based FSx for Lustre file systems don't support backups; the method described here applies only to FSx for Lustre file systems with a PERSISTENT deployment type.
If you want to use a managed FSx for Lustre file system on the new cluster, you can use the `BackupId` setting to point to the backup snapshot, and ParallelCluster will restore the backup into the new FSx for Lustre file system it creates. The example below shows an FSx for Lustre configuration that uses the `BackupId` setting to specify the backup snapshot to restore into the managed FSx for Lustre file system created along with the new cluster.
```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    BackupId: backup-0c816925eab6d91d6
```
If you would like to create and manage the file system externally and have the backup restored to it, first create the new file system and restore the backup into it (for example, with the `aws fsx create-file-system-from-backup` CLI command). You can then create a new cluster with a storage configuration that points to the `FileSystemId` of the externally created file system. The example below shows a ParallelCluster storage configuration that uses `FileSystemId` to point to the externally created file system; ParallelCluster will create the new cluster and mount the externally managed file system as part of the cluster creation process. For more information on restoring a backup snapshot into a file system, see the Working with backups section in the FSx for Lustre User Guide.
```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    FileSystemId: fs-02e5b4b4abd62d51c
```
If you have a managed FSx for Lustre file system that is scratch-based and you need to update the cluster subnets or security groups settings, you will need to create a new cluster with the updated settings. You can then delete the earlier cluster after copying the needed data from the managed scratch file system of the earlier cluster to the managed scratch file system of the new one, for example by staging it through Amazon S3. Note that scratch-based FSx for Lustre file systems do not support backups.
Before proceeding with any of the options below, we highly recommend that you back up your data. For more information, see Working with backups in the FSx for Lustre User Guide.
With ParallelCluster 3.3.0 we introduced a new `DeletionPolicy` parameter in the SharedStorage configuration that allows you to take full ownership of a managed file system and convert it into an unmanaged one. There are no specific requirements for taking a backup of your FSx for Lustre file system in this case, although we recommend backing up your file system as a best practice.
To convert a managed file system into an unmanaged one, first set the `DeletionPolicy` parameter to `Retain` on the managed file system, as shown in the example below, and update the cluster.
```yaml
- MountDir: /fsx
  Name: fsx-2
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    AutomaticBackupRetentionDays: 1
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 50
    DeletionPolicy: Retain
```
Once the cluster has been updated, identify the file system ID using the method described earlier in the "Identifying managed FSx file systems configured for a cluster" section.
Next, delete the existing storage configuration from the configuration file, add a new configuration that specifies the file system ID, as shown in the example below, and update the cluster once more.
```yaml
- MountDir: /fsx
  Name: fsx-new
  StorageType: FsxLustre
  FsxLustreSettings:
    FileSystemId: fs-02e5b4b4abd62d51c
```
After the update, the cluster mounts the same file system, but its lifecycle is no longer controlled by ParallelCluster.
Once you have backed up your FSx for Lustre file system, you can create a new managed file system and have the backup restored when you update the cluster. You can use the `BackupId` ParallelCluster setting or follow the standard procedure documented by FSx. Note that scratch-based FSx for Lustre file systems don't support backups; the method described here applies only to FSx for Lustre file systems with a PERSISTENT deployment type.
After taking the backup, modify the existing managed file system configuration as shown in the sketch below: changing the file system `Name` creates a new file system and automatically deletes the old one, and ParallelCluster restores the backup specified by `BackupId` into the newly created managed file system.
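A sketch of the change, based on the persistent configuration shown earlier (names and the backup ID are placeholders; as in the `BackupId` example above, the creation parameters are taken from the backup). Before the update:

```yaml
- MountDir: /fsx
  Name: fsx-2
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    AutomaticBackupRetentionDays: 1
    DeploymentType: PERSISTENT_1
    PerUnitStorageThroughput: 50
```

After the update:

```yaml
- MountDir: /fsx
  # New Name: ParallelCluster creates a new managed file system, restores the
  # backup into it, and deletes the old file system.
  Name: fsx-3
  StorageType: FsxLustre
  FsxLustreSettings:
    BackupId: backup-0c816925eab6d91d6
```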
If you have a managed FSx for Lustre file system that is scratch-based, you can simply assign a new `Name` to the existing file system configuration; upon update, ParallelCluster will create a new managed FSx for Lustre file system and delete the previously configured one. Note that scratch-based managed FSx for Lustre file systems do not support backups, so if you need to retain your data, copy the required data to a temporary location (such as an S3 folder) before updating your cluster, and copy it back to the newly created managed FSx for Lustre file system afterwards. The sketch below shows how to change the `Name` setting of an existing configuration before running the cluster update commands.
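A sketch of the rename, assuming a scratch deployment type (names are placeholders). Before the update:

```yaml
- MountDir: /fsx
  Name: fsx
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    DeploymentType: SCRATCH_2
```

After the update:

```yaml
- MountDir: /fsx
  # New Name: ParallelCluster creates a new managed scratch file system and
  # deletes the old one; data must be copied over manually (no backups).
  Name: fsx-renamed
  StorageType: FsxLustre
  FsxLustreSettings:
    StorageCapacity: 1200
    DeploymentType: SCRATCH_2
```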
Note: You can upgrade to ParallelCluster 3.3.1, which prevents managed FSx for Lustre file systems from being replaced during a cluster update by no longer supporting changes to the compute fleet subnet ID.