1-1 restore service implementation #4199

Open
karol-kokoszka opened this issue Jan 13, 2025 · 7 comments

@karol-kokoszka
Collaborator

This issue is about creating a full service that performs a 1-1 restore of vnode keyspaces/column families.
Design doc: https://docs.google.com/document/d/18jhhNo90JWy6fIgPCks3RI1zhHLI7eBLKpHGJRBmdpM/edit?tab=t.0#bookmark=id.e6nd5pbe9bjq

The service must work on the following input:

  • nodes mapping (map[string]string where the source node host ID is the key and the destination node host ID is the value)
  • source-cluster-id (needed to find the snapshot)
  • snapshot-id (identifies which snapshot of the source cluster to use to restore the data)
  • backup location
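
A minimal Go sketch of this input, assuming a hypothetical OneToOneRestoreTarget type (the type and field names are illustrative, not an existing SM API):

```go
package restore

// OneToOneRestoreTarget is a hypothetical input for the 1-1 restore service;
// the type and field names are illustrative only.
type OneToOneRestoreTarget struct {
	// NodeMappings maps source node host ID -> destination node host ID.
	NodeMappings map[string]string `json:"node_mappings"`
	// SourceClusterID identifies the cluster that produced the backup
	// (used to locate the snapshot in the backup location).
	SourceClusterID string `json:"source_cluster_id"`
	// SnapshotTag identifies which snapshot of the source cluster to restore.
	SnapshotTag string `json:"snapshot_tag"`
	// Location is the backup location, e.g. "s3:my-backup-bucket".
	Location string `json:"location"`
}
```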

The restore process consists of the following stages:

  • validation
    The service must check if agents can reach the backup location.
    The service must verify that the cluster topology saved in the snapshot matches the destination cluster topology.
    The service must verify that the source cluster's token ring (read from the manifest files) matches the token ring of the destination cluster.
    This process can be split into two parts (reading the source cluster topology and reading the destination cluster topology).
    If validation fails, the service must stop and return meaningful error information.
  • copy-data
    The nodes mapping determines which destination node restores the SSTables of which source node. The SSTables of a source node must be copied to the corresponding /keyspace/column_family/upload directories on the destination node.
    There must be (number_of_nodes) independent go-routines, one per destination node, each copying SSTables from its corresponding source node (see the sketch after this list).
    If the corresponding destination directory doesn't exist, then one of the prerequisites is not met (the schema has not been restored). The service should return an error informing that the keyspace/column family is missing.
  • refresh SSTables
    After the SSTables are copied to the upload directories, SM must call the Scylla REST endpoint /storage_service/sstables/{keyspace}
    on each and every node, for every keyspace and column family that was copied to that node.
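
Below is a rough Go sketch of the copy-data and refresh stages, assuming the hypothetical OneToOneRestoreTarget sketched above; TableDir, copySSTables and loadSSTables are placeholders for whatever SM's agent and Scylla REST clients actually provide, not real SM APIs:

```go
package restore

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// TableDir identifies a keyspace/column family to restore (illustrative type).
type TableDir struct {
	Keyspace string
	Table    string
}

// copySSTables stands in for the agent call that copies the source node's
// SSTables into <keyspace>/<table>/upload on the destination node.
func copySSTables(ctx context.Context, location, srcHostID, dstHostID string, t TableDir) error {
	return nil // placeholder
}

// loadSSTables stands in for the Scylla REST call to
// /storage_service/sstables/{keyspace} on the destination node.
func loadSSTables(ctx context.Context, dstHostID, keyspace, table string) error {
	return nil // placeholder
}

// restoreData runs one goroutine per destination node, each copying and then
// refreshing the SSTables of its mapped source node.
func restoreData(ctx context.Context, target OneToOneRestoreTarget, tables []TableDir) error {
	g, ctx := errgroup.WithContext(ctx)
	for src, dst := range target.NodeMappings {
		src, dst := src, dst // capture loop variables for the goroutine
		g.Go(func() error {
			for _, t := range tables {
				if err := copySSTables(ctx, target.Location, src, dst, t); err != nil {
					return fmt.Errorf("copy %s.%s from %s to %s: %w", t.Keyspace, t.Table, src, dst, err)
				}
				if err := loadSSTables(ctx, dst, t.Keyspace, t.Table); err != nil {
					return fmt.Errorf("refresh %s.%s on %s: %w", t.Keyspace, t.Table, dst, err)
				}
			}
			return nil
		})
	}
	return g.Wait()
}
```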
@Michal-Leszczynski
Collaborator

> The service must work on the following input:
> nodes mapping (map[string]string where the source node host ID is the key and the destination node host ID is the value)
> source-cluster-id (needed to find the snapshot)
> snapshot-id (identifies which snapshot of the source cluster to use to restore the data)
> backup location

SM can establish the node mapping on its own during the validation stage - we just look for nodes with the same token ownership.
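
A minimal sketch of that matching, assuming a hypothetical NodeTokens type with each node's host ID and sorted owned tokens (read from the manifests for the source cluster and from the ring of the destination cluster):

```go
package restore

import (
	"fmt"
	"strconv"
	"strings"
)

// NodeTokens describes a node and the tokens it owns (illustrative type);
// Tokens are assumed to be sorted.
type NodeTokens struct {
	HostID string
	Tokens []int64
}

// fingerprint turns a sorted token list into a comparable map key.
func fingerprint(tokens []int64) string {
	parts := make([]string, len(tokens))
	for i, t := range tokens {
		parts[i] = strconv.FormatInt(t, 10)
	}
	return strings.Join(parts, ",")
}

// buildNodeMapping pairs every source node with the destination node that owns
// exactly the same tokens, returning source host ID -> destination host ID.
func buildNodeMapping(src, dst []NodeTokens) (map[string]string, error) {
	byTokens := make(map[string]string, len(dst))
	for _, d := range dst {
		byTokens[fingerprint(d.Tokens)] = d.HostID
	}
	mapping := make(map[string]string, len(src))
	for _, s := range src {
		d, ok := byTokens[fingerprint(s.Tokens)]
		if !ok {
			return nil, fmt.Errorf("no destination node owns the same tokens as source node %s", s.HostID)
		}
		mapping[s.HostID] = d
	}
	return mapping, nil
}
```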

The source cluster ID is also not needed - we don't use it to find the snapshot in the regular restore. Not having it forces us to traverse the whole meta dir, which is annoying, but that dir shouldn't be big.

@Michal-Leszczynski
Collaborator

Also, the 1-1 restore is for data only and the schema is supposed to be restored in the regular way, right?

@Michal-Leszczynski
Collaborator

But why do we want to introduce a separate service for that? Shouldn't it be a part of the restore service, perhaps just a new method like Restore1To1? Or a flag to the regular restore?

@karol-kokoszka
Collaborator Author

> SM can establish node mapping on its own during validation stage - we just look for the nodes with the same token ownership.

Why? It's easier to expect this input, especially since the prerequisite for a 1-1 restore with SM is to clone the cluster.

> Source cluster ID is also not needed - we don't need it for finding the snapshot in the regular restore, although it forces us to traverse the whole meta dir, which is annoying, but it shouldn't be big.

We can traverse it, but why, if the user can just provide the source cluster ID? Especially since the path in the backup location, which is part of the backup specification, already requires the cluster ID.
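
For illustration only, the manifest location already embeds the source cluster ID; the layout below approximates SM's backup layout and is not authoritative:

```go
package restore

import "path"

// RemoteManifestDir illustrates that the manifest location embeds the source
// cluster ID. The path shape is an approximation, shown only to make the point.
func RemoteManifestDir(clusterID, dc, nodeID string) string {
	return path.Join("backup", "meta", "cluster", clusterID, "dc", dc, "node", nodeID)
}
```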

@karol-kokoszka
Collaborator Author

> But why do we want to introduce a separate service for that? Shouldn't it be a part of the restore service, perhaps just a new method like Restore1To1? Or a flag to the regular restore?

Because it would create spaghetti code with logic separating the flow of the load&stream restore from the flow of the 1-1 restore.
A service is a good processing unit that encapsulates the logic of the 1-1 restore.

@karol-kokoszka
Collaborator Author

> Also, the 1-1 restore is for data only and the schema is supposed to be restored in the regular way, right?

Yes, the prerequisite here is to restore the schema first.

@Michal-Leszczynski
Collaborator

> Why so, it's easier to expect this input. Especially that the prerequisite for 1-1 restore with SM is to clone the cluster.

It could also be the very same cluster (backup -> truncate -> restore). It's still simple to construct such a mapping, but it would require fetching all host IDs from the cluster. I wouldn't assume that the cloud will be the only user of this API - if we expose it, it can be used.

In general, I would say that we should aim to minimize the number of API params when we can reliably compute the information on our own. It's always easier and more convenient to add an API param than to remove or modify one. Not to mention that it simplifies the UX (even if the user is the cloud).

> We can traverse, but why if the user can just provide the source-cluster id. Especially that path in backup location which is a part of the backup specification forces to use cluster-id.

The point is that we already work like that in the regular restore, so it might be surprising to require it here but not there. In general, though, this is a good direction, as it helps with the underlying issue of #3873 (snapshot tag conflict).
