1-1 restore service implementation #4199

Open
karol-kokoszka opened this issue Jan 13, 2025 · 7 comments

@karol-kokoszka
Collaborator

This issue is about creating a full service that performs a 1-1 restore of vnode keyspaces/column families.
Design doc: https://docs.google.com/document/d/18jhhNo90JWy6fIgPCks3RI1zhHLI7eBLKpHGJRBmdpM/edit?tab=t.0#bookmark=id.e6nd5pbe9bjq

The service must work on the following input:

  • nodes mapping (map[string]string where the source node host ID is the key and the destination node host ID is the value)
  • source-cluster-id (needed to find the snapshot)
  • snapshot-id (identifies which snapshot of the source cluster to use to restore the data)
  • backup location
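
A minimal Go sketch of this input, assuming a hypothetical OneToOneRestoreTarget type (the type and field names are illustrative, not an existing SM API):

```go
package restore

// OneToOneRestoreTarget is a hypothetical input for the 1-1 restore service;
// the type and field names are illustrative only.
type OneToOneRestoreTarget struct {
	// NodeMappings maps source node host ID -> destination node host ID.
	NodeMappings map[string]string `json:"node_mappings"`
	// SourceClusterID identifies the cluster that produced the backup
	// (used to locate the snapshot in the backup location).
	SourceClusterID string `json:"source_cluster_id"`
	// SnapshotTag identifies which snapshot of the source cluster to restore.
	SnapshotTag string `json:"snapshot_tag"`
	// Location is the backup location, e.g. "s3:my-backup-bucket".
	Location string `json:"location"`
}
```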

The restore process consists of the following stages:

  • validation
    The service must check if agents can reach the backup location.
    The service must verify that the cluster topology saved in the snapshot matches the destination cluster topology.
    The service must verify that the source cluster's token ring (read from the manifest files) matches the token ring of the destination cluster.
    This process can be split into two parts (reading the source cluster topology and reading the destination cluster topology).
    If validation fails, the service must stop and return meaningful error information.
  • copy-data
    The nodes mapping determines which destination node restores the SSTables of which source node. The SSTables of a source node must be copied to the corresponding /keyspace/column_family/upload directories on the destination node.
    There must be (number_of_nodes) independent go-routines, one per destination node, each copying SSTables from its corresponding source node (see the sketch after this list).
    If the corresponding destination directory doesn't exist, then one of the prerequisites is not met (the schema has not been restored). The service should return an error informing that the keyspace/column family is missing.
  • refresh SSTables
    After the SSTables are copied to the upload directories, SM must call the Scylla REST endpoint /storage_service/sstables/{keyspace}
    on each and every node, for every keyspace and column family that was copied to that node.
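
Below is a rough Go sketch of the copy-data and refresh stages, assuming the hypothetical OneToOneRestoreTarget sketched above; TableDir, copySSTables and loadSSTables are placeholders for whatever SM's agent and Scylla REST clients actually provide, not real SM APIs:

```go
package restore

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// TableDir identifies a keyspace/column family to restore (illustrative type).
type TableDir struct {
	Keyspace string
	Table    string
}

// copySSTables stands in for the agent call that copies the source node's
// SSTables into <keyspace>/<table>/upload on the destination node.
func copySSTables(ctx context.Context, location, srcHostID, dstHostID string, t TableDir) error {
	return nil // placeholder
}

// loadSSTables stands in for the Scylla REST call to
// /storage_service/sstables/{keyspace} on the destination node.
func loadSSTables(ctx context.Context, dstHostID, keyspace, table string) error {
	return nil // placeholder
}

// restoreData runs one goroutine per destination node, each copying and then
// refreshing the SSTables of its mapped source node.
func restoreData(ctx context.Context, target OneToOneRestoreTarget, tables []TableDir) error {
	g, ctx := errgroup.WithContext(ctx)
	for src, dst := range target.NodeMappings {
		src, dst := src, dst // capture loop variables for the goroutine
		g.Go(func() error {
			for _, t := range tables {
				if err := copySSTables(ctx, target.Location, src, dst, t); err != nil {
					return fmt.Errorf("copy %s.%s from %s to %s: %w", t.Keyspace, t.Table, src, dst, err)
				}
				if err := loadSSTables(ctx, dst, t.Keyspace, t.Table); err != nil {
					return fmt.Errorf("refresh %s.%s on %s: %w", t.Keyspace, t.Table, dst, err)
				}
			}
			return nil
		})
	}
	return g.Wait()
}
```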
@Michal-Leszczynski
Collaborator

> The service must work on the following input:
> nodes mapping (map[string]string where the source node host ID is the key and the destination node host ID is the value)
> source-cluster-id (needed to find the snapshot)
> snapshot-id (identifies which snapshot of the source cluster to use to restore the data)
> backup location

SM can establish the node mapping on its own during the validation stage - we just look for nodes with the same token ownership.
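
A minimal sketch of that matching, assuming a hypothetical NodeTokens type with each node's host ID and sorted owned tokens (read from the manifests for the source cluster and from the ring of the destination cluster):

```go
package restore

import (
	"fmt"
	"strconv"
	"strings"
)

// NodeTokens describes a node and the tokens it owns (illustrative type);
// Tokens are assumed to be sorted.
type NodeTokens struct {
	HostID string
	Tokens []int64
}

// fingerprint turns a sorted token list into a comparable map key.
func fingerprint(tokens []int64) string {
	parts := make([]string, len(tokens))
	for i, t := range tokens {
		parts[i] = strconv.FormatInt(t, 10)
	}
	return strings.Join(parts, ",")
}

// buildNodeMapping pairs every source node with the destination node that owns
// exactly the same tokens, returning source host ID -> destination host ID.
func buildNodeMapping(src, dst []NodeTokens) (map[string]string, error) {
	byTokens := make(map[string]string, len(dst))
	for _, d := range dst {
		byTokens[fingerprint(d.Tokens)] = d.HostID
	}
	mapping := make(map[string]string, len(src))
	for _, s := range src {
		d, ok := byTokens[fingerprint(s.Tokens)]
		if !ok {
			return nil, fmt.Errorf("no destination node owns the same tokens as source node %s", s.HostID)
		}
		mapping[s.HostID] = d
	}
	return mapping, nil
}
```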

The source cluster ID is also not needed - we don't use it to find the snapshot in the regular restore. Not having it forces us to traverse the whole meta dir, which is annoying, but that dir shouldn't be big.

@Michal-Leszczynski
Collaborator

Also, the 1-1 restore is for data only and the schema is supposed to be restored in the regular way, right?

@Michal-Leszczynski
Collaborator

But why do we want to introduce a separate service for that? Shouldn't it be a part of the restore service, perhaps just a new method like Restore1To1? Or a flag to the regular restore?

@karol-kokoszka
Collaborator Author

> SM can establish node mapping on its own during validation stage - we just look for the nodes with the same token ownership.

Why? It's easier to expect this input, especially since the prerequisite for a 1-1 restore with SM is to clone the cluster.

> Source cluster ID is also not needed - we don't need it for finding the snapshot in the regular restore, although it forces us to traverse the whole meta dir, which is annoying, but it shouldn't be big.

We can traverse it, but why, if the user can just provide the source cluster ID? Especially since the path in the backup location, which is part of the backup specification, already requires the cluster ID.
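
For illustration only, the manifest location already embeds the source cluster ID; the layout below approximates SM's backup layout and is not authoritative:

```go
package restore

import "path"

// RemoteManifestDir illustrates that the manifest location embeds the source
// cluster ID. The path shape is an approximation, shown only to make the point.
func RemoteManifestDir(clusterID, dc, nodeID string) string {
	return path.Join("backup", "meta", "cluster", clusterID, "dc", dc, "node", nodeID)
}
```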

@karol-kokoszka
Collaborator Author

> But why do we want to introduce a separate service for that? Shouldn't it be a part of the restore service, perhaps just a new method like Restore1To1? Or a flag to the regular restore?

Because it would create spaghetti code with logic separating the flow of the load&stream restore from the flow of the 1-1 restore.
A service is a good processing unit that encapsulates the logic of the 1-1 restore.

@karol-kokoszka
Collaborator Author

> Also, the 1-1 restore is for data only and the schema is supposed to be restored in the regular way, right?

Yes, the prerequisite here is to restore the schema first.

@Michal-Leszczynski
Collaborator

> Why so, it's easier to expect this input. Especially that the prerequisite for 1-1 restore with SM is to clone the cluster.

It could also be the very same cluster (backup -> truncate -> restore). It's still simple to construct such a mapping, but it would require fetching all host IDs from the cluster. I wouldn't assume that the cloud will be the only user of this API - if we expose it, it can be used.

In general, I would say that we should aim to minimize the number of API params when we can reliably compute the information on our own. It's always easier and more convenient to add an API param than to remove or modify one. Not to mention that it simplifies the UX (even if the user is the cloud).

> We can traverse, but why if the user can just provide the source-cluster id. Especially that path in backup location which is a part of the backup specification forces to use cluster-id.

The point is that we already work like that in the regular restore, so it might be surprising to require it here but not there. In general, though, this is a good direction, as it helps with the underlying issue of #3873 (snapshot tag conflict).
