
Integrate Existing etcd Clusters with etcd-druid #957

Open
wants to merge 4 commits into base: master

Conversation

seshachalam-yv (Contributor)

How to categorize this PR?

/area control-plane
/area high-availability
/kind enhancement

What this PR does / why we need it:
This PR introduces the necessary changes to etcd-druid to enable seamless migration of existing etcd clusters into etcd-druid managed clusters. It includes updates to the API to support the existingCluster configuration, which allows clusters that are not initially managed by etcd-druid to be integrated without disruption.

  • New Documentation:
    A step-by-step guide is added to demonstrate how to migrate an unmanaged etcd cluster to an etcd-druid managed cluster. By leveraging the existingCluster feature, you can create an Etcd CR configured with your current cluster’s peer URLs and endpoints, prompting etcd-druid to join your existing members. Once the newly managed members are active, you can safely remove the old, unmanaged members. This approach offers a straightforward migration path with minimal downtime.

Which issue(s) this PR fixes:
Fixes #956

Special notes for your reviewer:

Release note:

Added the `existingCluster` configuration, enabling `etcd-druid` to join and manage previously unmanaged `etcd` clusters. By leveraging this feature, users can seamlessly migrate their existing `etcd` clusters to an `etcd-druid` managed setup without requiring a complete rebuild. 

- Introduced `ExistingCluster` struct in `EtcdConfig` to specify configuration for an existing etcd cluster.
- Added `MemberURLs` and `Endpoint` fields to `ExistingCluster` to define the URLs and endpoint of the existing etcd cluster members.
- Updated `prepareInitialCluster` function to include existing cluster member URLs.
- Modified `getBackupRestoreContainerCommandArgs` to append existing cluster endpoint to service endpoints.
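
For reference, the new API type described above would look roughly as follows — a consolidated sketch assembled from the diff excerpts quoted in the review comments below (the doc comments here are illustrative, not verbatim from the PR):

// ExistingCluster defines the configuration for an existing etcd cluster that
// the etcd-druid managed members should join.
type ExistingCluster struct {
    // MemberURLs maps an existing member (key semantics — name vs. ID — are
    // discussed in the review below) to its peer URLs,
    // e.g. "etcd-1" -> ["https://etcd-1:2380"].
    MemberURLs map[string][]string `json:"memberURLs"`
    // Endpoint specifies the client endpoint of the existing etcd cluster,
    // e.g. "https://etcd:2379".
    Endpoint string `json:"endpoint"`
}
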
@seshachalam-yv requested a review from a team as a code owner on December 25, 2024 at 11:07
@gardener-robot added the area/control-plane, area/high-availability, kind/enhancement, needs/review, and size/m labels on Dec 25, 2024
@gardener-robot-ci-2 added the reviewed/ok-to-test label on Dec 25, 2024
@gardener-robot-ci-1 added the needs/ok-to-test label and removed the reviewed/ok-to-test label on Dec 25, 2024
@seshachalam-yv (Contributor, Author)

/test pull-etcd-druid-unit

@acumino (Member)

acumino commented Jan 6, 2025

/assign

@shreyas-s-rao (Contributor) left a comment


@seshachalam-yv thanks a lot for adding this functionality! I personally love that druid can now "adopt" existing etcd clusters, and possibly also help with migrating cluster data from one etcd cluster to another without the use of backups and without downtime (useful for the gardener live CPM scenario). I have a few comments and doubts. Can you please address them? Thanks.


// ExistingCluster defines the configuration for an existing etcd cluster.
type ExistingCluster struct {
// MemberURLs specifies the URLs of the existing etcd cluster members.

The docstring does not give sufficient information about the structure of the MemberURLs field. What do the keys represent? Member names? Member IDs? Quite unclear.

Also, I feel it would be cleaner to rename MemberURLs to Members, which is a slice of ExistingClusterMember type.

type ExistingClusterMember struct {
    Name string
    ID string
    URLs []string
}

Something along these lines, to make the consumption clear and explicit. Using the map key to denote the member name/ID becomes confusing for the exact reason I mentioned above: it isn't intrinsically clear whether the key represents the member name or the ID.
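
If the API were shaped that way, the consumption in prepareInitialCluster (quoted later in this review) could become a straightforward iteration over the slice. A rough sketch, keeping the same initial-cluster formatting as the current PR code (existingCluster, builder, and the Members field are assumed names, not part of the PR as proposed):

// Sketch: building the initial-cluster entries from the suggested slice-based
// API (hypothetical; the PR as proposed iterates the MemberURLs map instead).
var parts []string
for _, m := range existingCluster.Members {
    parts = append(parts, fmt.Sprintf("%s=%s", m.Name, strings.Join(m.URLs, ",")))
}
builder.WriteString(strings.Join(parts, ",") + ",")
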

@@ -233,6 +233,22 @@ type EtcdConfig struct {
// ClientService defines the parameters of the client service that a user can specify
// +optional
ClientService *ClientService `json:"clientService,omitempty"`
// ExistingCluster specifies the configuration for an existing etcd cluster that the etcd-druid managed cluster will join.

Is it possible to make this field immutable? Once the Etcd spec has been created, Spec.Etcd.ExistingCluster can no longer be added or modified.

Also, once the new cluster has joined the existing cluster, will this field still be retained in the Etcd spec? If yes, then what's the behavior of etcd-druid after the user manually removes the old cluster members (assuming they wanted druid to "adopt" their existing cluster)? If Spec.Etcd.ExistingCluster still exists after the old members are removed, will druid still attempt to create the cluster with the old members as well, leading to a quorum loss since only 3 out of a total of 6 members will then exist?

I think we need some discussion/clarity on this behavior.
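
On the immutability point: one common way to enforce this in the Etcd update validation would be apimachinery's ValidateImmutableField. A rough sketch, assuming the druidv1alpha1 types from this PR; the helper name is hypothetical and not part of the PR:

import (
    apivalidation "k8s.io/apimachinery/pkg/api/validation"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateExistingClusterImmutable rejects any change to spec.etcd.existingCluster
// on update, including adding or removing it after creation.
func validateExistingClusterImmutable(newEtcd, oldEtcd *druidv1alpha1.Etcd) field.ErrorList {
    return apivalidation.ValidateImmutableField(
        newEtcd.Spec.Etcd.ExistingCluster,
        oldEtcd.Spec.Etcd.ExistingCluster,
        field.NewPath("spec", "etcd", "existingCluster"),
    )
}
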

// - https://etcd-1:2380
MemberURLs map[string][]string `json:"memberURLs"`
// Endpoint specifies the endpoint of the existing etcd cluster.
Endpoint string `json:"endpoint"`

Renaming to ClientEndpoint makes sense here, since just Endpoint seems ambiguous.

Comment on lines 461 to +462
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=https://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", serviceEndpoint))

--service-endpoints CLI flag is being passed twice, which seems incorrect and could lead to errors from etcdbr.

} else {
commandArgs = append(commandArgs, "--insecure-transport=true")
commandArgs = append(commandArgs, "--insecure-skip-tls-verify=true")
commandArgs = append(commandArgs, fmt.Sprintf("--endpoints=http://%s-local:%d", b.etcd.Name, b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=http://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", serviceEndpoint))

Same as above.
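
One way to address both occurrences would be to pass --service-endpoints exactly once, preferring the existing cluster's endpoint when it is set. A rough sketch for the HTTPS branch (illustrative only; whether the backup-restore sidecar should instead receive both endpoints is a design question for the PR author):

// Sketch: pass --service-endpoints exactly once, using the existing cluster's
// endpoint when configured and the druid client service otherwise.
svcEndpoint := fmt.Sprintf("https://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort)
if b.etcd.Spec.Etcd.ExistingCluster != nil {
    svcEndpoint = b.etcd.Spec.Etcd.ExistingCluster.Endpoint
}
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", svcEndpoint))
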


## Overview

`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, facilitating seamless migrations.

Suggested change
`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, facilitating seamless migrations.
`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, allowing etcd-druid to eventually "adopt" the etcd cluster.

### Single-Node `etcd` Cluster Migration

- **From Single-Node to Three-Node:** Transitioning from a single-node to a three-node etcd cluster can be achieved with zero downtime. However, once expanded to a multi-member configuration, reducing back to a single-node setup is not supported by `etcd-druid`.
- **From Single-Node to Single-Node:** Migrations of this nature typically involve some downtime. You are required to capture a snapshot using the [etcd-backup-restore tool](https://github.com/gardener/etcd-backup-restore/blob/master/docs/deployment/getting_started.md#taking-scheduled-snapshot), upload it to your designated backup bucket, and configure your new etcd CR to restore from this snapshot.

Why would there be downtime involved? The cluster gets scaled up from 1 to 2 members, and as long as both members are healthy, there's quorum. The user then removes the old member from the cluster via an etcdctl member remove call, which scales the cluster back down to 1 member, which still ensures quorum. I don't see why this would "typically involve some downtime". Yes, there is a possibility of quorum loss if one of the members becomes unhealthy during this migration process, but that's a possibility for a 3-node cluster migration too, if 3 members go down.

Can you please re-word this?
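
For completeness, the old-member removal that this flow relies on boils down to a single member-remove call against the combined cluster. A rough sketch using the etcd clientv3 API (the endpoint and member name are placeholders, not values from this PR):

// Sketch: once the druid-managed member is healthy, remove the old unmanaged
// member so the cluster shrinks back to the desired size.
cli, err := clientv3.New(clientv3.Config{
    Endpoints:   []string{"https://etcd:2379"}, // placeholder client endpoint
    DialTimeout: 5 * time.Second,
})
if err != nil {
    return err
}
defer cli.Close()
members, err := cli.MemberList(ctx)
if err != nil {
    return err
}
for _, m := range members.Members {
    if m.Name == "old-etcd-0" { // placeholder name of the unmanaged member
        if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
            return err
        }
    }
}
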

Comment on lines +71 to +77
etcd-0:
- http://etcd-0.etcd:2380
etcd-1:
- http://etcd-1.etcd:2380
etcd-2:
- http://etcd-2.etcd:2380
endpoint: http://etcd:2379

Previous example used https, while this uses http. Please correct it.


Collect crucial data about your current cluster configuration:

- **Member URLs:** These are the URLs used for peer communications among cluster members.

Can you please explain here whether the MemberURLs map keys represent member names, member IDs, or pod names? That makes it easier for consumption.

@gardener-robot added the needs/changes label on Jan 30, 2025
@shreyas-s-rao removed their assignment on Jan 30, 2025

if etcd.Spec.Etcd.ExistingCluster != nil {
for memberName, memberURLs := range etcd.Spec.Etcd.ExistingCluster.MemberURLs {
builder.WriteString(fmt.Sprintf("%s=%s,", memberName, strings.Trim(strings.Join(memberURLs, ","), ",")))

I think we need some guard rails around this, like validation to ensure that the cluster member names and endpoints are actually valid. I would not want the existing cluster to go into an irrecoverable or unusable state due to bad configuration.


And also a check to ensure that the existing cluster is actually up and running. If not, then druid will end up trying to add members to a non-existing or non-running cluster, and go into a futile loop.
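
Both concerns could be covered by a pre-flight check before druid starts rendering resources — validate each configured peer URL syntactically, then confirm the existing cluster is reachable via a MemberList call against the configured endpoint. A rough sketch (the function name is hypothetical and not part of this PR):

// checkExistingCluster is a hypothetical pre-flight check: it validates the
// configured peer URLs and verifies the existing cluster answers MemberList.
func checkExistingCluster(ctx context.Context, ec *druidv1alpha1.ExistingCluster) error {
    for member, urls := range ec.MemberURLs {
        for _, u := range urls {
            if _, err := url.ParseRequestURI(u); err != nil {
                return fmt.Errorf("member %q has an invalid peer URL %q: %w", member, u, err)
            }
        }
    }
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ec.Endpoint}, DialTimeout: 5 * time.Second})
    if err != nil {
        return err
    }
    defer cli.Close()
    if _, err := cli.MemberList(ctx); err != nil {
        return fmt.Errorf("existing cluster at %q is not reachable: %w", ec.Endpoint, err)
    }
    return nil
}
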

@shreyas-s-rao (Contributor)

@seshachalam-yv please add an e2e test for this, since it's an important feature that needs to be tested thoroughly.


Successfully merging this pull request may close these issues.

Support for Migrating Existing etcd Clusters to etcd-druid Managed Clusters