
Integrate Existing etcd Clusters with etcd-druid #957

Open
wants to merge 4 commits into base: master

Conversation

seshachalam-yv (Contributor)

How to categorize this PR?

/area control-plane
/area high-availability
/kind enhancement

What this PR does / why we need it:
This PR introduces the necessary changes to etcd-druid to enable seamless migration of existing etcd clusters into etcd-druid managed clusters. It includes updates to the API to support the existingCluster configuration, which allows clusters that are not initially managed by etcd-druid to be integrated without disruption.

  • New Documentation:
    A step-by-step guide is added to demonstrate how to migrate an unmanaged etcd cluster to an etcd-druid managed cluster. By leveraging the existingCluster feature, you can create an Etcd CR configured with your current cluster’s peer URLs and endpoints, prompting etcd-druid to join your existing members. Once the newly managed members are active, you can safely remove the old, unmanaged members. This approach offers a straightforward migration path with minimal downtime.

Which issue(s) this PR fixes:
Fixes #956

Special notes for your reviewer:

Release note:

Added the `existingCluster` configuration, enabling `etcd-druid` to join and manage previously unmanaged `etcd` clusters. By leveraging this feature, users can seamlessly migrate their existing `etcd` clusters to an `etcd-druid` managed setup without requiring a complete rebuild. 

- Introduced `ExistingCluster` struct in `EtcdConfig` to specify configuration for an existing etcd cluster.
- Added `MemberURLs` and `Endpoint` fields to `ExistingCluster` to define the URLs and endpoint of the existing etcd cluster members.
- Updated `prepareInitialCluster` function to include existing cluster member URLs.
- Modified `getBackupRestoreContainerCommandArgs` to append existing cluster endpoint to service endpoints.
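
For reference, the new API type described above would look roughly as follows — a consolidated sketch assembled from the diff excerpts quoted in the review comments below (the doc comments here are illustrative, not verbatim from the PR):

// ExistingCluster defines the configuration for an existing etcd cluster that
// the etcd-druid managed members should join.
type ExistingCluster struct {
    // MemberURLs maps an existing member (key semantics — name vs. ID — are
    // discussed in the review below) to its peer URLs,
    // e.g. "etcd-1" -> ["https://etcd-1:2380"].
    MemberURLs map[string][]string `json:"memberURLs"`
    // Endpoint specifies the client endpoint of the existing etcd cluster,
    // e.g. "https://etcd:2379".
    Endpoint string `json:"endpoint"`
}
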
@seshachalam-yv requested a review from a team as a code owner on December 25, 2024 at 11:07
@gardener-robot added the area/control-plane, area/high-availability, kind/enhancement, needs/review, and size/m labels on Dec 25, 2024
@gardener-robot-ci-2 added the reviewed/ok-to-test label on Dec 25, 2024
@gardener-robot-ci-1 added the needs/ok-to-test label and removed the reviewed/ok-to-test label on Dec 25, 2024
@seshachalam-yv (Contributor, Author)

/test pull-etcd-druid-unit

@acumino (Member)

acumino commented Jan 6, 2025

/assign

@shreyas-s-rao (Contributor) left a comment


@seshachalam-yv thanks a lot for adding this functionality! I personally love that druid can now "adopt" existing etcd clusters, and possibly also help with migrating cluster data from one etcd cluster to another without the use of backups and without downtime (useful for the gardener live CPM scenario). I have a few comments and doubts. Can you please address them? Thanks.


// ExistingCluster defines the configuration for an existing etcd cluster.
type ExistingCluster struct {
// MemberURLs specifies the URLs of the existing etcd cluster members.

The docstring does not give sufficient information about the structure of the MemberURLs field. What do the keys represent? Member names? Member IDs? Quite unclear.

Also, I feel it would be cleaner to rename MemberURLs to Members, which is a slice of ExistingClusterMember type.

type ExistingClusterMember struct {
    Name string
    ID string
    URLs []string
}

Something along these lines, to make the consumption clear and explicit. Using the map key to denote the member name/ID becomes confusing for the exact reason I mentioned above: it isn't intrinsically clear whether the key represents the member name or the ID.
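
If the API were shaped that way, the consumption in prepareInitialCluster (quoted later in this review) could become a straightforward iteration over the slice. A rough sketch, keeping the same initial-cluster formatting as the current PR code (existingCluster, builder, and the Members field are assumed names, not part of the PR as proposed):

// Sketch: building the initial-cluster entries from the suggested slice-based
// API (hypothetical; the PR as proposed iterates the MemberURLs map instead).
var parts []string
for _, m := range existingCluster.Members {
    parts = append(parts, fmt.Sprintf("%s=%s", m.Name, strings.Join(m.URLs, ",")))
}
builder.WriteString(strings.Join(parts, ",") + ",")
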

@@ -233,6 +233,22 @@ type EtcdConfig struct {
// ClientService defines the parameters of the client service that a user can specify
// +optional
ClientService *ClientService `json:"clientService,omitempty"`
// ExistingCluster specifies the configuration for an existing etcd cluster that the etcd-druid managed cluster will join.

Is it possible to make this field immutable? Once the Etcd spec has been created, Spec.Etcd.ExistingCluster can no longer be added or modified.

Also, once the new cluster has joined the existing cluster, will this field still be retained in the Etcd spec? If yes, then what's the behavior of etcd-druid after the user manually removes the old cluster members (assuming they wanted druid to "adopt" their existing cluster)? If Spec.Etcd.ExistingCluster still exists after the old members are removed, will druid still attempt to create the cluster with the old members as well, leading to a quorum loss since only 3 out of a total of 6 members will then exist?

I think we need some discussion/clarity on this behavior.
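
On the immutability point: one common way to enforce this in the Etcd update validation would be apimachinery's ValidateImmutableField. A rough sketch, assuming the druidv1alpha1 types from this PR; the helper name is hypothetical and not part of the PR:

import (
    apivalidation "k8s.io/apimachinery/pkg/api/validation"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateExistingClusterImmutable rejects any change to spec.etcd.existingCluster
// on update, including adding or removing it after creation.
func validateExistingClusterImmutable(newEtcd, oldEtcd *druidv1alpha1.Etcd) field.ErrorList {
    return apivalidation.ValidateImmutableField(
        newEtcd.Spec.Etcd.ExistingCluster,
        oldEtcd.Spec.Etcd.ExistingCluster,
        field.NewPath("spec", "etcd", "existingCluster"),
    )
}
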

// - https://etcd-1:2380
MemberURLs map[string][]string `json:"memberURLs"`
// Endpoint specifies the endpoint of the existing etcd cluster.
Endpoint string `json:"endpoint"`

Renaming to ClientEndpoint makes sense here, since just Endpoint seems ambiguous.

Comment on lines 461 to +462
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=https://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", serviceEndpoint))

--service-endpoints CLI flag is being passed twice, which seems incorrect and could lead to errors from etcdbr.

} else {
commandArgs = append(commandArgs, "--insecure-transport=true")
commandArgs = append(commandArgs, "--insecure-skip-tls-verify=true")
commandArgs = append(commandArgs, fmt.Sprintf("--endpoints=http://%s-local:%d", b.etcd.Name, b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=http://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort))
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", serviceEndpoint))

Same as above.
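
One way to address both occurrences would be to pass --service-endpoints exactly once, preferring the existing cluster's endpoint when it is set. A rough sketch for the HTTPS branch (illustrative only; whether the backup-restore sidecar should instead receive both endpoints is a design question for the PR author):

// Sketch: pass --service-endpoints exactly once, using the existing cluster's
// endpoint when configured and the druid client service otherwise.
svcEndpoint := fmt.Sprintf("https://%s:%d", druidv1alpha1.GetClientServiceName(b.etcd.ObjectMeta), b.clientPort)
if b.etcd.Spec.Etcd.ExistingCluster != nil {
    svcEndpoint = b.etcd.Spec.Etcd.ExistingCluster.Endpoint
}
commandArgs = append(commandArgs, fmt.Sprintf("--service-endpoints=%s", svcEndpoint))
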


## Overview

`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, facilitating seamless migrations.

Suggested change
`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, facilitating seamless migrations.
`etcd-druid` supports the integration of new etcd members into an existing `etcd` cluster through the `etcd.spec.etcd.existingCluster` configuration. This capability ensures that when new managed members are added, they automatically join the pre-existing cluster, allowing etcd-druid to eventually "adopt" the etcd cluster.

### Single-Node `etcd` Cluster Migration

- **From Single-Node to Three-Node:** Transitioning from a single-node to a three-node etcd cluster can be achieved with zero downtime. However, once expanded to a multi-member configuration, reducing back to a single-node setup is not supported by `etcd-druid`.
- **From Single-Node to Single-Node:** Migrations of this nature typically involve some downtime. You are required to capture a snapshot using the [etcd-backup-restore tool](https://github.com/gardener/etcd-backup-restore/blob/master/docs/deployment/getting_started.md#taking-scheduled-snapshot), upload it to your designated backup bucket, and configure your new etcd CR to restore from this snapshot.

Why would there be downtime involved? The cluster gets scaled up from 1 to 2 members, and as long as both members are healthy, there's quorum. The user then removes the old member from the cluster via an etcdctl member remove call, which scales the cluster back down to 1 member, which still ensures quorum. I don't see why this would "typically involve some downtime". Yes, there is a possibility of quorum loss if one of the members becomes unhealthy during this migration process, but that's a possibility for a 3-node cluster migration too, if 3 members go down.

Can you please re-word this?
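
For completeness, the old-member removal that this flow relies on boils down to a single member-remove call against the combined cluster. A rough sketch using the etcd clientv3 API (the endpoint and member name are placeholders, not values from this PR):

// Sketch: once the druid-managed member is healthy, remove the old unmanaged
// member so the cluster shrinks back to the desired size.
cli, err := clientv3.New(clientv3.Config{
    Endpoints:   []string{"https://etcd:2379"}, // placeholder client endpoint
    DialTimeout: 5 * time.Second,
})
if err != nil {
    return err
}
defer cli.Close()
members, err := cli.MemberList(ctx)
if err != nil {
    return err
}
for _, m := range members.Members {
    if m.Name == "old-etcd-0" { // placeholder name of the unmanaged member
        if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
            return err
        }
    }
}
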

Comment on lines +71 to +77
etcd-0:
- http://etcd-0.etcd:2380
etcd-1:
- http://etcd-1.etcd:2380
etcd-2:
- http://etcd-2.etcd:2380
endpoint: http://etcd:2379

Previous example used https, while this uses http. Please correct it.


Collect crucial data about your current cluster configuration:

- **Member URLs:** These are the URLs used for peer communications among cluster members.

Can you please explain here whether the MemberURLs map keys represent member names, member IDs, or pod names? That makes it easier for consumption.

@gardener-robot added the needs/changes label on Jan 30, 2025
@shreyas-s-rao removed their assignment on Jan 30, 2025

if etcd.Spec.Etcd.ExistingCluster != nil {
for memberName, memberURLs := range etcd.Spec.Etcd.ExistingCluster.MemberURLs {
builder.WriteString(fmt.Sprintf("%s=%s,", memberName, strings.Trim(strings.Join(memberURLs, ","), ",")))

I think we need some guard rails around this, like validation to ensure that the cluster member names and endpoints are actually valid. I would not want the existing cluster to go into an irrecoverable or unusable state due to bad configuration.


And also a check to ensure that the existing cluster is actually up and running. If not, then druid will end up trying to add members to a non-existing or non-running cluster, and go into a futile loop.
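
Both concerns could be covered by a pre-flight check before druid starts rendering resources — validate each configured peer URL syntactically, then confirm the existing cluster is reachable via a MemberList call against the configured endpoint. A rough sketch (the function name is hypothetical and not part of this PR):

// checkExistingCluster is a hypothetical pre-flight check: it validates the
// configured peer URLs and verifies the existing cluster answers MemberList.
func checkExistingCluster(ctx context.Context, ec *druidv1alpha1.ExistingCluster) error {
    for member, urls := range ec.MemberURLs {
        for _, u := range urls {
            if _, err := url.ParseRequestURI(u); err != nil {
                return fmt.Errorf("member %q has an invalid peer URL %q: %w", member, u, err)
            }
        }
    }
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ec.Endpoint}, DialTimeout: 5 * time.Second})
    if err != nil {
        return err
    }
    defer cli.Close()
    if _, err := cli.MemberList(ctx); err != nil {
        return fmt.Errorf("existing cluster at %q is not reachable: %w", ec.Endpoint, err)
    }
    return nil
}
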

@shreyas-s-rao (Contributor)

@seshachalam-yv please add an e2e test for this, since it's an important feature that needs to be tested thoroughly.


Successfully merging this pull request may close these issues.

Support for Migrating Existing etcd Clusters to etcd-druid Managed Clusters