
refactor: add fallback/recovery additions [prototype] #223

Closed
wants to merge 7 commits

Conversation

cwaldren-ld (Contributor)

This PR attempts to demonstrate how fallback/recovery conditions might be implemented in FDv2.

The main idea is that the SDK can fall back to a secondary synchronizer, and then recover to the primary synchronizer at a later time.

Some challenges I've identified when implementing this PR:

  1. Conditions require accurate status information to be evaluated correctly. For example, a condition that executes "when the data source is interrupted for 5 seconds" needs to know that the status is "interrupted" and that it has been interrupted for 5 seconds. So these need to be accessible to the condition, and it's important they be set correctly.

  2. We'll need some way of specifying the conditions in our configuration system. We can start off with simple conditions like: fallback { unhealthy_duration: duration } / recover { healthy_duration: duration, unhealthy_duration: duration}, and specify some reasonable defaults.

  3. The current data sources aren't set up to be restarted. That is, they expect to be started and then closed; they aren't designed to have Sync(..) called multiple times. This means we can't be sure they are safe for the way I'm using them in this PR.

    • A better design would be a Run(ctx context.Context, selector) method that is synchronous from the perspective of the Data System algorithm. When the function returns, we know for sure that the data source is closed, and we can simply call Run again (see the sketch after this list). Another option is to keep the existing Sync/Close pattern, but make sure it is safe for repeated use (that is, once Close is called, it is safe to call Sync again).
  4. The existing closeWhenReady channel pattern is difficult to use in a world with more than one data source (and data sources that can be restarted). I've added some thoughts about that in the code comments.
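
Below is a minimal sketch of the restartable design from item 3. The names (restartableSynchronizer, Selector) are placeholders for illustration, not code in this PR:

// Selector is a stand-in for whatever FDv2 uses to resume from a known
// payload version; the real type isn't shown here.
type Selector struct{}

// A restartable synchronizer: Run blocks until the data source stops (error,
// context cancellation, or clean shutdown). Because Run is synchronous from
// the data system's perspective, restarting the source is just a matter of
// calling Run again. Assumes the standard "context" import.
type restartableSynchronizer interface {
    Run(ctx context.Context, sel Selector) error
}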

@cwaldren-ld cwaldren-ld requested a review from a team as a code owner December 11, 2024 00:22
@cwaldren-ld cwaldren-ld marked this pull request as draft December 11, 2024 00:22
@@ -72,13 +72,10 @@ func (r *pollingRequester) Request() (*fdv2proto.ChangeSet, error) {
r.loggers.Debug("Polling LaunchDarkly for feature flag updates")
}

body, cached, err := r.makeRequest(endpoints.PollingRequestPath)
cwaldren-ld (Contributor Author), Dec 11, 2024:

This is a hack because our current data sources are started/stopped rather than being re-instantiated. The issue is that if we had an error response (like a malformed payload or just an HTTP error), we'd get back an empty changeset (.NoChanges()) with no error (nil).

This means that if the previous Data Source Status was something like VALID due to a previous synchronizer, and then we start up polling and it gets the same response as it did last time, we wouldn't update the state to INTERRUPTED based on this error.

If instead the data source were re-instantiated, there would be no "previous state" for this new run of the data source. So we'd get the error, update the status, and then get the error again (cached) and not update the status - but that would be correct from the data system's point of view, since nothing has changed.

@@ -287,6 +287,7 @@ func (sp *StreamProcessor) consumeStream(stream *es.Stream, closeWhenReady chan<
sp.setInitializedAndNotifyClient(true, closeWhenReady)

default:
processedEvent = false
cwaldren-ld (Contributor Author):

This seems incorrect in both the fdv1 and fdv2 sources. If we get an unrecognized event, then we don't want to set the data source state to valid. That would clear any existing error.

Member:

I'm wondering if this shouldn't be a tri-bool and then we have a little different handling, where:

  1. We only set it to true if we have processed a valid event.
  2. Failure to decode an expected event would set this to false.
  3. An unknown event type would be ignored, with no change to this value, as a sort of forward-compatible guard against new event types we want to add.
  4. We only execute line 294 if it is true, and avoid the false and new third (null?) state.

cwaldren-ld (Contributor Author):

Seems like a good idea.

enum ProcessingState {
    EVENT_DECODED,
    EVENT_IGNORED,
    EVENT_MALFORMED
}

or similar.
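
In Go that might look something like the following (names hypothetical, not the actual implementation):

type processingState int

const (
    eventDecoded   processingState = iota // a known event type, parsed successfully
    eventIgnored                          // an unknown event type; ignored for forward compatibility
    eventMalformed                        // a known event type that failed to decode
)

Only eventDecoded would drive the "mark the data source valid" path; eventIgnored would leave the current status untouched, and eventMalformed would record/report the error.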

@@ -111,6 +112,19 @@ func NewFDv2(disabled bool, cfgBuilder subsystems.ComponentConfigurer[subsystems
fdv2.primarySync = cfg.Synchronizers.Primary
fdv2.secondarySync = cfg.Synchronizers.Secondary
fdv2.disabled = disabled
cwaldren-ld (Contributor Author):

The power of conditions is the chaining I've shown here. We can have an arbitrary number of conditions and hook them up, which I can see being useful in the (far) future. For now, it'd probably be fine to define some preset conditions with a couple of knobs.

fdv2.fallbackCond = func(status interfaces.DataSourceStatus) bool {
interruptedAtRuntime := status.State == interfaces.DataSourceStateInterrupted && time.Since(status.StateSince) > 1*time.Minute
cannotInitialize := status.State == interfaces.DataSourceStateInitializing && time.Since(status.StateSince) > 10*time.Second
healthyForTooLong := status.State == interfaces.DataSourceStateValid && time.Since(status.StateSince) > 30*time.Second
cwaldren-ld (Contributor Author):

healthyForTooLong is an interesting one. We want this in the recoveryCond in order to prevent the SDK from using the secondary for too long - presumably we want to switch back to the primary because it is more efficient/better for [reasons].

I put it in the fallbackCond to cause a flip-flop pattern for demo purposes. We probably wouldn't actually want that condition in there. Although, it could be useful in a chaos monkey sense - every so often, check that your backup is functioning.
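
On the chaining point, a small combinator could make these conditions composable; this is just an illustration with an assumed helper, not something in this PR:

// anyOf ORs a set of conditions together, so preset conditions can be chained
// into a larger fallback/recovery rule without changing the evaluation loop.
func anyOf(conds ...func(interfaces.DataSourceStatus) bool) func(interfaces.DataSourceStatus) bool {
    return func(status interfaces.DataSourceStatus) bool {
        for _, c := range conds {
            if c(status) {
                return true
            }
        }
        return false
    }
}

A preset like "interrupted for more than a minute" would then just be one such function, combined with others as needed.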

@@ -302,13 +383,17 @@ func (f *FDv2) Offline() bool {
}

//nolint:revive // DataSourceStatusReporter method.
-func (f *FDv2) UpdateStatus(status interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
+func (f *FDv2) UpdateStatus(state interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
cwaldren-ld (Contributor Author):

The equivalent function in fdv1 is here: https://github.com/launchdarkly/go-server-sdk/blob/v7/internal/datasource/data_source_update_sink_impl.go#L157

The minimal implementation I wrote here is for demo purposes. We may need to adopt the other one to be backwards compatible.
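
For reference, a minimal version might look roughly like this; the statusLock and status fields are assumptions for the sketch, not the prototype's actual fields:

// Hypothetical minimal UpdateStatus: only bump StateSince when the state
// actually changes, and keep the most recent non-empty error. A production
// version would also notify status listeners, as the linked fdv1
// implementation does via its broadcaster.
func (f *FDv2) UpdateStatus(state interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
    f.statusLock.Lock()
    defer f.statusLock.Unlock()
    if state != f.status.State {
        f.status = interfaces.DataSourceStatus{State: state, StateSince: time.Now(), LastError: err}
    } else if err.Kind != "" {
        f.status.LastError = err
    }
}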

// In the FDv2 world, we have the possibility that a synchronizer fails or we fall back to a secondary synchronizer.
// Perhaps we've already closed the channel, and now a new synchronizer is attempting to do the same.
//
// In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" channel that is passed
Member:

Suggested change:
- // In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" channel that is passed
+ // In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" the channel that is passed
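
One way to get the "close only once" guarantee is a small wrapper around the channel; this is only a sketch with assumed names:

// closeOnce wraps the closeWhenReady channel so that multiple synchronizers
// (or a restarted one) can all signal readiness without panicking on a double
// close. Uses the standard library "sync" package.
type closeOnce struct {
    once sync.Once
    ch   chan<- struct{}
}

// Close closes the underlying channel the first time it is called and is a
// no-op afterwards.
func (c *closeOnce) Close() {
    c.once.Do(func() { close(c.ch) })
}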

}

func (f *FDv2) evaluateCond(ctx context.Context, cond func(status interfaces.DataSourceStatus) bool) error {
ticker := time.NewTicker(10 * time.Second)
Member:

This hard-coded 10-second timer is what limits the resolution of the fallback/recovery conditions, right?

cwaldren-ld (Contributor Author), Dec 12, 2024:

Correct. I could see an alternative of making this event-triggered, where we'd have a "timer event" and a "data source status event" (and anything else that can be used as a condition).

But then we'd need to hold a map of timers, hook into the data source status broadcasters... it just doesn't seem worth the complexity compared to a predictable "tick" that polls whatever data is needed.
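
For reference, the rest of this polling loop might look roughly like the following; dataSourceStatus() is a placeholder for however the prototype reads the current status:

// Hypothetical shape of evaluateCond: re-check the condition on every tick and
// return nil once it fires, or the context's error if we're shut down first.
func (f *FDv2) evaluateCond(ctx context.Context, cond func(status interfaces.DataSourceStatus) bool) error {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if cond(f.dataSourceStatus()) { // placeholder accessor for the current status
                return nil
            }
        }
    }
}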

keelerm84 (Member):

Done in #242 instead. Feedback from this was adopted there.

So long @cwaldren-ld, and thanks for all the 🐟

@keelerm84 keelerm84 closed this Jan 29, 2025
@keelerm84 keelerm84 deleted the cw/sdk-941-fallback-algo branch January 29, 2025 16:32