
refactor: add fallback/recovery additions [prototype] #223

Closed
wants to merge 7 commits

Conversation

cwaldren-ld (Contributor)

This PR attempts to demonstrate how fallback/recovery conditions might be implemented in FDv2.

The main idea is that the SDK can fall back to a secondary synchronizer, and then recover to the primary synchronizer at a later time.

Some challenges I've identified when implementing this PR:

  1. Conditions require accurate status information to be evaluated correctly. For example, a condition that executes "when the data source is interrupted for 5 seconds" needs to know that the status is "interrupted" and that it has been interrupted for 5 seconds. So these need to be accessible to the condition, and it's important they be set correctly.

  2. We'll need some way of specifying the conditions in our configuration system. We can start off with simple conditions like: fallback { unhealthy_duration: duration } / recover { healthy_duration: duration, unhealthy_duration: duration}, and specify some reasonable defaults.

  3. The current data sources aren't set up to be restarted. That is, they expect to be started and then closed; they aren't designed to have Sync(..) called multiple times. This means we can't be sure they are safe for the way I'm using them in this PR.

    • A better design would be a Run(ctx context.Context, selector) method that is synchronous from the perspective of the Data System algorithm. When the function returns, we know for sure that the data source is closed, and we can simply call Run again (see the sketch after this list). Another option is to keep the existing Sync/Close pattern, but make sure it is safe for repeated use (that is, once Close is called, it is safe to call Sync again).
  4. The existing closeWhenReady channel pattern is difficult to use in a world with more than one data source (and data sources that can be restarted). I've added some thoughts about that in the code comments.
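
Below is a minimal sketch of the restartable design from item 3. The names (restartableSynchronizer, Selector) are placeholders for illustration, not code in this PR:

// Selector is a stand-in for whatever FDv2 uses to resume from a known
// payload version; the real type isn't shown here.
type Selector struct{}

// A restartable synchronizer: Run blocks until the data source stops (error,
// context cancellation, or clean shutdown). Because Run is synchronous from
// the data system's perspective, restarting the source is just a matter of
// calling Run again. Assumes the standard "context" import.
type restartableSynchronizer interface {
    Run(ctx context.Context, sel Selector) error
}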

@cwaldren-ld cwaldren-ld requested a review from a team as a code owner December 11, 2024 00:22
@cwaldren-ld cwaldren-ld marked this pull request as draft December 11, 2024 00:22
@@ -72,13 +72,10 @@ func (r *pollingRequester) Request() (*fdv2proto.ChangeSet, error) {
r.loggers.Debug("Polling LaunchDarkly for feature flag updates")
}

body, cached, err := r.makeRequest(endpoints.PollingRequestPath)
cwaldren-ld (Contributor Author), Dec 11, 2024:

This is a hack because our current data sources are started/stopped rather than being re-instantiated. The issue is that if we had an error response (like a malformed payload or just an HTTP error), we'd get back an empty changeset (.NoChanges()) with no error (nil).

This means that if the previous Data Source Status was something like VALID due to a previous synchronizer, and then we start up polling and it gets the same response as it did last time, we wouldn't update the state to INTERRUPTED based on this error.

If instead the data source were re-instantiated, there would be no "previous state" for this new run of the data source. So we'd get the error, update the status, and then get the error again (cached) and not update the status - but that would be correct from the data system's point of view, since nothing has changed.

@@ -287,6 +287,7 @@ func (sp *StreamProcessor) consumeStream(stream *es.Stream, closeWhenReady chan<
sp.setInitializedAndNotifyClient(true, closeWhenReady)

default:
processedEvent = false
cwaldren-ld (Contributor Author):

This seems incorrect in both the fdv1 and fdv2 sources. If we get an unrecognized event, then we don't want to set the data source state to valid. That would clear any existing error.

Member:

I'm wondering if this shouldn't be a tri-bool and then we have a little different handling, where:

  1. We only set it to true if we have processed a valid event.
  2. Failure to decode an expected event would set this to false.
  3. An unknown event type would be ignored, with no change to this value, as a sort of forward-compatible guard against new event types we want to add.
  4. We only execute line 294 if it is true, and avoid the false and new third (null?) state.

cwaldren-ld (Contributor Author):

Seems like a good idea.

enum ProcessingState {
    EVENT_DECODED,
    EVENT_IGNORED,
    EVENT_MALFORMED
}

or similar.
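
In Go that might look something like the following (names hypothetical, not the actual implementation):

type processingState int

const (
    eventDecoded   processingState = iota // a known event type, parsed successfully
    eventIgnored                          // an unknown event type; ignored for forward compatibility
    eventMalformed                        // a known event type that failed to decode
)

Only eventDecoded would drive the "mark the data source valid" path; eventIgnored would leave the current status untouched, and eventMalformed would record/report the error.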

@@ -111,6 +112,19 @@ func NewFDv2(disabled bool, cfgBuilder subsystems.ComponentConfigurer[subsystems
fdv2.primarySync = cfg.Synchronizers.Primary
fdv2.secondarySync = cfg.Synchronizers.Secondary
fdv2.disabled = disabled
cwaldren-ld (Contributor Author):

The power of conditions is the chaining I've shown here. We can have an arbitrary number of conditions and hook them up, which I can see being useful in the (far) future. For now, it'd probably be fine to define some preset conditions with a couple of knobs.

fdv2.fallbackCond = func(status interfaces.DataSourceStatus) bool {
interruptedAtRuntime := status.State == interfaces.DataSourceStateInterrupted && time.Since(status.StateSince) > 1*time.Minute
cannotInitialize := status.State == interfaces.DataSourceStateInitializing && time.Since(status.StateSince) > 10*time.Second
healthyForTooLong := status.State == interfaces.DataSourceStateValid && time.Since(status.StateSince) > 30*time.Second
cwaldren-ld (Contributor Author):

healthyForTooLong is an interesting one. We want this in the recoveryCond in order to prevent the SDK from using the secondary for too long - presumably we want to switch back to the primary because it is more efficient/better for [reasons].

I put it in the fallbackCond to cause a flip-flop pattern for demo purposes. We probably wouldn't actually want that condition in there. Although, it could be useful in a chaos monkey sense - every so often, check that your backup is functioning.
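
On the chaining point, a small combinator could make these conditions composable; this is just an illustration with an assumed helper, not something in this PR:

// anyOf ORs a set of conditions together, so preset conditions can be chained
// into a larger fallback/recovery rule without changing the evaluation loop.
func anyOf(conds ...func(interfaces.DataSourceStatus) bool) func(interfaces.DataSourceStatus) bool {
    return func(status interfaces.DataSourceStatus) bool {
        for _, c := range conds {
            if c(status) {
                return true
            }
        }
        return false
    }
}

A preset like "interrupted for more than a minute" would then just be one such function, combined with others as needed.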

@@ -302,13 +383,17 @@ func (f *FDv2) Offline() bool {
}

//nolint:revive // DataSourceStatusReporter method.
-func (f *FDv2) UpdateStatus(status interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
+func (f *FDv2) UpdateStatus(state interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
cwaldren-ld (Contributor Author):

The equivalent function in fdv1 is here: https://github.com/launchdarkly/go-server-sdk/blob/v7/internal/datasource/data_source_update_sink_impl.go#L157

The minimal implementation I wrote here is for demo purposes. We may need to adopt the other one to be backwards compatible.
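
For reference, a minimal version might look roughly like this; the statusLock and status fields are assumptions for the sketch, not the prototype's actual fields:

// Hypothetical minimal UpdateStatus: only bump StateSince when the state
// actually changes, and keep the most recent non-empty error. A production
// version would also notify status listeners, as the linked fdv1
// implementation does via its broadcaster.
func (f *FDv2) UpdateStatus(state interfaces.DataSourceState, err interfaces.DataSourceErrorInfo) {
    f.statusLock.Lock()
    defer f.statusLock.Unlock()
    if state != f.status.State {
        f.status = interfaces.DataSourceStatus{State: state, StateSince: time.Now(), LastError: err}
    } else if err.Kind != "" {
        f.status.LastError = err
    }
}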

// In the FDv2 world, we have the possibility that a synchronizer fails or we fall back to a secondary synchronizer.
// Perhaps we've already closed the channel, and now a new synchronizer is attempting to do the same.
//
// In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" channel that is passed
Member:

Suggested change:
- // In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" channel that is passed
+ // In that case, we need to guarantee that the channel is closed only once. To do this, we "wrap" the channel that is passed
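
One way to get the "close only once" guarantee is a small wrapper around the channel; this is only a sketch with assumed names:

// closeOnce wraps the closeWhenReady channel so that multiple synchronizers
// (or a restarted one) can all signal readiness without panicking on a double
// close. Uses the standard library "sync" package.
type closeOnce struct {
    once sync.Once
    ch   chan<- struct{}
}

// Close closes the underlying channel the first time it is called and is a
// no-op afterwards.
func (c *closeOnce) Close() {
    c.once.Do(func() { close(c.ch) })
}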

}

func (f *FDv2) evaluateCond(ctx context.Context, cond func(status interfaces.DataSourceStatus) bool) error {
ticker := time.NewTicker(10 * time.Second)
Member:

This hard-coded 10-second timer is what limits the resolution of the fallback/recovery conditions, right?

cwaldren-ld (Contributor Author), Dec 12, 2024:

Correct. I could see an alternative of making this event-triggered, where we'd have a "timer event" and a "data source status event" (and anything else that can be used as a condition).

But then we'd need to hold a map of timers, hook into the data source status broadcasters... it just doesn't seem worth the complexity compared to a predictable "tick" that polls whatever data is needed.
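
For reference, the rest of this polling loop might look roughly like the following; dataSourceStatus() is a placeholder for however the prototype reads the current status:

// Hypothetical shape of evaluateCond: re-check the condition on every tick and
// return nil once it fires, or the context's error if we're shut down first.
func (f *FDv2) evaluateCond(ctx context.Context, cond func(status interfaces.DataSourceStatus) bool) error {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if cond(f.dataSourceStatus()) { // placeholder accessor for the current status
                return nil
            }
        }
    }
}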

keelerm84 (Member):

Done in #242 instead. Feedback from this was adopted there.

So long @cwaldren-ld, and thanks for all the 🐟

@keelerm84 keelerm84 closed this Jan 29, 2025
@keelerm84 keelerm84 deleted the cw/sdk-941-fallback-algo branch January 29, 2025 16:32