feat: enable drift detection + takeover experience in work applier #950

Open
wants to merge 16 commits into
base: main

michaelawyu
Contributor

@michaelawyu michaelawyu commented Nov 11, 2024

Description of your changes

This PR includes a new implementation of the work applier that enables the drift detection + takeover experience.

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • Integration tests

Special notes for your reviewer

Additional tests will be submitted separately.

@michaelawyu michaelawyu marked this pull request as draft November 11, 2024 17:12
@michaelawyu michaelawyu changed the title from "feat: enable drift detection + takeover experience in work applier [DRAFT]" to "feat: enable drift detection + takeover experience in work applier [full DRAFT]" Nov 11, 2024
pkg/controllers/workapplier/controller.go (outdated, resolved)
pkg/controllers/workapplier/utils.go (outdated, resolved)
//
// This check is done on the Work object scope, and is primarily added to address the case
// where duplicate objects might appear in a Fleet resource envelope and lead to unexpected
// behaviors. Duplication is a non-issue without Fleet resource envelopes, as the Fleet hub
Contributor

nit: This statement is not true since we apply override policies.

Contributor Author

Hi Ryan! Sorry for the confusion; this comment is meant to clarify that, at the moment, Fleet does not check enveloped objects for duplication (e.g., it is possible to place two objects with the same GVK/namespace/name in an envelope, possibly with different specs); one definition will overwrite the other.
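
To illustrate the duplication case described in this reply, here is a minimal sketch; it is not the code under review. The `objectKey` type and the `findDuplicates` helper are hypothetical names, and the sketch assumes each manifest in an envelope can be reduced to its GVK, namespace, and name. Any later manifest whose key has already been seen is flagged.

```go
package main

import "fmt"

// objectKey is a hypothetical identifier for a manifest:
// GVK plus namespace and name.
type objectKey struct {
	Group, Version, Kind string
	Namespace, Name      string
}

// findDuplicates returns the indices of manifests whose key was already
// seen earlier in the list. The first occurrence of a key is never reported.
func findDuplicates(keys []objectKey) []int {
	seen := make(map[objectKey]bool, len(keys))
	var dups []int
	for i, k := range keys {
		if seen[k] {
			dups = append(dups, i)
			continue
		}
		seen[k] = true
	}
	return dups
}

func main() {
	keys := []objectKey{
		{Group: "apps", Version: "v1", Kind: "Deployment", Namespace: "default", Name: "web"},
		{Version: "v1", Kind: "ConfigMap", Namespace: "default", Name: "cfg"},
		// Same GVK/namespace/name as index 0, possibly with a different spec.
		{Group: "apps", Version: "v1", Kind: "Deployment", Namespace: "default", Name: "web"},
	}
	fmt.Println("duplicate manifest indices:", findDuplicates(keys))
}
```

Using a comparable struct as the map key avoids building and parsing an identifier string, though a string key (as `wriStr` in the diff suggests) works equally well.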

checked[wriStr] = true

// Prepare the manifest conditions for the write-ahead process.
manifestCondForWA := prepareManifestCondForWA(wriStr, bundle.id, work.Generation, existingManifestCondQIdx, work.Status.ManifestConditions)
Contributor

What does "WA" stand for?

Contributor Author

Hi Ryan! It's for the write-ahead process.

pkg/controllers/workapplier/availability_tracker.go (outdated, resolved)
pkg/controllers/workapplier/controller.go (outdated, resolved)
pkg/controllers/workapplier/drift_detection_takeover.go (outdated, resolved)
klog.ErrorS(err, "Failed to decode the manifest", "ordinal", pieces, "work", klog.KObj(work))
bundle.applyErr = fmt.Errorf("failed to decode manifest: %w", err)
bundle.applyResTyp = ManifestProcessingApplyResultTypeDecodingErred
return
Contributor

This won't return any error to the caller, so some bundles will have no gvr/ManifestObj set. However, those fields are used extensively as pointers in the rest of the controller logic. Could this lead to a nil pointer panic?

Contributor

I see that there are checks like "bundle.applyErr != nil" in some places, but I am not sure they cover all cases. Maybe we can add a step after preProcessManifests to remove those bundles?

Contributor Author

Hi Ryan! Yeah, at the moment the applier skips the processing step for any bundle that has failed pre-processing. As for removing such bundles: since we need to report these manifests (those that cannot be decoded or are malformed) to users, removing them before processing would mean taking extra steps in the status refresh to re-incorporate them, which I fear would also be error-prone.

Would you prefer if I:

a) reorder the list of bundles after pre-processing so that bundles that failed pre-processing are placed at the back, then slice the array before processing to leave them out; or
b) check for a nil GVR/manifest object right before processing and return an unexpected error?
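
The two options above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code under review: the `bundle` struct is a trimmed-down, hypothetical stand-in (the `id` field and the `*string` GVR are simplifications), and `partitionBundles` and `checkBundle` are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// bundle is a hypothetical, trimmed-down stand-in for the applier's
// per-manifest bundle.
type bundle struct {
	id       int
	gvr      *string // nil if the manifest could not be decoded
	applyErr error   // non-nil if pre-processing failed
}

// partitionBundles sketches option (a): it stably moves bundles that failed
// pre-processing to the back and returns how many bundles remain at the
// front, i.e. those that are safe to process.
func partitionBundles(bundles []*bundle) int {
	sort.SliceStable(bundles, func(i, j int) bool {
		return bundles[i].applyErr == nil && bundles[j].applyErr != nil
	})
	n := 0
	for n < len(bundles) && bundles[n].applyErr == nil {
		n++
	}
	return n
}

// checkBundle sketches option (b): a guard run right before processing.
func checkBundle(b *bundle) error {
	if b.gvr == nil {
		return fmt.Errorf("bundle %d: unexpected nil GVR", b.id)
	}
	return nil
}

func main() {
	gvr := "apps/v1/deployments"
	bundles := []*bundle{
		{id: 0, gvr: &gvr},
		{id: 1, applyErr: errors.New("failed to decode manifest")},
		{id: 2, gvr: &gvr},
	}

	n := partitionBundles(bundles)
	for _, b := range bundles[:n] {
		if err := checkBundle(b); err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("processing bundle", b.id)
	}
	// The failed bundles stay in the slice, so status reporting can still see them.
	fmt.Println("bundles kept for status reporting only:", len(bundles)-n)
}
```

With option (a), the full slice still holds every bundle for status refreshing, while only `bundles[:n]` is handed to the processing step; option (b) can be layered on top as a defensive check.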

pkg/controllers/workapplier/utils.go (outdated, resolved)
}

inMemberClusterObjLastAppliedManifestObjHash := inMemberClusterObj.GetAnnotations()[fleetv1beta1.ManifestHashAnnotation]
return manifestObjHash == inMemberClusterObjLastAppliedManifestObjHash, nil
Contributor

why check if the manifest matches?

Contributor Author

Hi Ryan! If the manifest itself has been updated (i.e., a new version of the manifest has become available), there is no need to do drift detection anymore; we will simply apply the new version.
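
As a rough illustration of this check, the sketch below compares a hash of the desired manifest against a hash recorded on the live object. It rests on several assumptions: the annotation key is a placeholder (the diff uses `fleetv1beta1.ManifestHashAnnotation`), the SHA-256-over-JSON hashing scheme is a guess rather than Fleet's actual scheme, and `hashManifest` and `manifestUnchanged` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// manifestHashAnnotation is a placeholder key used only in this sketch.
const manifestHashAnnotation = "example.com/manifest-hash"

// hashManifest returns a hex-encoded SHA-256 digest of the manifest's JSON form.
// json.Marshal sorts map keys, so equal maps yield equal hashes.
func hashManifest(manifest map[string]any) (string, error) {
	raw, err := json.Marshal(manifest)
	if err != nil {
		return "", fmt.Errorf("failed to marshal manifest: %w", err)
	}
	return fmt.Sprintf("%x", sha256.Sum256(raw)), nil
}

// manifestUnchanged reports whether the desired manifest matches the hash
// recorded on the live object. If it does not match, a new manifest version
// is available and should simply be applied; drift detection only makes
// sense when the manifest itself is unchanged.
func manifestUnchanged(manifest map[string]any, liveAnnotations map[string]string) (bool, error) {
	h, err := hashManifest(manifest)
	if err != nil {
		return false, err
	}
	return h == liveAnnotations[manifestHashAnnotation], nil
}

func main() {
	v1 := map[string]any{"kind": "ConfigMap", "data": map[string]any{"k": "v1"}}
	h, err := hashManifest(v1)
	if err != nil {
		panic(err)
	}
	live := map[string]string{manifestHashAnnotation: h}

	same, _ := manifestUnchanged(v1, live)
	fmt.Println("manifest unchanged, run drift detection:", same)

	v2 := map[string]any{"kind": "ConfigMap", "data": map[string]any{"k": "v2"}}
	same, _ = manifestUnchanged(v2, live)
	fmt.Println("manifest unchanged, run drift detection:", same)
}
```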

@michaelawyu michaelawyu changed the title from "feat: enable drift detection + takeover experience in work applier [full DRAFT]" to "feat: enable drift detection + takeover experience in work applier" Dec 17, 2024
@michaelawyu michaelawyu marked this pull request as ready for review December 17, 2024 22:17
@@ -0,0 +1,34 @@
/*
Contributor

Is this function new?

Contributor Author

Hi Ryan! This is not a new function (well, the wrapper itself is, but all the code within it is from the original work applier).

Contributor Author

Strangely, even though the metric code is there, it has never run in the original work applier, as the label it checks hasn't been set (by the CRP controller) since we migrated to v1beta1. We will probably need to rewrite or re-implement this flow if the metric is still desired.

pkg/controllers/workapplier/suite_test.go (resolved)
2 participants