feat: set defaults for ignoredUnrecoverableEvents operator config #1310

Closed
wants to merge 1 commit into from

Contributor

@mkuznyetsov mkuznyetsov commented Aug 22, 2024

What does this PR do?

Add the FailedScheduling event to the default ignoredUnrecoverableEvents list in the operator config.

(this PR is an alternative to #1306)

the relevant docs should also be updated:
https://eclipse.dev/che/docs/stable/administration-guide/configuring-machine-autoscaling/#_when_the_autoscaler_adds_a_new_node

What issues does this PR fix or reference?

#1280

Is it tested? How?

Create a workspace whose resource requests/limits exceed what the cluster can provide (modified samples/plain.yaml):

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: plain-devworkspace
spec:
  started: true
  routingClass: 'basic'
  template:
    components:
      - name: web-terminal
        container:
          image: quay.io/wto/web-terminal-tooling:next
          memoryRequest: 1000Gi
          memoryLimit: 1000Gi
          mountSources: true
          command:
           - "tail"
           - "-f"
           - "/dev/null"

Check the workspace status. The workspace keeps trying to start until it times out after 5 minutes:

$ kdw get dw
NAME                 DEVWORKSPACE ID             PHASE    INFO
plain-devworkspace   workspace8e15dba59ab04607   Failed   DevWorkspace failed to progress past step 'Waiting for workspace deployment' for longer than timeout (5m). Ignored events: Detected unrecoverable event FailedScheduling: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod...

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

openshift-ci bot commented Aug 22, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkuznyetsov
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Collaborator

@AObuchow AObuchow left a comment

Thanks for the PR @mkuznyetsov :)
The format CI check is currently failing. Please run make fmt, and make sure goimports is installed first: go install golang.org/x/tools/cmd/goimports@latest

Some thoughts:

I think there are 3 important cases to test:

  1. Is the FailedScheduling event ignored by default? Your current test case covers this.
  2. Can users remove the FailedScheduling event from the ignoredUnrecoverableEvents list? In my testing, this is possible by setting ignoredUnrecoverableEvents to an empty array ([]). However, just adding ignoredUnrecoverableEvents: with no value won't work. To test this, run kubectl edit dwoc -n $NAMESPACE:

The following works:

apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.49.2.nip.io
    defaultRoutingClass: basic
  workspace:
+    ignoredUnrecoverableEvents: []
    imagePullPolicy: Always
    progressTimeout: 60s
kind: DevWorkspaceOperatorConfig

The following will not work:

apiVersion: controller.devfile.io/v1alpha1
config:
  routing:
    clusterHostSuffix: 192.168.49.2.nip.io
    defaultRoutingClass: basic
  workspace:
+    ignoredUnrecoverableEvents:
    imagePullPolicy: Always
    progressTimeout: 60s
kind: DevWorkspaceOperatorConfig

IMO, this behaviour is acceptable.

  3. What happens when we add an extra ignoredUnrecoverableEvent? Does it merge the user-provided event(s) with the default event list (which contains FailedScheduling), or does it overwrite the default list with the user-provided list?

Since the DWOC CR doesn't currently show that the FailedScheduling event is being ignored, I would expect it to overwrite the default list with the user-provided list.

However, merging the default event list with the user-provided list might make sense if we also use Kubebuilder annotations to set the default value at the CR level.
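
The cases above come down to two questions: how an unset field differs from an explicit empty list, and whether user-provided events replace or extend the defaults. The following is a minimal Go sketch of both options. The helper names `resolveIgnoredEvents` and `mergeIgnoredEvents` and the variable `defaultIgnoredEvents` are hypothetical and are not taken from the DWO code base. It also assumes that a missing or valueless field decodes to a nil slice while `[]` decodes to a non-nil empty slice, which would match the behaviour observed in testing.

```go
package main

import "fmt"

// defaultIgnoredEvents stands in for the operator's built-in default list
// (hypothetical name, for illustration only).
var defaultIgnoredEvents = []string{"FailedScheduling"}

// resolveIgnoredEvents sketches overwrite semantics: a nil slice means
// "not set", so the defaults apply; any non-nil slice, including an empty
// one, replaces the defaults entirely.
func resolveIgnoredEvents(user []string) []string {
	if user == nil {
		return append([]string(nil), defaultIgnoredEvents...)
	}
	return append([]string{}, user...)
}

// mergeIgnoredEvents sketches the alternative: the union of the defaults and
// the user-provided events, deduplicated, with the defaults listed first.
func mergeIgnoredEvents(user []string) []string {
	seen := map[string]bool{}
	merged := []string{}
	for _, list := range [][]string{defaultIgnoredEvents, user} {
		for _, ev := range list {
			if !seen[ev] {
				seen[ev] = true
				merged = append(merged, ev)
			}
		}
	}
	return merged
}

func main() {
	fmt.Println(resolveIgnoredEvents(nil))                     // field not set
	fmt.Println(resolveIgnoredEvents([]string{}))              // explicit []
	fmt.Println(resolveIgnoredEvents([]string{"FailedMount"})) // overwrite
	fmt.Println(mergeIgnoredEvents([]string{"FailedMount"}))   // merge
}
```

Note that with merge semantics as sketched here, an empty user list can no longer remove FailedScheduling, so case 2 would need a different opt-out mechanism.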

// if a transient cluster issue is triggering false-positives (for example, if
// the cluster occasionally encounters FailedScheduling events). Events listed
// here will not trigger DevWorkspace failures.
// be ignored when deciding to fail a DevWorkspace startup.

I'm not entirely sure we need to mention the cluster autoscaler in DWO (or rewrite the docs here). It might be better to mention this in the Che Cluster CRD documentation, since ignoredUnrecoverableEvents can be configured from the Che Cluster CRD.

Instead, I would suggest:

  • Mentioning "By default, the FailedScheduling event is ignored"
  • Removing the "(for example, if the cluster occasionally encounters FailedScheduling events)", since this example is no longer valid now that the FailedScheduling event is ignored by default

// For example, a FailedScheduling event, that occurs when workspace cannot start
// due to exceeding available resources, should not fail the workspace startup, if there is
// an autoscaler configured on the cluster, and we want to wait until it provisions additional resources.
// FailedScheduling event can also occur as a false-positive, as a result of a transient cluster issue.

I suggest experimenting with kubebuilder annotations for the IgnoredUnrecoverableEvents field.

We should try setting the default array value. I think this would be done with +kubebuilder:default:={"FailedScheduling"}

I believe that should be enough to populate the IgnoredUnrecoverableEvents list in the DWOC. Make sure you regenerate the CRDs in a separate commit by running: make update_devworkspace_api update_devworkspace_crds generate_all
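
As a rough illustration of the suggestion, the sketch below places the marker on a trimmed-down stand-in type. `WorkspaceConfig` here is a simplified assumption, not the full DWO type, and the field name and JSON tag are assumed from the config key used in this thread. The marker is only a comment read by the CRD generator, so Go code that decodes a config with the field absent still sees a nil slice. Any default would have to come from the generated CRD schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WorkspaceConfig is a simplified stand-in for the operator's workspace
// config type; only the field under discussion is shown.
type WorkspaceConfig struct {
	// IgnoredUnrecoverableEvents lists event reasons that should not fail a
	// DevWorkspace startup. By default, the FailedScheduling event is ignored.
	//
	// +kubebuilder:default:={"FailedScheduling"}
	IgnoredUnrecoverableEvents []string `json:"ignoredUnrecoverableEvents,omitempty"`
}

// decode parses a JSON document into a WorkspaceConfig.
func decode(raw string) (WorkspaceConfig, error) {
	var cfg WorkspaceConfig
	err := json.Unmarshal([]byte(raw), &cfg)
	return cfg, err
}

func main() {
	// Field absent: the marker has no runtime effect, so the slice stays nil.
	unset, _ := decode(`{}`)
	fmt.Println(unset.IgnoredUnrecoverableEvents == nil)

	// Explicit empty list: decoded as a non-nil, zero-length slice.
	empty, _ := decode(`{"ignoredUnrecoverableEvents": []}`)
	fmt.Println(empty.IgnoredUnrecoverableEvents != nil, len(empty.IgnoredUnrecoverableEvents))
}
```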

Something to note: This entire PR might be dropped and re-implemented in Che-Operator if we can get the kubebuilder approach working. We'd want Che admins to see that the FailedScheduling event is ignored by default, and there would be no advantage to duplicating this code change in both DWO and Che-Operator (unless users who use DWO in isolation want this feature; however, that is not the current reason why we're resolving #1280).
