generate DRA job configs from a Jinja template #34010

Open
bart0sh wants to merge 4 commits into master from PR060-generate-job-configs

bart0sh
Contributor

@bart0sh bart0sh commented Dec 19, 2024

  • Implemented job configs generation
  • added make rules to generate and verify generated jobs
  • generated DRA canary jobs

/cc @pohly @kannon92 @SergeyKanzhelev @haircommander

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/config Issues or PRs related to code in /config size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 19, 2024
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from 7c3a83c to 2f75bbd Compare December 19, 2024 13:37
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch 3 times, most recently from d030275 to c5999e8 Compare December 19, 2024 14:13
[ci-node-e2e-cgrpv1-crio-dra]
job_type = pr
description = Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1
cluster = k8s-infra-prow-build

Contributor

Is there a reason for not using the eks-prow-build-cluster?

If not, then cluster can go to DEFAULT.
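
For context, here is a minimal sketch of what moving `cluster` to DEFAULT could look like, assuming the .conf file is read with Python's `configparser` (the thread does not show how gen.py parses it). The inline sample config and the `job_type = kind` value are illustrative, not taken from the PR.

```python
import configparser

# Sample config modelled on the snippet under review; values are illustrative.
SAMPLE_CONF = """
[DEFAULT]
cluster = eks-prow-build-cluster

[ci-node-e2e-cgrpv1-crio-dra]
job_type = node

[ci-kind-dra-all]
job_type = kind
cluster = k8s-infra-prow-build
"""


def load_jobs(text: str) -> dict[str, dict[str, str]]:
    """Return one settings dict per job section, with DEFAULT values merged in."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # sections() skips DEFAULT; each section view still exposes DEFAULT keys.
    return {name: dict(parser[name]) for name in parser.sections()}


jobs = load_jobs(SAMPLE_CONF)
for name, settings in jobs.items():
    print(name, "->", settings["cluster"])
```

A job section only needs to mention `cluster` when it deviates from the default, which keeps the per-job sections short.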

Contributor Author

The only reason is that they are configured this way in the current job. I'll get rid of the cluster variable and use eks-prow-build-cluster for all jobs.

BTW, there is a difference in the kind jobs:

@@ -80,20 +74,15 @@
         command:
         - runner.sh
         args:
-        - /bin/bash
+        - /bin/sh
         - -xc
-        - |
-          set -ex
-          make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test"
-          curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind
-          kind build node-image --image=dra/node:latest .
-          trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT
-          # Which DRA features exist can change over time.
-          features=( $(grep '"DRA' pkg/features/kube_features.go | sed 's/.*"\(.*\)"/\1/') )
-          echo "Enabling DRA feature(s): ${features[*]}."
-          # Those additional features are not in kind.yaml, but they can be added at the end.
-          kind create cluster --retain --config <(cat test/e2e/dra/kind.yaml; for feature in ${features}; do echo "  ${feature}: true"; done) --image dra/node:latest
-          KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=1h hack/ginkgo-e2e.sh -ginkgo.label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Alpha, Beta, DynamicResourceAllocation$(for feature in ${features}; do echo , ${feature}; done)} && !Flaky && !Slow"
+        - >
+          make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test" &&
+          curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind &&
+          kind build node-image --image=dra/node:latest . &&
+          trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT &&
+          kind create cluster --retain --config test/e2e/dra/kind.yaml --image dra/node:latest &&
+          KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=2h30m hack/ginkgo-e2e.sh -ginkgo.label-filter='Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky'

Is it possible to use the same arguments for both? If so, which one?

Contributor

I unified that in #33993 with an if check:

if ${with_all_features:-false}; then
  # Which DRA features exist can change over time.
  features=( $(grep '"DRA' pkg/features/kube_features.go | sed 's/.*"\(.*\)"/\1/') )
  echo "Enabling DRA feature(s): ${features[*]}."
  # Those additional features are not in kind.yaml, but they can be added at the end.
  kind create cluster --retain --config <(cat test/e2e/dra/kind.yaml; for feature in ${features}; do echo "  ${feature}: true"; done) --image dra/node:latest
  KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=1h hack/ginkgo-e2e.sh -ginkgo.label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Alpha, Beta, DynamicResourceAllocation$(for feature in ${features}; do echo , ${feature}; done)} && !Flaky && !Slow"
else
  kind create cluster --retain --config test/e2e/dra/kind.yaml --image dra/node:latest
  KUBERNETES_PROVIDER=local KUBECONFIG=${HOME}/.kube/config GINKGO_PARALLEL_NODES=8 E2E_REPORT_DIR=${ARTIFACTS} GINKGO_TIMEOUT=2h30m hack/ginkgo-e2e.sh -ginkgo.label-filter='Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky'
fi
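
To make the all-features branch easier to follow, here is a rough Python rendering of the `grep | sed` feature discovery and of how the label filter is assembled. It is only an illustration: the sample source lines stand in for pkg/features/kube_features.go, and the helper names are hypothetical. One difference is deliberate: in bash, `${features}` without an index expands to the first array element only, while this sketch iterates over every discovered feature.

```python
import re

# Stand-in for pkg/features/kube_features.go; the lines are sample input only.
SAMPLE_SOURCE = """
    DRAAdminAccess featuregate.Feature = "DRAAdminAccess"
    DRAResourceClaimDeviceStatus featuregate.Feature = "DRAResourceClaimDeviceStatus"
    DynamicResourceAllocation featuregate.Feature = "DynamicResourceAllocation"
"""


def dra_features(source: str) -> list[str]:
    """Mimic: grep '"DRA' FILE | sed 's/.*"\\(.*\\)"/\\1/' plus shell word splitting."""
    features: list[str] = []
    for line in source.splitlines():
        if '"DRA' not in line:  # grep '"DRA'
            continue
        # sed keeps only the text between the last pair of double quotes.
        replaced = re.sub(r'.*"(.*)"', r"\1", line, count=1)
        features.extend(replaced.split())  # array assignment splits on whitespace
    return features


def label_filter(features: list[str]) -> str:
    """Build the ginkgo label filter used when all DRA features are enabled."""
    extra = "".join(f", {feature}" for feature in features)
    return (
        "Feature: containsAny DynamicResourceAllocation && "
        "Feature: isSubsetOf { Alpha, Beta, DynamicResourceAllocation" + extra + "} "
        "&& !Flaky && !Slow"
    )


features = dra_features(SAMPLE_SOURCE)
print("Enabling DRA feature(s):", " ".join(features))
print(label_filter(features))
```

The third sample line is skipped because the pattern `"DRA` only matches feature names that start with DRA.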

Contributor Author

Thanks, applied. PTAL.

# on a kind cluster with containerd updated to a version with CDI support.
#
# Compared to ci-kind-dra, this one enables all DRA-related features.
[ci-kind-dra-all]

Contributor

Can we make it so that we have common settings for normal periodics, normal presubmits, and canary presubmits?

There's still going to be a lot of duplication if we have to have three copies of this section and the ones below.

The same applies to the actual .jinja template. The entries in the periodics and presubmits should be built from a single source.

Contributor Author

Thank you. This makes sense. Will do.

Contributor Author

Updated. gen.py now generates three files (dynamic-resource-allocation-canary.yaml, dynamic-resource-allocation-pull.yaml and dynamic-resource-allocation-ci.yaml) from dynamic-resource-allocation.conf and dynamic-resource-allocation.jinja.

PTAL.
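
The naming scheme described above can be sketched as follows. This is an assumption about how the three output names relate to the config name, not the actual gen.py code; the function names are hypothetical.

```python
from pathlib import Path

# The three kinds of generated files mentioned in the comment above.
KINDS = ("canary", "pull", "ci")


def template_path(conf_path: str) -> Path:
    """The Jinja template sits next to the config and shares its base name."""
    return Path(conf_path).with_suffix(".jinja")


def output_paths(conf_path: str) -> dict[str, Path]:
    """Map each kind to '<base>-<kind>.yaml' in the same directory as the config."""
    conf = Path(conf_path)
    return {kind: conf.with_name(f"{conf.stem}-{kind}.yaml") for kind in KINDS}


conf = "config/jobs/kubernetes/sig-node/dynamic-resource-allocation.conf"
print(template_path(conf))
for kind, path in output_paths(conf).items():
    print(kind, "->", path)
```

Deriving every path from the config name means a single .conf argument is enough to locate both the template and all generated files.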

@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch 3 times, most recently from 3259e4d to 499379c Compare December 20, 2024 12:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 20, 2024
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from 499379c to 2e1e253 Compare December 20, 2024 15:08
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 20, 2024
Contributor

@pohly pohly left a comment

Looks very promising.

How to handle indentation was my biggest concern when thinking about how to use Jinja. I am not sure whether this is addressed here (I need to check the test results).

# limitations under the License.

.PHONY: generate
generate-jobs:

Contributor

Doesn't match.

job_type = node
description = Runs E2E node tests for Dynamic Resource Allocation beta features with CRI-O using cgroup v1
testgrid_dashboards = sig-node-cri-o, sig-node-dynamic-resource-allocation
skip_report = false

Contributor

Is there any job with skip_report = true? I don't think this needs to be configurable.

Contributor Author

$ git grep -B5 'skip_report: true'
sig-node-presubmit.yaml-  - name: pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
sig-node-presubmit.yaml-    cluster: k8s-infra-prow-build
sig-node-presubmit.yaml-    optional: true
sig-node-presubmit.yaml-    always_run: false
sig-node-presubmit.yaml-    run_if_changed: 'test/e2e/node/pod_resize.go|pkg/kubelet/kubelet.go|pkg/kubelet/kubelet_pods.go|pkg/kubelet/kuberuntime/kuberuntime_manager.go'
sig-node-presubmit.yaml:    skip_report: true

Contributor

I meant, "for our jobs". We should only make those things configurable which we need to be configurable - it'll be shorter and more readable.

Contributor Author

done

testgrid_dashboards = sig-node-cri-o, sig-node-dynamic-resource-allocation
skip_report = false
image_config_file = /home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1-serial.yaml
inject_ssh_public_key = true

Contributor

Same here: this can depend on the job type in the template.

Contributor Author

I don't think it can. Not all presubmit jobs have this. As far as I remember, it depends on the distro/image.

{%- if "containerd" in job_name %}
{%- set testgrid_dashboards = testgrid_dashboards + ", sig-node-containerd" %}
{%- endif %}
- name: {{job_name}}

Contributor

So the indentation is the same for both periodics and presubmits?

The test bot seems to be stuck, but I suspect that a YAML linter would complain about that.

Contributor Author

yep, fortunately the indentation is the same for presubmits and periodics:

presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-containerd-gce
periodics:
  # This jobs runs e2e.test with a focus on tests for the Dynamic Resource Allocation feature (currently beta)
  # on a kind cluster with containerd updated to a version with CDI support.
  - name: ci-kind-dra

Contributor

No, if the lists were indented in the canonical way, it would be:

periodics:
- name: ci-kind-dra

YAML doesn't care, but there are style checkers which might.

Contributor Author

Well, I took that snippet from the existing YAML.
And CI doesn't complain about wrong indentation for this file.

Contributor

Right. It runs yamllint, but that doesn't care, so we are good.

Contributor Author

As a last resort, we can reindent in gen.py if it's really needed. It will be a little bit ugly, though.
I suspect/hope that periodic and presubmit configs have the same indentation level on purpose.

Contributor

From a YAML perspective, the nesting level is different.

I don't remember where anymore, but there are other jobs where the indentation is different, which is very annoying when copy-pasting from presubmit to periodic or vice versa. That made me think that it's enforced. It's not, so it indeed makes much more sense to use the same indentation even if it's not "quite right" for periodics.

testgrid-tab-name: {{job_name}}
description: {{description}}
testgrid-alert-email: {{testgrid_alert_email}}
fork-per-release: "true"

Contributor

Canaries shouldn't get forked.

Contributor Author

done

@@ -0,0 +1,115 @@
{%- if beginning %}

Contributor

Can this file be moved into a templates directory, as in kops?

When I look at the PR sidebar, I currently see four files with the identical dynamic-resource-allocation... as name. Even if we shorten that to dra-, keeping the source file separate would make it stand out more.

Contributor Author

Sure, it can be moved. Should I move the .conf file as well?

Personally, I'd prefer a flat structure with shorter names, e.g.
dra.conf
dra.jinja
dra-canary.yaml
dra-pull.yaml
dra-ci.yaml

And I hope that this approach can be used for all sig-node jobs and the final list of files will be something like this:
jobs.conf
jobs.jinja
jobs-canary.yaml
jobs-pull.yaml
jobs-ci.yaml

@bart0sh
Contributor Author

bart0sh commented Dec 20, 2024

@pohly @kannon92 @SergeyKanzhelev @haircommander

Looks very promising.

Thank you. After addressing the review comments, I'm going to remove the -pull and -ci YAML files from this PR, so that we test only -canary.
It would be great if SIG Node folks would look at this and confirm that this approach is at least acceptable.

I personally like it. Using it would allow us to

  • have a presubmit job for every periodic
  • keep them synchronized
  • easily generate canary jobs for testing purposes (e.g. kubetest2)
  • make fewer mistakes, as job configs are generated automatically
  • do less typing and copy-pasting :)
    etc.

WDYT guys?

@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch 2 times, most recently from 4234630 to 28eda1b Compare December 20, 2024 21:33
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from d1abaa3 to 708901f Compare January 2, 2025 09:57
@bart0sh
Contributor Author

bart0sh commented Jan 2, 2025

/retest

@bart0sh bart0sh marked this pull request as ready for review January 2, 2025 10:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2025
@bart0sh
Contributor Author

bart0sh commented Jan 2, 2025

@pohly @haircommander @kannon92 This PR is ready for review now.
It generates only DRA canary jobs and shouldn't break any existing jobs.
The idea is to enable generation of the pull and ci jobs in separate PRs.

@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch 3 times, most recently from 7298ac3 to d3a621e Compare January 2, 2025 11:19
@pohly
Contributor

pohly commented Jan 3, 2025

This PR is ready for review now.
It generates only DRA canary jobs and shouldn't break any existing jobs.

Can we make this PR complete (= generates everything) and then merge the generated canary jobs in advance via a second PR?

The advantage of this approach is:

  • we test the new generated jobs via the canaries without breaking anyone
  • we can merge this PR without any changes once we have that assurance

@bart0sh
Contributor Author

bart0sh commented Jan 3, 2025

@pohly I was going to do it in 3 steps:

  • this PR doesn't break anyone and we can test canary jobs
  • enabling pull jobs as a separate PR to double check that we don't break existing pull and ci jobs
  • enabling ci jobs as a separate PR as a last step

Would it work for you this way?

@pohly
Contributor

pohly commented Jan 3, 2025

I think two PRs as I had proposed is simpler.

I'm not worried about breaking CI jobs: that has less impact than breaking a presubmit because only a few people will see the breakage.

@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from d3a621e to 95b8375 Compare January 4, 2025 11:33
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 4, 2025
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch 2 times, most recently from aa54da3 to cf7fa2d Compare January 4, 2025 11:52
@bart0sh bart0sh mentioned this pull request Jan 4, 2025
@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from cf7fa2d to 2ed5a55 Compare January 4, 2025 12:00
@bart0sh
Contributor Author

bart0sh commented Jan 4, 2025

@pohly

Can we make this PR complete (= generates everything) and then merge the generated canary jobs in advance via a second PR?

done: #34070

@bart0sh
Contributor Author

bart0sh commented Jan 4, 2025

@pohly BTW, current implementation makes canary job changes a bit awkward. We'll have to either complicate template with {%- if kind == "canary" %} blocks or create PRs without changes in the template and the config, as I did in #34070.

Does this make sense to you?

@pohly
Contributor

pohly commented Jan 4, 2025

There's a genuine unit test failure:

hack/generate-jobs.py:79:0: W0311: Bad indentation. Found 25 spaces, expected 24 (bad-indentation)

@pohly
Contributor

pohly commented Jan 4, 2025

BTW, current implementation makes canary job changes a bit awkward. We'll have to either complicate template with {%- if kind == "canary" %} blocks or create PRs without changes in the template and the config

If someone modifies the template locally and then submits only the updated canary YAML, it is impossible for others to review or replicate how it was generated. It may also be harder to verify that the changes for the canary jobs then get applied as tested to the actual jobs.

I think I prefer the approach with {%- if kind == "canary" %}. Moving experimental changes from the canary jobs to the production jobs should then consist only of removing those if checks and the else branch.

@elieser1101 can be our first guinea pig user of this approach for producing canary jobs which use kubetest2.

@bart0sh bart0sh force-pushed the PR060-generate-job-configs branch from 2ed5a55 to 1da3e45 Compare January 4, 2025 14:54
spec:
containers:
- image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241230-3006692a6f-master
- image: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241218-d4b51bc3e8-master

Contributor

Let's use the same image as on master.

- org: kubernetes
repo: kubernetes
base_ref: master
path_alias: k8s.io/kubernetes

Contributor

These extra_refs are missing in the new generated CI jobs.

make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test"
curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind
kind build node-image --image=dra/node:latest .
trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT
# Which DRA features exist can change over time.

Contributor

Can we keep the comments?

make WHAT="github.com/onsi/ginkgo/v2/ginkgo k8s.io/kubernetes/test/e2e/e2e.test"
curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind
kind build node-image --image=dra/node:latest .
trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT
# Which DRA features exist can change over time.
features=( $(grep '"DRA' pkg/features/kube_features.go | sed 's/.*"\(.*\)"/\1/') )
echo "Enabling DRA feature(s): ${features[*]}."

Contributor

And the debug output?

@@ -0,0 +1,120 @@
{%- if header %}{%- if kind == "ci" %}periodics:{%- else %}presubmits:

Contributor

Can this {%- if header %} be dropped?

Comment on lines +40 to +41
{%- endif %}
{%- if job_type == "node" %}

Contributor

Suggested change
{%- endif %}
{%- if job_type == "node" %}

Redundant if check?

curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C "${PATH%%:*}/" kind
kind build node-image --image=dra/node:latest .
trap 'kind export logs "${ARTIFACTS}/kind"; kind delete cluster' EXIT
{%- if kind == "canary" %}

Contributor

Why does this check "canary"?

This seems to overload the meaning of "canary":

  • For canary PR jobs (= dra-canary.yaml).
  • The "all features enabled" CI and presubmit jobs.

Let's use "canary" for "part of dra-canary.yaml" and something else for feature gates. How about "alpha"?
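
One way to picture the separation proposed here: the output kind and the "all features" switch become two independent inputs to the template. The sketch below assumes a boolean `alpha` key in the .conf file and `configparser` parsing; both are hypothetical and only follow the naming suggested in this comment.

```python
import configparser

# Illustrative config: `alpha` is a hypothetical per-job switch, off by default.
SAMPLE_CONF = """
[DEFAULT]
alpha = false

[ci-kind-dra]

[ci-kind-dra-all]
alpha = true
"""

# "kind" only says which generated file a job entry goes into.
KINDS = ("canary", "pull", "ci")


def template_variables(text: str) -> list[dict]:
    """Produce one variable set per (job section, kind) pair."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    variables = []
    for section in parser.sections():
        alpha = parser.getboolean(section, "alpha")
        for kind in KINDS:
            # The feature switch comes from the job, never from the kind.
            variables.append({"section": section, "kind": kind, "alpha": alpha})
    return variables


for entry in template_variables(SAMPLE_CONF):
    print(entry)
```

With this split, a template check on `kind == "canary"` stays reserved for experimental changes, while a check on `alpha` selects the all-features variant in every generated file.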


echo "Verifying generated jobs"
hack/run-in-python-container.sh \
python3 hack/generate-jobs.py config/jobs/kubernetes/sig-node/*.conf --only-verify

Contributor

We can start with only working on "our" (= SIG Node) jobs here for now. But in a future PR this should get extended to other generated jobs.
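
For readers unfamiliar with a verify-only mode, the idea behind a flag like --only-verify can be sketched as follows: render the expected content, compare it with what is checked in, and report a diff instead of writing. This is an assumption about the behavior, not the code in hack/generate-jobs.py; the function name is hypothetical.

```python
import difflib
import tempfile
from pathlib import Path


def diff_against_disk(path: Path, generated: str) -> list[str]:
    """Return unified-diff lines between the file on disk and freshly generated text.

    An empty list means the checked-in file is up to date.
    """
    on_disk = path.read_text() if path.exists() else ""
    return list(
        difflib.unified_diff(
            on_disk.splitlines(keepends=True),
            generated.splitlines(keepends=True),
            fromfile=str(path),
            tofile=f"{path} (generated)",
        )
    )


# Small demonstration with a temporary file standing in for a generated job config.
with tempfile.TemporaryDirectory() as tmp:
    job_file = Path(tmp) / "dra-canary.yaml"
    job_file.write_text("presubmits:\n  kubernetes/kubernetes:\n")
    stale = diff_against_disk(job_file, "presubmits:\n  kubernetes/kubernetes: []\n")
    print("".join(stale) or "up to date")
```

A verify step built this way can exit non-zero whenever any diff is non-empty, which is what lets CI catch hand-edited or stale generated files.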

Labels
area/config Issues or PRs related to code in /config area/jobs area/testgrid cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: PRs - Needs Reviewer
4 participants