Migrate CRI-O jobs away from `kubernetes_e2e.py` #32567

saschagrunert · 2024-05-06T09:04:12Z

The kubernetes_e2e.py script is deprecated and we should use kubetest2 instead.

All affected tests are listed in https://testgrid.k8s.io/sig-node-cri-o

cc @kubernetes/sig-node-cri-o-test-maintainers

Ref: https://github.com/kubernetes/test-infra/tree/master/scenarios, #20760


haircommander · 2024-05-06T13:57:33Z

/sig node

k8s-triage-robot · 2024-08-04T14:13:56Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

saschagrunert · 2024-08-05T07:01:42Z

/remove-lifecycle stale

kannon92 · 2024-08-21T17:20:38Z

/triage accepted
/priority important-longterm

elieser1101 · 2024-09-05T12:07:51Z

Does this still need help? Can I start looking at it?

saschagrunert · 2024-09-05T12:14:26Z

@elieser1101 I'd appreciate your eyes on that. 🙏

elieser1101 · 2024-09-05T12:33:50Z

/assign

bart0sh · 2024-12-13T23:15:50Z

Unfortunately, using a more powerful instance didn't change much for the imagefs job. I can still see the same error in the logs.

bart0sh · 2024-12-18T01:58:19Z

@elieser1101 I can see a lot of green kubetest2 jobs in the test grid. Is there anything that prevents replacing the kubernetes_e2e.py jobs with them? I did it for the splitfs and imagefs jobs, as I was involved in fixing them. I can do it for the rest of the jobs if needed.

elieser1101 · 2024-12-18T12:45:45Z

@bart0sh thank you very much for the splitfs/imagefs work, that was a great find.

What would come next is to validate that the kubetest2 are actually working. Meaning, I noticed that some of the jobs are completing but are skipping all the specs. We would like to ensure we are running the jobs properly before replacing the kubernetes_e2e.py jobs.
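One way to spot the "completed but skipped everything" case is to count specs in a job's JUnit report. The following is only a minimal sketch: it assumes the report uses the common JUnit XML layout (`testcase` elements with an optional `skipped`, `failure`, or `error` child), and the function names and the sample report are hypothetical, not taken from any job in this issue.

```python
import xml.etree.ElementTree as ET


def spec_counts(junit_xml: str) -> dict[str, int]:
    """Count specs that ran, were skipped, or failed in a JUnit XML string."""
    root = ET.fromstring(junit_xml)
    counts = {"ran": 0, "skipped": 0, "failed": 0}
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            counts["skipped"] += 1
            continue
        counts["ran"] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
    return counts


def nothing_ran(junit_xml: str) -> bool:
    """True when the report contains no spec that actually executed."""
    return spec_counts(junit_xml)["ran"] == 0


# Hypothetical report: the job is "green" only because every spec was skipped.
SAMPLE = """
<testsuites>
  <testsuite name="E2eNode Suite" tests="2" failures="0">
    <testcase name="spec A"><skipped/></testcase>
    <testcase name="spec B"><skipped/></testcase>
  </testsuite>
</testsuites>
"""

if __name__ == "__main__":
    print(spec_counts(SAMPLE))
    print("nothing ran:", nothing_ran(SAMPLE))
```

A job whose report makes `nothing_ran` return true should not be treated as a valid replacement, even if its status is green.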

At the moment I'm looking at the DRA ones, which were missing some kubetest2 features, and this

bart0sh · 2025-01-02T17:28:57Z

@elieser1101 pull-crio-cgroupv2-node-e2e-eviction-kubetest2 fails with "Context was cancelled (cause: suite timeout occurred) after 235.856s", which is quite strange, as I don't see this timeout specified anywhere. The corresponding non-kubetest2 test case has a longer timeout and passes, so this seems to be caused by kubetest2. Do you happen to know the reason? Have you seen this error in other job logs?

elieser1101 · 2025-01-02T17:49:47Z

@bart0sh I have not seen that before, but it seems like the kubetest2 job is missing the --timeout flag. We could try adding it.

bart0sh · 2025-01-03T15:14:19Z

@elieser1101 Thanks! I added the --timeout option to the job configs: #34067
However, it seems kubetest2 modifies its value. I ran the eviction job locally this way:

kubetest2-gce --test=node --down=false -- --parallelism=1 --gcp-zone=us-west1-a  --repo-root=. --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1.yaml --delete-instances=false --test-args='--container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"' --skip-regex='' --focus-regex='\[NodeFeature:Eviction\]' --timeout 300m

And it runs ginkgo this way:

I0103 13:18:22.467151  250332 node_e2e.go:195] Starting tests on "test-fedora-coreos-41-20241122-3-0-gcp-x86-64"
I0103 13:18:22.467281  250332 ssh.go:146] Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /home/ed/.ssh/google_compute_engine [email protected] -- sudo /bin/bash -c 'cd /tmp/node-e2e-20250103T131740 && set -o pipefail; timeout -k 30s 18000.000000s ./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  --no-color -v --timeout=180m ./e2e_node.test -- --system-spec-name= --system-spec-file= --extra-envs= --runtime-config= --v 4 --node-name=test-fedora-coreos-41-20241122-3-0-gcp-x86-64 --report-dir=/tmp/node-e2e-20250103T131740/results --report-prefix=fedora --image-description="fedora-coreos-41-20241122-3-0-gcp-x86-64" --kubelet-flags="--cluster-domain=cluster.local" --dns-domain="cluster.local" --prepull-images=false  --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}" 2>&1 | tee -i /tmp/node-e2e-20250103T131740/results/test-fedora-coreos-41-20241122-3-0-gcp-x86-64-ginkgo.log']

So, kubetest2 changes --timeout 300m to ginkgo's --timeout=180m for some reason. Do you have any idea why?

aojea · 2025-01-03T15:44:04Z

actually there are two timeouts there

./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  --no-color -v --timeout=180m

It seems one of them is added in https://github.com/kubernetes/kubernetes/blob/master/hack/make-rules/test-e2e-node.sh

EDIT

@bart0sh you are not passing the flag to kubetest2, IIUC; it has to be added before the --
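To make the two timeouts easier to see in a long log line, the ginkgo invocation can be scanned for every `-timeout`/`--timeout` token. This is only an illustrative sketch: `timeout_flags` is a hypothetical helper, and it does not establish which of the two values ginkgo actually honors.

```python
import shlex


def timeout_flags(command: str) -> list[str]:
    """Return every -timeout/--timeout flag found in a command line, in order."""
    tokens = shlex.split(command)
    found = []
    for i, tok in enumerate(tokens):
        if not tok.startswith("-"):
            continue  # skips the `timeout` binary itself and plain arguments
        name = tok.lstrip("-").split("=", 1)[0]
        if name != "timeout":
            continue
        if "=" in tok:
            found.append(tok)
        elif i + 1 < len(tokens):
            found.append(f"{tok} {tokens[i + 1]}")
    return found


# Shortened copy of the ginkgo invocation quoted in this thread.
CMD = (
    r'timeout -k 30s 18000.000000s ./ginkgo -timeout=24h '
    r'-focus="\[NodeFeature:Eviction\]"  --no-color -v --timeout=180m ./e2e_node.test'
)

if __name__ == "__main__":
    print(timeout_flags(CMD))
```

Note that the leading `timeout -k 30s 18000.000000s` is a separate process-level limit and is deliberately not reported by this helper.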

bart0sh · 2025-01-03T18:19:34Z

@aojea
> you are not passing the flag to kubetest2 IIUIC , it has to be added before the --

I'm not passing it to kubetest2 because kubetest2 doesn't have this flag:

$ kubetest2 gce --help 2>&1 |grep timeout
      --boskos-acquire-timeout-seconds int      How long (in seconds) to hang on a request to Boskos to acquire a resource before erroring. (default 300)

And I tested the fix, btw.
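To make explicit which side of the `--` separator a flag lands on, the command line can be split at the first bare `--`. This is a sketch under the assumption that arguments before `--` go to the kubetest2 deployer and arguments after it go to the tester; `split_at_double_dash` is a hypothetical helper and the argument list is a shortened copy of the command above.

```python
def split_at_double_dash(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv at the first bare `--` into (before, after)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return argv, []


# Shortened version of the local invocation quoted earlier in this thread.
ARGV = [
    "kubetest2-gce", "--test=node", "--down=false", "--",
    "--parallelism=1", "--gcp-zone=us-west1-a", "--repo-root=.",
    r"--focus-regex=\[NodeFeature:Eviction\]", "--timeout", "300m",
]

if __name__ == "__main__":
    before, after = split_at_double_dash(ARGV)
    print("before --:", before)
    print("after  --:", after)
    print("--timeout is after the separator:", "--timeout" in after)
```

In that invocation `--timeout` sits after the separator, which matches the observation that `kubetest2 gce --help` does not list it.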

aojea · 2025-01-03T18:32:51Z

> I'm not passing it to kubetest2 because kubetest2 doesn't have this flag:

Isn't it this one?

https://github.com/kubernetes-sigs/kubetest2/blob/22d5b1410bef09ae679fa5813a5f0d196b6079de/pkg/testers/node/node.go#L73

or are these changes not for e2e-node?

bart0sh · 2025-01-03T19:11:02Z

They are for e2e-node, but I couldn't use --timeout for kubetest2 when I ran it manually. Am I missing something obvious here?

BTW, here are the job logs before and after adding the --timeout option to the job configuration. You can see there how the value of ginkgo's --timeout option has changed to 180m for some reason.

elieser1101 · 2025-01-03T19:34:26Z

> So, kubetest2 changes --timeout 300m to ginkgo's --timeout=180m for some reason. Do you have any idea why?

I have seen that before, and I can't point to why it happens, but I think it is more of a test-e2e-node.sh and e2e_node/remote/remote.go change.

> is not this one ?
> https://github.com/kubernetes-sigs/kubetest2/blob/22d5b1410bef09ae679fa5813a5f0d196b6079de/pkg/testers/node/node.go#L73

Yeah, that is the flag we are using (tester flags), but under the hood, the rabbit hole transforms the timeout in several places.

When we pass --timeout=300m to kubetest2, we get this:

Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /root/.ssh/google_compute_engine [email protected] -- sudo /bin/bash -c 'cd /tmp/node-e2e-20250103T183438 && set -o pipefail; timeout -k 30s 18000.000000s ./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  -skip=""""  --no-color -v --timeout=180m ./e2e_node.test -- --system-spec-name= --system-spec-file= --extra-envs= --runtime-config= --v 4 --node-name=test-fedora-coreos-41-20241122-3-0-gcp-x86-64 --report-dir=/tmp/node-e2e-20250103T183438/results --report-prefix=fedora --image-description="fedora-coreos-41-20241122-3-0-gcp-x86-64" --kubelet-flags="--cluster-domain=cluster.local" --dns-domain="cluster.local" --prepull-images=false  --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}" 2>&1 | tee -i /tmp/node-e2e-20250103T183438/results/test-fedora-coreos-41-20241122-3-0-gcp-x86-64-ginkgo.log']

This results in a process timeout of 18000.000000s (300m).
Also, test-e2e-node.sh introduces -timeout=24h regardless of any other timeout you pass.
And finally, the timeout we specified is trimmed by remote.go, resulting in --timeout=180m.

So setting 300m -> (300 + 60) / 2 = 180m passed to ginkgo.
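The arithmetic can be written down as a small sketch. It only reproduces the numbers reported in this thread: the `(t + 60) / 2` rule is taken from the comment above rather than from reading remote.go, and all function names are hypothetical.

```python
def ginkgo_timeout_minutes(requested_minutes: float) -> float:
    """Value that reportedly ends up in ginkgo's --timeout, per the rule quoted above."""
    return (requested_minutes + 60) / 2


def process_timeout_seconds(requested_minutes: float) -> float:
    """Value observed in the `timeout -k 30s <N>s` wrapper: the request in seconds."""
    return requested_minutes * 60


def request_for_ginkgo_minutes(wanted_minutes: float) -> float:
    """Inverse of the rule: what to pass so that ginkgo gets wanted_minutes."""
    return 2 * wanted_minutes - 60


if __name__ == "__main__":
    print(ginkgo_timeout_minutes(300))      # minutes passed to ginkgo
    print(process_timeout_seconds(300))     # seconds for the outer timeout
    print(request_for_ginkgo_minutes(300))  # request needed for a 300m ginkgo timeout
```

If the rule holds, getting a given ginkgo timeout requires requesting a noticeably larger value in the job config.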

bart0sh · 2025-01-03T20:52:14Z

I hope that timeout recalculation has some reason. It's not obvious, but hopefully it exists :)

BTW, increasing the timeout helped the job but did not fix it. One test case still fails.

@kannon92 @elieser1101 Any ideas how to fix it?

k8s-ci-robot added the needs-sig label May 6, 2024

k8s-ci-robot added sig/node and removed needs-sig labels May 6, 2024

k8s-ci-robot added the lifecycle/stale label Aug 4, 2024

k8s-ci-robot removed the lifecycle/stale label Aug 5, 2024

SergeyKanzhelev added this to SIG Node CI/Test Board Aug 11, 2024

github-project-automation bot moved this to Triage in SIG Node CI/Test Board Aug 11, 2024

kannon92 moved this from Triage to Issues - To do in SIG Node CI/Test Board Aug 21, 2024

k8s-ci-robot added triage/accepted and priority/important-longterm labels Aug 21, 2024

k8s-ci-robot assigned elieser1101 Sep 5, 2024

This was referenced Dec 13, 2024

dra uses --ginkgo-flags #33948

Merged

add ginkgo-flags to node tester kubernetes-sigs/kubetest2#286

Merged

bart0sh mentioned this issue Dec 18, 2024

replace pull-crio-cgrpv2-imagefs-separatedisktest with kubetest2 job #33994

Merged

This was referenced Dec 18, 2024

splitfs-separate-disk-kubetest2: use empty skip-regex #34005

Merged

replace pull-crio-cgroupv2-splitfs-separate-disk with kubetest2 job #34021

Merged

elieser1101 mentioned this issue Dec 30, 2024

fix dra --label-filter test #34053

Merged

bart0sh mentioned this issue Jan 3, 2025

node_e2e: set timeout for kubetest2 eviction test #34067

Merged
