[Bug]: Java x-lang jobs are failing for Beam 2.47.0 RC3 due to failing to start the Python container #26576
Comments
I see the following logged many times in the worker-startup logs, which might be related:
    /opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /opt/apache/beam/boot)
Great catch!
There's a related bug here: #24470. According to that issue, the bullseye base image had glibc 2.31 while the SDK harness was linked against a higher version of glibc. The same thing might be happening here. @lostluck @riteshghorse @jrmccluskey any idea how to resolve this? Should we rebuild the SDK harness container with glibc 2.31 installed?
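To make the mismatch described above concrete, here is a minimal sketch of the comparison the dynamic loader effectively performs: a binary that needs `GLIBC_2.34` cannot start on an image that ships glibc 2.31. The helper names `parse_glibc_version` and `can_run` are hypothetical and not part of Beam.

```python
import re


def parse_glibc_version(text: str) -> tuple[int, ...]:
    """Extract a dotted version such as 'GLIBC_2.34' or '2.31' as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def can_run(required: str, available: str) -> bool:
    """Return True if the image's glibc is at least the version the binary requires."""
    return parse_glibc_version(required) <= parse_glibc_version(available)
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank "2.4" above "2.31".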
BTW, this can be easily reproduced by running the following:
    docker run -it --platform linux/amd64 --entrypoint '/opt/apache/beam/boot' us-central1-artifactregistry.gcr.io/google.com/dataflow-containers/worker/v1beta3/beam_python3.9_sdk:2.47.0
which fails with the following error:
    /opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version
Thanks @bvolpato
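If the reproduction is scripted, the captured output can be scanned for the loader error automatically. The following is a sketch only: the function name `missing_glibc_symbols` is hypothetical, and the pattern assumes the message format quoted earlier in this thread (version `GLIBC_2.34' not found (required by ...)).

```python
import re

# Matches the dynamic loader message quoted in this thread, e.g.
#   version `GLIBC_2.34' not found (required by /opt/apache/beam/boot)
_NOT_FOUND = re.compile(
    r"version [`']GLIBC_([0-9.]+)' not found \(required by ([^)]+)\)"
)


def missing_glibc_symbols(log_text: str) -> set[tuple[str, str]]:
    """Return the distinct (glibc version, requiring binary) pairs found in the text."""
    return {(m.group(1), m.group(2)) for m in _NOT_FOUND.finditer(log_text)}
```

An empty result means no such loader error was seen in the text; it does not by itself prove that the container started correctly.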
Can confirm #26054 as the root cause here.
Thanks. I also confirmed that the x-lang test passes when #26054 is reverted: https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_21_46_17-7894509046776240415;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))
Forwarding to @lostluck to determine the next steps here.
We also found out that the images built on Jenkins are fine, but images built on HEAD may fail depending on the setup of the local machine. For example, the following passes:
    docker run -it --entrypoint '/opt/apache/beam/boot' gcr.io/apache-beam-testing/beam-sdk/beam_python3.9_sdk:0924840386f473e75324d645e0f0bd466e22dbad
But the following fails on my Linux machine (on HEAD):
    ./gradlew :sdks:python:container:py39:docker
This explains why the HEAD tests haven't failed since the PR was submitted. We currently build the Docker images for a release on the release manager's local machine. We need to update the release process to build Docker images in a standard place that is also consistent with the tests, so that we can catch issues like this early and consistently. Filed #26578 to improve the release process.
Can the containers be fixed by changing the Go compiler version on the local machine and rebuilding them, or are code changes to the branch necessary?
I am not sure I understand why tests running against the release branch didn't catch it.
I believe @jrmccluskey is trying to rebuild the containers in a different setup.
So updating to Go 1.20.2 wouldn't have caused that, since Go can require at most the version of glibc that's present on the machine doing the compilation. The problem is where and how the boot script was built, which depends on whatever machine we run the gradle commands on. Ultimately, the right solution is that instead of compiling on the "local" machine, we're probably better off having the boot script built in a clean, known environment, likely with CGO_ENABLED=0, which will avoid boot-script glibc issues.
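One way to catch this before publishing a container is to inspect the built boot binary for glibc symbol-version references. The sketch below is a heuristic, not Beam's release tooling: it scans the raw bytes for `GLIBC_x.y` strings, which a dynamically linked binary carries in its dynamic string table, and the function names are hypothetical. Tools such as `readelf -V` or `objdump -T` give a more exact answer.

```python
import re
from pathlib import Path
from typing import Optional

# Versioned glibc symbols are recorded as strings like "GLIBC_2.34".
_GLIBC_REF = re.compile(rb"GLIBC_(\d+(?:\.\d+)+)")


def glibc_versions_referenced(data: bytes) -> list[tuple[int, ...]]:
    """Return the distinct glibc versions named in the binary, oldest first."""
    found = {
        tuple(int(part) for part in m.group(1).split(b"."))
        for m in _GLIBC_REF.finditer(data)
    }
    return sorted(found)


def newest_glibc_required(path: str) -> Optional[tuple[int, ...]]:
    """Return the newest glibc version referenced by the file, or None if there is none."""
    versions = glibc_versions_referenced(Path(path).read_bytes())
    return versions[-1] if versions else None
```

A binary built without cgo is expected to be statically linked and to yield no references at all, so a non-empty result newer than the base image's glibc would flag the kind of failure discussed in this thread.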
Reducing the priority since we could unblock the 2.47.0 release by rebuilding the containers in a different environment.
Co-authored-by: lostluck <[email protected]>
What happened?
Seems like Java x-lang jobs are failing for Beam 2.47.0 RC3 due to not being able to start the Python SDK harness container.
Example:
Container: gcr.io/cloud-dataflow/v1beta3/beam_python3.9_sdk:2.47.0
Job: https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_01_31_45-3741497213019956539;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))
Errors: https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22dataflow_step%22%0Aresource.labels.job_id%3D%222023-05-06_01_31_45-3741497213019956539%22%0AlogName%3D%22projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fkubelet%22%0Aseverity%3E%3DERROR;timeRange=2023-05-06T08:14:07.849Z%2F2023-05-06T10:14:07.849Z;cursorTimestamp=2023-05-06T10:13:57.196540Z?project=apache-beam-testing
Error syncing pod, skipping" err="failed to "StartContainer" for "sdk-1-0" with CrashLoopBackOff: "back-off 10s restarting failed container=sdk-1-0 pod=df-pythondataframewordcount--05060131-tupl-harness-pnng_default(9d9011b47f48e0b652f8d16cf81e8f8c)"" pod="default/df-pythondataframewordcount--05060131-tupl-harness-pnng" podUID=9d9011b47f48e0b652f8d16cf81e8f8c
I'm getting the same error when running with the container overridden to use a clone in my private repo, so this is unlikely to be a GCR issue.
The same job works fine with Beam 2.46.0, so there seems to be an issue with the Beam 2.47.0 artifacts.
https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_01_15_41-10489984441407428074;bottomTab=WORKER_LOGS;logsSeverity=INFO;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))
To reproduce, run the multi-lang quickstart job with a manual expansion service container and using Python 2.47.0 artifacts.
https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/
Creating this as a blocker for the ongoing RC since this breaks a feature that worked for the previous release.
Issue Priority
Priority: 3 (minor)
Issue Components