
[Bug]: Java x-lang jobs are failing for Beam 2.47.0 RC3 due to failing to start the Python container #26576

Closed
chamikaramj opened this issue May 6, 2023 · 12 comments

@chamikaramj (Contributor) commented May 6, 2023

What happened?

It looks like Java x-lang jobs are failing for Beam 2.47.0 RC3 because the Python SDK harness container cannot be started.

Example:

Container: gcr.io/cloud-dataflow/v1beta3/beam_python3.9_sdk:2.47.0

Job: https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_01_31_45-3741497213019956539;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))

Errors: https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22dataflow_step%22%0Aresource.labels.job_id%3D%222023-05-06_01_31_45-3741497213019956539%22%0AlogName%3D%22projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fkubelet%22%0Aseverity%3E%3DERROR;timeRange=2023-05-06T08:14:07.849Z%2F2023-05-06T10:14:07.849Z;cursorTimestamp=2023-05-06T10:13:57.196540Z?project=apache-beam-testing

"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sdk-1-0\" with CrashLoopBackOff: \"back-off 10s restarting failed container=sdk-1-0 pod=df-pythondataframewordcount--05060131-tupl-harness-pnng_default(9d9011b47f48e0b652f8d16cf81e8f8c)\"" pod="default/df-pythondataframewordcount--05060131-tupl-harness-pnng" podUID=9d9011b47f48e0b652f8d16cf81e8f8c

I'm getting the same error when running with the container overridden to use a clone in my private repo, so this is unlikely to be a GCR issue.

The same job works fine with Beam 2.46.0, so it seems like there's an issue with the Beam 2.47.0 artifacts.

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_01_15_41-10489984441407428074;bottomTab=WORKER_LOGS;logsSeverity=INFO;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))

To reproduce, run the multi-language quickstart job with a manually specified expansion service container and the Python 2.47.0 artifacts.

https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/

Filing this as a blocker for the ongoing RC since it breaks a feature that worked in the previous release.

Issue Priority

Priority: 3 (minor)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@github-actions github-actions bot added the P3 label May 6, 2023
@chamikaramj chamikaramj self-assigned this May 6, 2023
@chamikaramj chamikaramj added this to the 2.47.0 Release milestone May 6, 2023
@chamikaramj chamikaramj changed the title [Bug]: Java x-lang jobs are failing for Beam 2.47.0 RC3 due to failing to download the Python container [Bug]: Java x-lang jobs are failing for Beam 2.47.0 RC3 due to failing to start the Python container May 6, 2023
@chamikaramj chamikaramj removed the P3 label May 6, 2023

@brucearctor (Contributor) commented:

great catch!

@chamikaramj (Contributor, Author) commented:

There's a related bug here: #24470

According to that issue, the Bullseye base image had glibc 2.31 while the SDK harness boot binary was linked against higher glibc versions. The same thing might be happening here.

@lostluck @riteshghorse @jrmccluskey any idea how to resolve this? Should we re-build the SDK harness container with glibc 2.31 installed?
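
One way to check that theory (a minimal diagnostic sketch, assuming grep, sort, and ldd are available inside the image, as they are in a standard Bullseye-based image) is to compare the glibc version the container ships with the GLIBC symbol versions the boot binary requires:

# Print the glibc version the container provides, then the GLIBC_x.y symbol
# versions the boot binary was linked against.
docker run --rm --platform linux/amd64 --entrypoint /bin/sh \
  gcr.io/cloud-dataflow/v1beta3/beam_python3.9_sdk:2.47.0 -c \
  'ldd --version | head -1; grep -aoE "GLIBC_2\.[0-9]+" /opt/apache/beam/boot | sort -uV'

If the binary lists GLIBC_2.32/2.34 while ldd reports 2.31 (Bullseye), the mismatch is confirmed.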

@chamikaramj (Contributor, Author) commented:

BTW, this can easily be reproduced by running the following:

docker run -it --platform linux/amd64 --entrypoint '/opt/apache/beam/boot' us-central1-artifactregistry.gcr.io/google.com/dataflow-containers/worker/v1beta3/beam_python3.9_sdk:2.47.0

This fails with the following error:

/opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /opt/apache/beam/boot)
/opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /opt/apache/beam/boot)

Thanks @bvolpato

@bvolpato (Contributor) commented May 7, 2023

Can confirm #26054 as the root cause here:

$ git checkout release-2.47.0
$ ./gradlew :sdks:python:container:py39:docker
$ docker run -it --entrypoint '/opt/apache/beam/boot' docker.io/apache/beam_python3.9_sdk:2.47.0.dev
/opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /opt/apache/beam/boot)
/opt/apache/beam/boot: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /opt/apache/beam/boot)

$ git revert 7ee74d2bf7338e82d35e4429e6d21decc1097621

$ ./gradlew :sdks:python:container:py39:docker
$ docker run -it --entrypoint '/opt/apache/beam/boot' docker.io/apache/beam_python3.9_sdk:2.47.0.dev
2023/05/07 00:58:56 No id provided.

@chamikaramj (Contributor, Author) commented:

Thanks. Also confirmed that the x-lang test passes when #26054 is reverted: https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-06_21_46_17-7894509046776240415;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))

Forwarding to @lostluck to determine the next steps here.
Probably we should revert #26054 from the release branch in the short term and rebuild the containers.

@chamikaramj (Contributor, Author) commented May 7, 2023

We also found out that the images built on Jenkins are fine, but images built locally from HEAD may fail depending on the setup of the local machine.

For example,

The following passes:

docker run -it --entrypoint '/opt/apache/beam/boot' gcr.io/apache-beam-testing/beam-sdk/beam_python3.9_sdk:0924840386f473e75324d645e0f0bd466e22dbad

But the following fails on my Linux machine:

(on HEAD)

./gradlew :sdks:python:container:py39:docker
docker run -it --entrypoint '/opt/apache/beam/boot' apache/beam_python3.9_sdk:2.48.0.dev

This explains why the HEAD tests haven't failed since the PR was submitted.

We currently build the release Docker images on the release manager's local machine. We need to update the release process to build Docker images in a standard environment that is also consistent with the tests, so that we can catch issues like this early and consistently. Filed #26578 to improve the release process.
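
As a hedged aside (not from the thread): when cgo is enabled, the Go linker picks up whatever glibc the build host provides, so the same Gradle build can produce a boot binary that requires glibc 2.34 on a newer distro and 2.31 on an older one. A quick way to see what a given build host would bake in:

# glibc version on the build host (the minimum the resulting image would then need)
ldd --version | head -1
# Go toolchain and cgo setting the local build would use
go version
go env CGO_ENABLED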

@tvalentyn (Contributor) commented:

> We also found out that the images built on Jenkins are fine, but images built locally from HEAD may fail depending on the setup of the local machine.

Can the containers be fixed by changing the Go compiler version on the local machine and rebuilding them, or are code changes to the branch necessary?

@tvalentyn (Contributor) commented:

I am not sure I understand why tests running against the release branch didn't catch it.

@chamikaramj (Contributor, Author) commented:

I believe @jrmccluskey is trying to re-build containers in a different setup.

@lostluck (Contributor) commented May 8, 2023

So updating to Go 1.20.2 wouldn't have caused that by itself, since Go can only require, at most, the version of glibc that's present on the machine doing the compilation. The problem is where/how the boot binary was built, which depends on whatever machine we run the Gradle commands on.

Ultimately, the right solution here is that instead of compiling on the "local" machine, we're probably better off building the boot binary in a clean, known environment, likely with CGO_ENABLED=0; that will avoid these boot-binary glibc issues.
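
A minimal sketch of that idea, assuming a pinned golang image counts as a "clean known environment" (the paths and flags below are illustrative, not the actual Gradle wiring; sdks/python/container is assumed to be where the boot program lives):

# Build the boot binary inside a pinned Go toolchain container with cgo disabled,
# so the result is statically linked and carries no GLIBC_x.y version requirements.
docker run --rm -v "$PWD":/beam -w /beam/sdks/python/container golang:1.20 \
  sh -c 'CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o boot .'

# A statically linked binary has no dynamic glibc dependency to get wrong.
file sdks/python/container/boot   # expect: "statically linked"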

@chamikaramj (Contributor, Author) commented:

Reducing the priority since we could unblock the 2.47.0 release by re-building containers in a different environment.

@chamikaramj chamikaramj added P1 and removed P0 labels May 10, 2023
lostluck added a commit to lostluck/beam that referenced this issue May 11, 2023
riteshghorse pushed a commit that referenced this issue May 11, 2023
cushon pushed a commit to cushon/beam that referenced this issue May 24, 2024