
Previously working Dataflow jobs started crashing when sharding #10971

Open
carlthome opened this issue Jan 7, 2025 · 4 comments
Labels
bug Something isn't working

@carlthome
Contributor

I have a DatasetBuilder for terabytes of audio that used to work and ran to completion on Dataflow, but it has stopped working. The code is unchanged and the data shouldn't have changed. We've been unable to debug this, so I'm looking into whether there have been unexpected changes in how tensorflow-datasets relies on Beam.

We're using the Apache Beam Python 3.10 SDK 2.60.0 on Dataflow V2, and strangely the test and validation splits complete but training doesn't. Is there some size limitation in the sharding logic (433,890 serialized_examples, 1,024 NumberOfShards, 512 written_shards)?

train_write/GroupShards
Workflow failed. Causes: S65:train_write/GroupShards/Read+train_write/GetIdsOfNonEmptyShards/Keys+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/KeyWithVoid+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/Combine/Partial+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: 

      Root cause: Timed out waiting for an update from the worker.
      Worker ID: ...
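
For reference, the build is launched roughly like this (a minimal sketch; the project, bucket, and dataset names are placeholders, not our actual job configuration):

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder GCP project, bucket, and dataset names.
beam_options = PipelineOptions(flags=[
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/beam-temp",
    "--requirements_file=requirements.txt",
])

builder = tfds.builder("my_audio_dataset", data_dir="gs://my-bucket/tensorflow_datasets")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options),
)
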
@carlthome carlthome added the bug Something isn't working label Jan 7, 2025
@carlthome
Contributor Author

The validation/test split is only 52,880 examples and also 512 shards, so maybe the training shards get really big and take down a Dataflow worker somehow?
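
One way to sanity-check that theory would be to compare the shard sizes of the splits that did get written. A rough sketch (the path pattern is a placeholder for our actual data_dir and version):

import tensorflow as tf

# Placeholder path pattern; point it at a split that finished writing.
pattern = "gs://my-bucket/tensorflow_datasets/my_audio_dataset/1.0.0/*-validation.tfrecord-*"

sizes = [tf.io.gfile.stat(path).length for path in tf.io.gfile.glob(pattern)]
if sizes:
    mib = 2 ** 20
    print(f"{len(sizes)} shards, mean {sum(sizes) / len(sizes) / mib:.1f} MiB, max {max(sizes) / mib:.1f} MiB")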

@fineguy
Collaborator

fineguy commented Jan 8, 2025

@carlthome could you provide more details so that I can look into it?

  • When was the last time preparing your dataset worked, and when did you first notice that it doesn't?
  • What versions of tensorflow-datasets were you using?
  • Did you change any other libraries?

@fineguy fineguy self-assigned this Jan 8, 2025
@carlthome
Contributor Author

The last known time the DatasetBuilder worked on Dataflow was Dec 6, 2023, with the following requirements.txt:

apache-beam[gcp]==2.48.0
google-cloud-bigquery[pandas]==3.11.3
google-cloud-storage==2.10.0
librosa==0.8.0
pandas==2.0.3
pandas-gbq==0.19.2
Pillow==9.5.0
tensorflow==2.12.0
tensorflow-datasets==4.9.3
tensorflow-hub==0.14.0
transformers==4.32.1

Running the same today results in

Workflow failed. Causes: S65:train_write/GroupShards/Read+train_write/GetIdsOfNonEmptyShards/Keys+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/KeyWithVoid+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/Combine/Partial+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:

Perhaps there have been unintended changes on the Dataflow side, rather than in the Beam pipeline. This is in the Dataflow job logs:

*** SIGSEGV (@0x100), see go/stacktraces#s15 received by PID 15 (TID 54) on cpu 1; stack trace: ***

PC: @     0x57aff4c24be0  (unknown)  absl::Mutex::Lock()
    @     0x57aff749b55f       2304  FailureSignalHandler()
    @     0x7a0ed308c9a0    3144200  (unknown)
    @     0x57aff4c24be0          8  (unknown)
    @     0x57aff65d4704        208  dist_proc::dax::workflow::FnApiProcessBundleOperator::EncodeAndOutputElement()
    @     0x57aff662e8ea         16  dist_proc::dax::workflow::TrivialFetchAndFilterSideInputsFn::ProcessInput()
    @     0x57aff65d87db         96  dist_proc::dax::workflow::FnApiProcessBundleOperator::EncodeAndOutputElementIfSideInputIsReady()
    @     0x57aff65d8613        432  dist_proc::dax::workflow::FnApiProcessBundleOperator::Process()
    @     0x57aff52904c1         80  dist_proc::dax::workflow::FanOutOperator::Process()
    @     0x57aff65cbd38        208  dist_proc::dax::workflow::FnApiReadOperator::Read()
    @     0x57aff528d236        304  dist_proc::dax::workflow::ReadOperator::Process()
    @     0x57aff5cfc72f        192  dist_proc::dax::workflow::GraphWorkExecutor::Execute()
    @     0x57aff663da53        496  dist_proc::dax::workflow::InstructionGraphExecutor::Run()
    @     0x57aff52f7eba        560  dist_proc::dax::workflow::ParallelWorkflowWorkerTask::ProcessWork()
    @     0x57aff521bdae         64  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x57aff6864983         80  absl::internal_any_invocable::LocalInvoker<>()
    @     0x57aff59034cd        272  Thread::ThreadBody()
    @     0x7a0ed30844e8        176  start_thread
    @     0x7a0ed2ef922d  (unknown)  clone
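
To rule out silent dependency drift on the workers (only direct dependencies are pinned in requirements.txt, so transitive ones can still change), we could log the resolved package versions from inside the pipeline. A quick sketch; the package list is just a guess at the likely suspects:

import importlib.metadata

for pkg in ("apache-beam", "tensorflow", "tensorflow-datasets", "protobuf", "dill", "numpy"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")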
