
Previously working Dataflow jobs started crashing when sharding #10971

Open
carlthome opened this issue Jan 7, 2025 · 4 comments
Labels
bug Something isn't working

@carlthome
Contributor

I have a DatasetBuilder for terabytes of audio that used to work and ran to completion on Dataflow, but it has stopped working. The code is unchanged and the data shouldn't have changed. We've been unable to debug this, so I'm looking into whether there have been unexpected changes in how tensorflow-datasets relies on Beam.

We're using the Apache Beam Python 3.10 SDK 2.60.0 on Dataflow V2, and strangely the test and validation splits complete but training doesn't. Is there some size limitation in the sharding logic (433,890 serialized_examples, 1,024 NumberOfShards, 512 written_shards)?

train_write/GroupShards
Workflow failed. Causes: S65:train_write/GroupShards/Read+train_write/GetIdsOfNonEmptyShards/Keys+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/KeyWithVoid+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/Combine/Partial+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: 

      Root cause: Timed out waiting for an update from the worker.
      Worker ID: ...
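
For reference, the build is launched roughly like this (a minimal sketch; the project, bucket, and dataset names are placeholders, not our actual job configuration):

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder GCP project, bucket, and dataset names.
beam_options = PipelineOptions(flags=[
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/beam-temp",
    "--requirements_file=requirements.txt",
])

builder = tfds.builder("my_audio_dataset", data_dir="gs://my-bucket/tensorflow_datasets")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options),
)
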
@carlthome carlthome added the bug Something isn't working label Jan 7, 2025
@carlthome
Contributor Author

The validation/test split is only 52,880 examples and also 512 shards, so maybe the training shards get really big and take down a Dataflow worker somehow?
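
One way to sanity-check that theory would be to compare the shard sizes of the splits that did get written. A rough sketch (the path pattern is a placeholder for our actual data_dir and version):

import tensorflow as tf

# Placeholder path pattern; point it at a split that finished writing.
pattern = "gs://my-bucket/tensorflow_datasets/my_audio_dataset/1.0.0/*-validation.tfrecord-*"

sizes = [tf.io.gfile.stat(path).length for path in tf.io.gfile.glob(pattern)]
if sizes:
    mib = 2 ** 20
    print(f"{len(sizes)} shards, mean {sum(sizes) / len(sizes) / mib:.1f} MiB, max {max(sizes) / mib:.1f} MiB")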

@fineguy
Collaborator

fineguy commented Jan 8, 2025

@carlthome could you provide more details so that I can look into it?

  • When was the last time preparing your dataset worked, and when did you first notice that it doesn't?
  • What versions of tensorflow-datasets were you using?
  • Did you change any other libraries?

@fineguy fineguy self-assigned this Jan 8, 2025
@carlthome
Contributor Author

The last known time the DatasetBuilder worked on Dataflow was Dec 6, 2023, with the following requirements.txt:

apache-beam[gcp]==2.48.0
google-cloud-bigquery[pandas]==3.11.3
google-cloud-storage==2.10.0
librosa==0.8.0
pandas==2.0.3
pandas-gbq==0.19.2
Pillow==9.5.0
tensorflow==2.12.0
tensorflow-datasets==4.9.3
tensorflow-hub==0.14.0
transformers==4.32.1

Running the same today results in

Workflow failed. Causes: S65:train_write/GroupShards/Read+train_write/GetIdsOfNonEmptyShards/Keys+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/KeyWithVoid+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/Combine/Partial+train_write/CollectIdsOfNonEmptyShards/CollectIdsOfNonEmptyShards/CombinePerKey/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:

Perhaps there have been unintended changes on the Dataflow side, rather than in the Beam pipeline. This is in the Dataflow job logs:

*** SIGSEGV (@0x100), see go/stacktraces#s15 received by PID 15 (TID 54) on cpu 1; stack trace: ***

PC: @     0x57aff4c24be0  (unknown)  absl::Mutex::Lock()
    @     0x57aff749b55f       2304  FailureSignalHandler()
    @     0x7a0ed308c9a0    3144200  (unknown)
    @     0x57aff4c24be0          8  (unknown)
    @     0x57aff65d4704        208  dist_proc::dax::workflow::FnApiProcessBundleOperator::EncodeAndOutputElement()
    @     0x57aff662e8ea         16  dist_proc::dax::workflow::TrivialFetchAndFilterSideInputsFn::ProcessInput()
    @     0x57aff65d87db         96  dist_proc::dax::workflow::FnApiProcessBundleOperator::EncodeAndOutputElementIfSideInputIsReady()
    @     0x57aff65d8613        432  dist_proc::dax::workflow::FnApiProcessBundleOperator::Process()
    @     0x57aff52904c1         80  dist_proc::dax::workflow::FanOutOperator::Process()
    @     0x57aff65cbd38        208  dist_proc::dax::workflow::FnApiReadOperator::Read()
    @     0x57aff528d236        304  dist_proc::dax::workflow::ReadOperator::Process()
    @     0x57aff5cfc72f        192  dist_proc::dax::workflow::GraphWorkExecutor::Execute()
    @     0x57aff663da53        496  dist_proc::dax::workflow::InstructionGraphExecutor::Run()
    @     0x57aff52f7eba        560  dist_proc::dax::workflow::ParallelWorkflowWorkerTask::ProcessWork()
    @     0x57aff521bdae         64  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x57aff6864983         80  absl::internal_any_invocable::LocalInvoker<>()
    @     0x57aff59034cd        272  Thread::ThreadBody()
    @     0x7a0ed30844e8        176  start_thread
    @     0x7a0ed2ef922d  (unknown)  clone
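
To rule out silent dependency drift on the workers (only direct dependencies are pinned in requirements.txt, so transitive ones can still change), we could log the resolved package versions from inside the pipeline. A quick sketch; the package list is just a guess at the likely suspects:

import importlib.metadata

for pkg in ("apache-beam", "tensorflow", "tensorflow-datasets", "protobuf", "dill", "numpy"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")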
