Previously working Dataflow jobs started crashing when sharding #10971
Comments
The validation/test split is only 52,880 examples, also in 512 shards, so maybe the training shards get really big and take down a Dataflow worker somehow?
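For rough context, here is a back-of-the-envelope comparison of examples per shard using the counts mentioned in this thread (actual shard byte sizes depend on the audio features and aren't shown here):

```python
# Example counts and shard counts as reported in this issue.
splits = {
    "validation/test": (52_880, 512),
    "train": (433_890, 1_024),  # only 512 shards were actually written
}

for name, (num_examples, num_shards) in splits.items():
    print(f"{name}: ~{num_examples / num_shards:.0f} examples per shard")
```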
@carlthome could you provide more details so that I could look into it?
The last known time the DatasetBuilder worked on Dataflow was Dec 6, 2023, with the following requirements.txt:
Running the same today results in:
Perhaps there have been unintended changes on the Dataflow side, rather than in the Beam pipeline. This is in the Dataflow job logs:
Maybe related: https://issuetracker.google.com/issues/368255186
I have a DatasetBuilder for terabytes of audio that used to work and ran to completion on Dataflow, but it has stopped working. The code is unchanged. The data shouldn't have changed. We've been unable to debug this, so I'm looking for whether there are unexpected changes in how tensorflow-datasets relies on Beam. We're using the Apache Beam Python 3.10 SDK 2.60.0 on Dataflow V2, and strangely the test and validation splits complete but not training. Is there some size limitation in the sharding logic (433,890 serialized_examples, 1,024 NumberOfShards, 512 written_shards)?
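For reference, the builder follows roughly the standard TFDS-on-Beam pattern, where `_generate_examples` returns a Beam `PTransform` so the split can run as a Dataflow pipeline. The sketch below is a minimal, hypothetical version for illustration only; the dataset name, bucket paths, and features are placeholders, not the actual builder from this report:

```python
import apache_beam as beam
import tensorflow as tf
import tensorflow_datasets as tfds


class MyAudioDataset(tfds.core.GeneratorBasedBuilder):
  """Minimal Beam-based builder sketch; paths and features are placeholders."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'audio': tfds.features.Audio(sample_rate=16_000),
        }),
    )

  def _split_generators(self, dl_manager):
    return {
        'train': self._generate_examples('gs://my-bucket/train/*.wav'),
        'validation': self._generate_examples('gs://my-bucket/validation/*.wav'),
        'test': self._generate_examples('gs://my-bucket/test/*.wav'),
    }

  def _generate_examples(self, pattern: str):
    def _to_example(path: str):
      # Use the file path as the unique key; the Audio feature reads the file.
      return path, {'audio': path}

    # Returning a Beam PTransform (rather than yielding examples) is what lets
    # TFDS execute the split as a pipeline, e.g. on Dataflow.
    return (
        beam.Create(tf.io.gfile.glob(pattern))
        | beam.Map(_to_example)
    )
```

The pipeline is then typically launched with `tfds build --beam_pipeline_options="runner=DataflowRunner,..."`, the documented way to run a builder on Dataflow; note that nothing about shard counts is configured explicitly in a setup like this.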