Support sharding in WriteToFiles (tested for to_csv) #33612
base: master
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
Assigning reviewers. If you would like to opt out of this review, comment R: @liferoad for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Reminder, please take a look at this pr: @liferoad @ahmedabu98
Mostly looks good, just one suggestion
@@ -522,7 +522,7 @@ class WriteToFiles(beam.PTransform):
   # Too many files will add memory pressure to the worker, so we let it be 20.
   MAX_NUM_WRITERS_PER_BUNDLE = 20

-  DEFAULT_SHARDING = 5
+  DEFAULT_SHARDING = 1
Is this change necessary? I think line 570 can just check if the input shards
is None instead of referring to self.shards > 1
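One possible reading of this suggestion, sketched outside of Beam: leave the class default alone and branch on whether the caller passed `shards` at all. `ShardingConfig`, `requested_shards`, and `user_requested_sharding` are hypothetical names for illustration only; the real code around line 570 is not shown in this thread and may look different.

```python
from typing import Optional


class ShardingConfig:
    """Hypothetical stand-in for the transform under review; not Beam code."""

    # Class-level default stays at its existing value.
    DEFAULT_SHARDING = 5

    def __init__(self, shards: Optional[int] = None):
        # Keep the caller's raw value so "not specified" stays distinguishable
        # from an explicit request such as shards=1.
        self.requested_shards = shards
        self.shards = shards if shards is not None else self.DEFAULT_SHARDING

    def user_requested_sharding(self) -> bool:
        # Branch on whether a value was passed, not on its magnitude.
        return self.requested_shards is not None
```

With this shape, an explicit `shards=1` still counts as a user request, which a `self.shards > 1` check would miss.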
Addresses #22923 by adding a special case for sharding with no destination, since I wasn't sure if sharding is applicable if there is a destination. Happy to rework this as needed.