Support sharding in WriteToFiles (tested for to_csv) #33612

Open
wants to merge 1 commit into
base: master

langner commented Jan 15, 2025

Addresses #22923 by adding a special case for sharding with no destination, since I wasn't sure if sharding is applicable if there is a destination. Happy to rework this as needed.

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

Contributor

Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @liferoad for label python.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Contributor

Reminder, please take a look at this PR: @liferoad @ahmedabu98

Contributor

ahmedabu98 left a comment

Mostly looks good, just one suggestion:

@@ -522,7 +522,7 @@ class WriteToFiles(beam.PTransform):
   # Too many files will add memory pressure to the worker, so we let it be 20.
   MAX_NUM_WRITERS_PER_BUNDLE = 20

-  DEFAULT_SHARDING = 5
+  DEFAULT_SHARDING = 1
Contributor

Is this change necessary? I think line 570 can just check whether the input shards is None instead of checking self.shards > 1.
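
To make the suggestion concrete, here is a minimal sketch of the pattern, not the actual WriteToFiles code. The class name ShardingConfig and the method use_explicit_sharding are hypothetical stand-ins. The sketch assumes the constructor receives shards=None when the caller does not set it, and it keeps DEFAULT_SHARDING at 5.

```python
from typing import Optional

# Kept at the pre-PR value; the point is that it does not need to change.
DEFAULT_SHARDING = 5


class ShardingConfig:
    """Hypothetical stand-in for the constructor logic of the transform."""

    def __init__(self, shards: Optional[int] = None):
        # Record whether the caller passed a value before applying the default.
        self.user_specified_shards = shards is not None
        self.shards = shards if shards is not None else DEFAULT_SHARDING

    def use_explicit_sharding(self) -> bool:
        # Branch on "was shards given?" rather than on self.shards > 1,
        # so an explicit shards=1 is still honoured and the default stays 5.
        return self.user_specified_shards


default_cfg = ShardingConfig()
explicit_cfg = ShardingConfig(shards=1)
```

With this shape, a caller who passes nothing keeps the existing default behaviour, while a caller who passes any value, including 1, takes the explicit sharding path.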
