docs: Add some recommendations on table partitioning #40

Open
jpsamaroo opened this issue Aug 8, 2023 · 0 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@jpsamaroo
Member

It's not entirely obvious to new users what value should be specified for chunksize, yet its choice is vital to getting good performance. It would be great if we could add to the docs some recommendations (and possibly examples) of how to choose a good chunksize value. Specifically, we could list a few basic recommendations:

  • Specify it as a ratio of the total number of rows (e.g. nrows / 10)
  • When the data is very large, cap it at a fixed maximum (e.g. 10_000_000 rows, whether the data is 20GB or 300GB)
  • If the same analysis will be rerun many times, benchmark different chunksize choices and pick the fastest one that doesn't cause out-of-memory (OOM) errors
  • etc.
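
As a rough illustration of how the first two recommendations combine, here is a minimal sketch in Python. It is written only to show the arithmetic: `pick_chunksize`, its parameter names, and its defaults are hypothetical and not part of any library API.

```python
def pick_chunksize(nrows: int, fraction: int = 10, max_rows: int = 10_000_000) -> int:
    """Return a starting chunksize: a ratio of the row count, capped at a maximum.

    Hypothetical helper for illustration; the defaults mirror the examples
    in the list above (nrows / 10 and 10_000_000 rows) and are not tuned values.
    """
    if nrows <= 0:
        raise ValueError("nrows must be positive")
    ratio_based = nrows // fraction          # e.g. nrows / 10
    capped = min(ratio_based, max_rows)      # limit the size for very large data
    return max(1, capped)                    # never return an empty chunk size


# Small table: the ratio applies. Huge table: the cap applies.
small = pick_chunksize(1_000)
huge = pick_chunksize(500_000_000)
```

The result is only a starting point; the benchmarking recommendation is what would confirm or adjust it for a given workload.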

Showing an example of how to do that benchmarking would be really useful to users who are lost on what chunksize to pick.
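
One possible shape for such a benchmark, sketched in Python purely for illustration: time the same analysis at several candidate chunksize values, drop any candidate that runs out of memory, and keep the fastest survivor. The names `benchmark_chunksizes`, `fastest`, and `run_analysis` are hypothetical, and the sketch assumes that running out of memory surfaces as a catchable `MemoryError`, which is not guaranteed in practice (the operating system may kill the process instead).

```python
import time
from typing import Callable, Dict, Iterable


def benchmark_chunksizes(
    run: Callable[[int], object],
    candidates: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Time run(chunksize) for each candidate; return best wall time per candidate.

    Candidates that raise MemoryError are left out of the result.
    """
    timings: Dict[int, float] = {}
    for chunksize in candidates:
        best = float("inf")
        try:
            for _ in range(repeats):
                start = time.perf_counter()
                run(chunksize)
                best = min(best, time.perf_counter() - start)
        except MemoryError:
            continue  # this chunksize does not fit in memory; skip it
        timings[chunksize] = best
    return timings


def fastest(timings: Dict[int, float]) -> int:
    """Return the chunksize with the lowest recorded time."""
    return min(timings, key=timings.get)


# Stand-in analysis: sum a list of rows chunk by chunk.
rows = list(range(100_000))


def run_analysis(chunksize: int) -> int:
    total = 0
    for i in range(0, len(rows), chunksize):
        total += sum(rows[i:i + chunksize])
    return total


results = benchmark_chunksizes(run_analysis, [1_000, 10_000, 50_000])
best_chunksize = fastest(results)
```

Taking the minimum over a few repeats reduces noise from warm-up and other processes; for an analysis that is rerun many times, the one-off cost of the benchmark is quickly repaid.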

@jpsamaroo jpsamaroo added documentation Improvements or additions to documentation good first issue Good for newcomers labels Aug 8, 2023