It's not obvious to new users what value to specify for `chunksize`, yet this choice is critical for good performance. It would be great if we could add some recommendations (and possibly examples) to the docs on how to choose a good `chunksize` value. Specifically, we could list a few basic recommendations:
- Specify it as a ratio of the total number of rows (e.g. `nrows / 10`)
- When the data is very big, cap it at a maximum size (e.g. whether the data is 20GB or 300GB, pick `10_000_000` rows); see the sketch after this list
- If the same analysis will be rerun many times, benchmark different `chunksize` choices and pick the one that is fastest but also doesn't cause OOM errors
- etc.
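A minimal sketch of the first two heuristics, assuming the row count is known (or cheaply countable) up front; the name `pick_chunksize`, the `fraction` parameter, and the `10_000_000` cap are illustrative, not part of any existing API:

```python
# Sketch of the ratio + cap heuristics described above (illustrative only).
MAX_CHUNK_ROWS = 10_000_000  # hard ceiling so huge inputs don't exhaust memory

def pick_chunksize(nrows: int, fraction: int = 10) -> int:
    """Return roughly nrows / fraction rows per chunk, capped at MAX_CHUNK_ROWS."""
    return max(1, min(nrows // fraction, MAX_CHUNK_ROWS))

# A 300-million-row file still gets 10,000,000-row chunks,
# while a 2-million-row file gets 200,000-row chunks.
print(pick_chunksize(300_000_000))  # 10000000
print(pick_chunksize(2_000_000))    # 200000
```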
Showing an example of how to do that benchmarking would be really useful to users who are unsure what `chunksize` to pick.
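For instance, something along these lines could go in the docs. This is only a sketch: it assumes a pandas-style `read_csv(..., chunksize=...)` chunk iterator (the actual reader may differ), and `process()` is a hypothetical stand-in for whatever per-chunk analysis the user runs:

```python
import time
import pandas as pd

def process(chunk: pd.DataFrame) -> None:
    """Stand-in for the real per-chunk analysis (hypothetical)."""
    chunk.describe()

def benchmark_chunksizes(path: str, candidates: list[int]) -> None:
    """Time one full pass over the file for each candidate chunksize."""
    for chunksize in candidates:
        start = time.perf_counter()
        for chunk in pd.read_csv(path, chunksize=chunksize):
            process(chunk)
        elapsed = time.perf_counter() - start
        print(f"chunksize={chunksize:>12,}: {elapsed:.1f}s")

# Pick the fastest candidate that doesn't trigger OOM on your machine.
benchmark_chunksizes("data.csv", [100_000, 1_000_000, 10_000_000])
```

Wrapping the loop body with a memory profiler (or just watching peak RSS) would also help users see which candidates are safe on their hardware.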