Improvement of unbalanced datasets in multiprocessing #47

maxgalli · 2020-06-14T11:30:01Z

As noticed during the last benchmark run, unbalanced datasets are handled suboptimally when multiprocessing is enabled: if one of the RDataFrames is built on top of a dataset that is much larger than the others, the worker that processes it ends up being a bottleneck for the entire analysis. There are several ways to fix this issue (to be investigated and implemented separately):

combine multiprocessing and multithreading: detect the larger datasets in advance and let the workers that process them use multiple threads; to avoid increasing the number of cores used, the overall number of workers is reduced;
use only multiprocessing: detect the larger datasets in advance and split each of them into several RDataFrames, so that they are picked up by different workers; the partial results can easily be merged at the end to obtain the proper histograms; this solution also requires a check that the largest RDataFrames are the first ones sent to the workers.
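The multiprocessing-only option could be sketched roughly as follows. This is only an illustration in plain Python, not code from this repository: the helper names (`split_dataset`, `plan_chunks`, `simulate_makespan`, `merge_histograms`), the `max_chunk` parameter, and the assumption that processing time is proportional to the number of entries are all hypothetical. Building the actual RDataFrame for each entry range is left out.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical description of a unit of work:
# (dataset name, first entry, one-past-last entry)
Chunk = Tuple[str, int, int]


def split_dataset(name: str, n_entries: int, max_chunk: int) -> List[Chunk]:
    """Split [0, n_entries) into consecutive ranges of at most max_chunk entries."""
    return [(name, start, min(start + max_chunk, n_entries))
            for start in range(0, n_entries, max_chunk)]


def plan_chunks(sizes: Dict[str, int], max_chunk: int) -> List[Chunk]:
    """Split every dataset, then order the chunks so the largest are sent first."""
    chunks = [chunk
              for name, n_entries in sizes.items()
              for chunk in split_dataset(name, n_entries, max_chunk)]
    return sorted(chunks, key=lambda c: c[2] - c[1], reverse=True)


def simulate_makespan(chunks: List[Chunk], n_workers: int) -> int:
    """Estimate the finishing time when each free worker takes the next chunk.

    Assumes (for illustration only) that cost is proportional to entry count.
    """
    loads = [0] * n_workers
    heapq.heapify(loads)
    for _, start, stop in chunks:
        lightest = heapq.heappop(loads)  # the worker that becomes free first
        heapq.heappush(loads, lightest + (stop - start))
    return max(loads)


def merge_histograms(partials: List[List[int]]) -> List[int]:
    """Merge partial histograms with identical binning by summing bin contents."""
    return [sum(bins) for bins in zip(*partials)]


if __name__ == "__main__":
    sizes = {"big": 1000, "small_a": 100, "small_b": 100}
    # Without splitting, the worker holding "big" dominates the run time.
    print(simulate_makespan(plan_chunks(sizes, max_chunk=1000), n_workers=3))
    # With splitting and largest-first ordering, the load is spread more evenly.
    print(simulate_makespan(plan_chunks(sizes, max_chunk=400), n_workers=3))
```

Sorting the chunks largest first is what the last point above asks for: a large chunk that is scheduled late would otherwise still be running after all other workers have finished.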

maxgalli self-assigned this Jun 14, 2020
