Parallelising in R efficiently #8
Splitting did not help; it still caused a huge explosion of memory usage. I found these tutorials on parallelising in R:
Forking actually copies the entire R environment, not just the variables inside each loop.
Packages
Testing methods
Parallelising over groups in R (e.g. each phenotype's gene list) can use a huge amount of memory, because every object used inside the loop can accidentally get copied into each worker, which multiplies your memory usage when working with large datasets.
Even on large-memory machines like the Threadripper (250 GB) this quickly gets overloaded, because each parallel process uses 17.6 GB of memory to store the entire `gene_data` object over and over again: 17.6 GB × 50 cores ≈ 880 GB of memory! This means that parallelising chokes up the Threadripper's memory and grinds it to a halt, making processing the data even slower than if you had just single-threaded it. This is exactly what happened when I tried this version of the `gen_overlap` function:
https://github.com/neurogenomics/MultiEWCE/blob/25d26a41096902607a4f595343f2f585dad9f819/R/gen_overlap.R#L49
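As a rough sanity check on that arithmetic, the per-copy cost can be estimated up front (a sketch; it assumes `gene_data` is already loaded as in the linked code):

```r
## Estimate how much memory N parallel workers would need if each one
## held its own copy of the big object (assumes `gene_data` is loaded).
per_worker_gb <- as.numeric(object.size(gene_data)) / 1024^3
n_workers <- 50
per_worker_gb              # ~17.6 GB per copy in this case
per_worker_gb * n_workers  # ~880 GB in total -- far more than 250 GB of RAM
```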
A better way might be to split the data first into chunks, and then iterate over the chunks:
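A minimal sketch of what that could look like (names like `gene_data`, `phenotype` and `process_phenotype()` are placeholders, not the actual MultiEWCE code): split the big object into per-phenotype chunks once, drop the original, and let a PSOCK cluster serialise only the chunk each worker actually needs.

```r
library(data.table)
library(parallel)

## 1. Split the big table into per-phenotype chunks *before* parallelising.
chunks <- split(gene_data, by = "phenotype")

## 2. Drop the full object so it can't be dragged along into the workers.
rm(gene_data); gc()

## 3. Iterate over the chunks on a PSOCK cluster: each worker only receives
##    the chunks it processes, not the whole R environment.
cl <- makeCluster(4)
clusterEvalQ(cl, library(data.table))
clusterExport(cl, "process_phenotype")  # the per-group computation
results <- parLapply(cl, chunks, function(chunk) process_phenotype(chunk))
stopCluster(cl)
```

The same pattern also works with `mclapply()` on Linux, but a PSOCK cluster makes it explicit that only the exported objects and the chunks themselves ever reach the workers.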
Related posts
https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming
https://stackoverflow.com/questions/14759905/data-table-and-parallel-computing
👇 Contains some useful benchmarking
Rdatatable/data.table#2223
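One takeaway from those threads: some grouped `data.table` operations are already multithreaded internally, so it may be cheaper to let `data.table` do the parallelism in place instead of forking R processes. A sketch (the column names here are hypothetical):

```r
library(data.table)

setDTthreads(4)  # cap data.table's internal OpenMP threads
## Grouped aggregation can run multithreaded inside data.table itself,
## without handing copies of gene_data to separate R processes.
overlap_counts <- gene_data[, .(n_genes = uniqueN(gene)), by = phenotype]
```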
Killing zombie processes
This happens when you launch some parallelised function in R and then decide to stop it midway. A bunch of "zombie" processes are left over.
Restarting the R session
Sometimes this works, other times not so much. Might not work when memory is so overloaded that you can't restart the R session.
Via htop
Didn't seem to work.
Via `inline`
Someone suggested this, but didn't seem to do anything for me on the Threadripper.
https://stackoverflow.com/questions/25388139/r-parallel-computing-and-zombie-processes
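For the record, the suggestion in that thread is roughly the snippet below: use `inline` to compile a small C helper that calls `waitpid()` to reap any finished child processes (Linux/macOS only).

```r
library(inline)

## Reap finished child processes so they stop showing up as zombies.
includes <- "#include <sys/wait.h>"
code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"
reap_zombies <- cfunction(body = code, includes = includes, convention = ".C")
reap_zombies()
```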
Via `future`
Using `future` instead of `parallel` might give me more control over this, but it has to be done before launching the processes.
futureverse/future#93
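A sketch of how that could look with `future` (reusing the hypothetical `chunks` and `process_phenotype()` from the earlier sketch): the plan is declared before any work is launched, and switching back to a sequential plan shuts the background workers down cleanly instead of leaving zombies behind.

```r
library(future)
library(future.apply)

## Declare the parallel backend *before* launching any work.
plan(multisession, workers = 4)

results <- future_lapply(chunks, process_phenotype)

## Switching back to a sequential plan stops the background R sessions,
## so there is nothing left to kill by hand afterwards.
plan(sequential)
```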
Via Docker
To kill them, the only effective way I've found is to restart the Docker container.
@Al-Murphy probably relevant to you too