Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelising in R efficiently #8

Closed
bschilder opened this issue Mar 21, 2023 · 4 comments
Closed

Parallelising in R efficiently #8

bschilder opened this issue Mar 21, 2023 · 4 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@bschilder
Copy link
Collaborator

bschilder commented Mar 21, 2023

Parallelising over groups in R (e.g. each phenotype's gene list) can use a ton of memory, because you can accidentally copy all objects used inside the loop, which multiplies your memory usage when working with large datasets.

gene_data <- HPOExplorer::load_phenotype_to_genes()
gene_data[,HPO_ID.LinkID:=paste(HPO_ID,LinkID,sep=".")]
nrow(gene_data) # 968,116
length(unique(gene_data$HPO_ID.LinkID)) #621,906

Even on large-memory machines like the Threadripper (250Gb) this quickly gets overloaded bc each parallel process uses 17.6Gb of memory to store the entire gene_data obj over and over again.
17.6 * 50 cores = 835Gb of memory! This means that parallelising chokes up the threadripper's memory and slows it down to a grinding halt, making processing the data even slower than if you had just single-threaded it. This is exactly what happened when i tried this version of the gen_overlap func:
https://github.com/neurogenomics/MultiEWCE/blob/25d26a41096902607a4f595343f2f585dad9f819/R/gen_overlap.R#L49

A better way might be to split the data first into chunks, and then iterate over the chunks:

 split.data.table <- utils::getFromNamespace("split.data.table","data.table")
gene_data_split <- split.data.table(x = gene_data,
                                    by ="HPO_ID.LinkID")

Related posts

https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming
https://stackoverflow.com/questions/14759905/data-table-and-parallel-computing

👇 Contains some useful benchmarking
Rdatatable/data.table#2223

Killing zombie processes

This happens when you launch some paralleised func in R and then decide to stop it midway. A bunch of "zombie" processes are leftover.

Restarting the R session

Sometimes this works, other times not so much. Might not work when memory is so overloaded that you can't restart the R session.

via htop

Didn't seem to work.

Via inline

Someone suggested this, but didn't seem to do anything for me on the Threadripper.

https://stackoverflow.com/questions/25388139/r-parallel-computing-and-zombie-processes

via future

Using future instead of parallel might give me more control over this, but it has to done before launching the processes.
futureverse/future#93

via Docker

To kill them, the only effective way I've found is to restart the container:

docker stop <id>
docker start <id>

@Al-Murphy probably relevant to you too

@bschilder
Copy link
Collaborator Author

bschilder commented Mar 21, 2023

Splitting did not help, still caused a huge explosion of memory usage.

Found these tutorials on parallelising in R:
https://yxue-me.com/post/2019-05-12-a-glossary-of-parallel-computing-packages-in-r-2019/
https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html

future tutorial:
https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html

Forking actually copies the entire R environment, not just the variables inside each loop.

@bschilder
Copy link
Collaborator Author

Packages:

  • parallel
  • future
  • future.apply
  • BiocParallel
  • clusterApply
  • furrr
  • foreach

@bschilder
Copy link
Collaborator Author

furrr seems to have some useful methods, like future_map2 which has a built in progress bar and works directly with future:
https://furrr.futureverse.org/
http://zevross.com/blog/2019/02/12/dramatically-speed-up-your-r-purrr-functions-with-the-furrr-package/

@bschilder
Copy link
Collaborator Author

bschilder commented Mar 21, 2023

Testing methods

1. parallel::mclapply

This is the least efficient method, as I include the gene_data_split inside the loop which increases the memory usage by ~1Gb every 2 seconds or so.

  overlap <- parallel::mclapply(seq_len(length(gene_data_split)),
                                function(i){
    list_name <- names(gene_data_split)[[i]]

    d <- gene_data_split[[i]]
    gen_overlap_test(ct_genes = ct_genes,
                     list_name = list_name,
                     dgenes = unique(d$Gene),
                     long_format = long_format,
                     bg = bg,
                     verbose = FALSE)
  }, mc.cores = cores, mc.) |>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column)

2. furrr::future_map2

This is more efficient due to not passing gene_data_split into the loop. But memory usage is still increasing by ~.3Gb every 2 seconds or so.

 future::plan(future::multisession, workers = cores)
  overlap <- furrr::future_map2(.x = gene_data_split,
                     .y = names(gene_data_split),
                     .progress = TRUE,
                     .f = function(.x,.y){
                       gen_overlap_test(ct_genes = ct_genes,
                                        list_name = .y,
                                        dgenes = unique(.x$Gene),
                                        long_format = long_format,
                                        bg = bg,
                                        verbose = FALSE)
                     })|>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column) 

3. furrr::future_map

Only passing in .x, and not .y. So is a bit more efficient than future_map. Increases mem usage by ~.1-.3Gb/2s.

 future::plan(future::multisession, workers = cores)
  overlap <- furrr::future_map(.x = gene_data_split,
                     .progress = TRUE,
                     .f = function(.x){
                       gen_overlap_test(ct_genes = ct_genes,
                                        dgenes = unique(.x$Gene),
                                        long_format = long_format,
                                        bg = bg,
                                        verbose = FALSE)
                     })|>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column)

Further improvements in efficiency were achieved through reducing the size of the ct_genes obj (only store gene names, not quantiles), and not storing the union gene set in the returned results object.
This gets the memory usage to only grow at ~.1Gb/5-14 seconds. Much better! At this rate, I might actually be able to finish all iterations before memory maxes out at >252Gb. I also noticed this rate tends to get a bit slower as time goes on.

4. data.table

A non-parallel option that is still very fast and does not consume increasing amounts of mem over time.
Takes advantage of the by and performs the grouped function calls within-object, which is more efficient that writing an lapply across a list of chunks.

overlap <- gene_data[,gen_overlap_test(ct_genes = ct_genes,
                                       list_name = get(list_name_column),
                                       dgenes = Gene,
                                       long_format = long_format,
                                       bg = bg),
                     by=c(list_name_column)]

Parallelising grouped operating within a data.table has been discussed as well:
Rdatatable/data.table#5376
Rdatatable/data.table#5077
Rdatatable/data.table#4284
Rdatatable/data.table#4077

It's also worth noting that you can use keyby instead of by. This seems to have more to do with how the table is ordered afterwards (via setting the key) than any performance diffs when doing grouped operations:
https://stackoverflow.com/questions/61755966/data-table-summarizing-data-difference-between-by-and-keyby
Rdatatable/data.table#5602

Killing zombies

Using future-based packages like furrr opens up the possibility of freeing up the zombified cores by simply running this command. This means i don't have to restart my entire R session each time!

fut <- future::plan(future::multisession, workers = cores)

future::resetWorkers(fut)

This even works if I forgot to save the plan as a variable

future::plan(future::multisession, workers = cores)

future::resetWorkers([fut](future::plan(future::multisession, workers = cores)))

Shared memory

Another thing to keep in mind here is that the threadripper is different from other kinds of clusters in that it is a shared memory resource. This complicates things and more care needs to be taken to avoid issues due to this.

@bschilder bschilder self-assigned this Mar 21, 2023
@bschilder bschilder added enhancement New feature or request documentation Improvements or additions to documentation labels Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant