-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMDGPU: add parallel restore of BO content to accelerate restore #2527
base: criu-dev
Are you sure you want to change the base?
Conversation
@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes: |
@rst0git No problem. We will add proper description in the next version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wweewrwer. Some minor nit picks, but overall the code looks good to me.
@rst0git @avagin @dayatsin-amd hi maintainers, thanks for your prior reviews and comments. We have fixed all the issues, as the following:
Please let us know if you have any further comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wweewrwer
@wweewrwer Would you be able to merge the fixup commits into the previous commits using |
6d2c563
to
0b97df8
Compare
@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments. |
cb6b91d
to
37e3813
Compare
@rst0git @avagin We have pushed the V4 version of the PR, completing all mentioned issues since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support multiple processes restore, and (3) fix other minor issues mentioned. Details:
We have performed all the tests with the above changes. The PR can still bring 31% decrease for the restore latency in the case of single process, and achieves the same results for mutlti-process scenarios. Please let me know if you have any further comments. |
1f48cd3
to
c7ca1b3
Compare
Have you investigated other approaches of parallel restoring of BO? For example, it is possible to fork a thread and restoring BO asynchronously in context of its process. In this case, two BO will be restored concurrently. |
I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression. Thank you for this patch! |
Sure. Thanks! |
Dear David @dayatsin-amd , just want to know are there any progress/results about the internal regression test. Thank you again for the assistance and happy new year btw :) |
Everything what is happening in the restore blob should be fast. All mappings are restored before switching into the restore blob. There, the restored mappings are just remapped to proper addresses. I am still not convinced that the idea of restoring buffer objects from the main process is really what we need here.. I can miss something, but I want to see a clear explanation with numbers why the proposed solution is a valuable one. Additionally, I see two potential issues:
|
Thank you for your comment.
In the case of GPU apps (AMD GPUs in our case), not everything in the restore blob is fast, and not all mappings are restored before switching into the restore blob. Some mappings are still restored within the restore blob (e.g., In our tests, the restore time of the restore blob even takes longer than the GPU buffer objects restore time (this can be easily reproduced using the same test application and approach as in this pull request). Here is the breakdown of our test. This is the motivation of the PR and we believe it is valuable to parallelize the restoration of GPU buffer objects with other CRIU restore operations for optimized restore latency.
This PR will not lead to sequential restoration issue. The optimization focuses on the single-process situation (common case), as shown in the following table. In other scenarios, it will turn to the original method. This is achieved with the new
And for a single process, its restore logic is the same as the original method except that it is offloaded to the main CRIU process.
It will not introduce new dependency. Currently the main CRIU process must wait for other processes to complete their memory state restoration before proceeding with additional restoration logic. Besides, most of the code is in separate files and related to AMD GPUs for optimized performance, and will not break anything in the main path. Please let me know if you have any further comments. |
I forgot that we moved restore of anon-vma-s to the restorer (91388fc). Now, I understand the problem. Thanks.
I think you misunderstand me here. I don't mean code dependencies. The problem here is that the amd plugin completely occupied the main criu process. It means if there will be another plugin with a similar post-fork hook, these plugin will not work together. If you want to introduce a hook handling events from child processes, it should be running in context of a separate thread. ps, I like all these detailed comments, but why all these details have not been explained on the commit messages. It would speed up the review process and in future, it would help to understand ideas behind the implementation. |
Currently, in the target process, device-related restore operations and other restore operations almost run sequentially. When the target process executes the corresponding CRIU hook functions, it can't perform other restore operations. However, for GPU applications, some device restore operations have no logical dependencies on other common restore operations and can be parallelized with other operations to speed up the process. Instead of launching a thread in child processes for parallelization, this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU process to handle these restore operations. This is because the restoration of memory state in the restore blob is one of the most time-consuming parts of all restore logic. The main CRIU process can easily parallelize these operations, whereas parallelizing in threads within child processes is challenging. - POST_FORKING *POST_FORKING: Hook to enable the main CRIU process to perform some restore operations of plugins. Signed-off-by: Yanning Yang <[email protected]>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not initialized. However, during the plugin restore procedure, there may be some common file operations used in multiple hooks. This patch moves `cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use `fdstore` to place these file operations. Signed-off-by: Yanning Yang <[email protected]>
Currently, parallel restore only focuses on the single-process situation. Therefore, it needs an interface to know if there is only one process to restore. This patch adds a `has_children` function in `pstree.h` and replaces some existing implementations with this function. Signed-off-by: Yanning Yang <[email protected]>
When enabling parallel restore, the target process and the main CRIU process need an IPC interface to communicate and transfer restore commands. This patch adds a Unix domain TCP socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <[email protected]>
Currently the restore of buffer object comsumes a significant amount of time. However, this part has no logical dependencies with other restore operations. This patch introduce some structures and some helper functions for the target process to offload this task to the main CRIU process. Signed-off-by: Yanning Yang <[email protected]>
This patch implements the entire logic to enable the offloading of buffer object content restoration. The goal of this patch is to offload the buffer object content restoration to the main CRIU process so that this restoration can occur in parallel with other restoration logic (mainly the restoration of memory state in the restore blob, which is time-consuming) to speed up the restore phase. The restoration of buffer object content usually takes a significant amount of time for GPU applications, so parallelizing it with other operations can reduce the overall restore time. It has three parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the POST_FORKING hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process; the third stops the parallel thread in the RESUME_DEVICES_LATE hook. This optimization only focuses on the single-process situation (common case). In other scenarios, it will turn to the original method. This is achieved with the new `parallel_disabled` flag. Signed-off-by: Yanning Yang <[email protected]>
Signed-off-by: Yanning Yang <[email protected]>
c7ca1b3
to
41e7d7a
Compare
The previous suggestions have been updated, and both the regression test and related tests have been completed. Please let us know if you have any further comments. :) |
TL;DR:
This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.
The target issue:
In the current restore procedure of AMDGPU applications, the content of the AMDGPU buffer object (BO) is restored synchronously in
CR_PLUGIN_HOOK__RESTORE_EXT_FILE
. This procedure usually takes a significant amount of time, and during this time the target process cannot perform any other restore operations. However, this restoration has no logical dependencies with other restore operations. Parallelizing this part with other restore operations can speed up the restoration.The parallel restore approach in this PR:
The core idea of these patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process refers to the parent process, and the target process refers to the child process created during the fork). To achieve this, we introduce a new hook,
CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS
, in the main CRIU process. For the AMDGPU plugin, the target process will no longer restore BO contents inCR_PLUGIN_HOOK__RESTORE_EXT_FILE
and just send the relevant BOs to the main CRIU process. the main CRIU process will receive the corresponding BOs inCR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS
and begin the restoration. Meanwhile, the target process can continue with other parts of the restoration without being blocked by the BO content restoration. The full design of the idea can also be referred with the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.Tests:
We evaluated the performance according to the following settings. The results show that parallel restore can speed up by 34.3% when images cached in the page cache, and 7.6% when restoring from disk.
Results:
Settings:
CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Memory: DDR4, 2x8GB
GPU: AMD MI50
Disk: 512GB, Samsung SSD 860
Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1
Example program:
example.py
: a ResNet18 application. Enter 'y' to exit, or press any other key to perform inference.Steps:
Install CRIU
Follow the standard CRIU installation process. Ensure you set the environment variable
CRIU_LIBS_DIR
to theplugins/amdgpu
path.Dump checkpoint image
Restore from disk
Test for sequential restore:
Test for parallel restore:
Restore from page cache
Install vmtouch for caching images:
Test: