CUDA out of memory when saving checkpoint #1199
-
Replies: 7 comments
-
From your screenshot, the program seems to be running a validation step, so the checkpoints should already have been saved. You can take a look at the working directory to see if some .pth file exists. What metric do you use for validation?
Do you mean the memory keeps increasing as the progress bar advances?
-
Thanks for your reply.
The .pth file does exist in my working directory.
I just ran the source code from the author, so the metric is PSNR/SSIM.
-
realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds.py
We can try to reproduce your problem.
-
BTW, if the memory bumps when validation starts and then stays nearly constant, it should be normal: in the current implementation, validation is a standalone function launched apart from training, so it takes some extra memory.
If the memory keeps increasing as the progress bar progresses, there might be a memory leak, e.g. some result tensor not being detached from the computation graph, so the whole graph is kept alive and keeps taking memory.
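In case it helps to see the idea concretely, here is a minimal PyTorch sketch (not the mmediting code; `model` and `data_loader` are placeholders) of a validation loop that avoids keeping the computation graph alive:

```python
import torch

def validate(model, data_loader, device="cuda"):
    model.eval()
    results = []
    with torch.no_grad():  # no autograd graph is built for these forward passes
        for batch in data_loader:
            batch = batch.to(device)
            output = model(batch)
            # Detach and move results to CPU before accumulating them.
            # Without no_grad()/detach(), storing outputs would keep each
            # batch's activations alive and GPU memory would grow per iteration.
            results.append(output.detach().cpu())
    return results
```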
-
Thanks for your reply.
Regarding "validation is a standalone function launched apart from training, so it will take some memory": the extra memory allocated by validation persists afterwards. I want to know whether that is normal.
-
I want to ask for help on how to modify the current implementation so that validation does not take extra memory.
-
Sorry for the delay; we are working on some heavy development. As a very simple workaround, you can just disable evaluation by setting the interval of the evaluation larger than the total number of training iterations.
As checkpoint saving works anyway, you can do the analysis after training, or offline on another GPU.
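To make the workaround concrete, here is a hedged config sketch; the field names (`total_iters`, `evaluation`, `checkpoint_config`) follow typical mmediting/mmcv-style configs such as realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds.py, but the exact keys may differ between versions:

```python
# Total number of training iterations (150k for the realbasicvsr reds config).
total_iters = 150000

# Set the evaluation interval larger than total_iters so validation never runs
# during training and therefore allocates no extra GPU memory.
evaluation = dict(interval=1000000, save_image=False)

# Checkpoints are still written as usual, so the saved .pth files can be
# evaluated offline (e.g. with the repository's test script) on another GPU
# after training finishes.
checkpoint_config = dict(interval=5000, save_optimizer=True, by_epoch=False)
```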