CUDA out of memory when saving checkpoint #1199
-
Replies: 7 comments
-
From your screenshot, the program seems to be running a validation step, so the checkpoints should already have been saved. You can take a look at the working directory to see if some .pth file exists. What metric do you use for validation?
Do you mean the memory keeps increasing as the progress bar advances?
-
Thanks for your reply.
The .pth file does exist in my working directory.
I just ran the source code from the author, so the metric is PSNR/SSIM.
-
realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds.py
We can try to reproduce your problem.
-
BTW, if the memory bumps when validation starts and then stays nearly constant, it should be normal: in the current implementation, validation is a standalone function launched apart from training, so it takes some extra memory.
If the memory keeps increasing as the progress bar progresses, there might be a memory leak, e.g. some result tensor not being detached from the computation graph, so the whole graph is kept alive and keeps taking memory.
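In case it helps to see the idea concretely, here is a minimal PyTorch sketch (not the mmediting code; `model` and `data_loader` are placeholders) of a validation loop that avoids keeping the computation graph alive:

```python
import torch

def validate(model, data_loader, device="cuda"):
    model.eval()
    results = []
    with torch.no_grad():  # no autograd graph is built for these forward passes
        for batch in data_loader:
            batch = batch.to(device)
            output = model(batch)
            # Detach and move results to CPU before accumulating them.
            # Without no_grad()/detach(), storing outputs would keep each
            # batch's activations alive and GPU memory would grow per iteration.
            results.append(output.detach().cpu())
    return results
```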
-
Thanks for your reply.
Regarding "validation is a standalone function launched apart from training, so it will take some memory": the extra memory allocated by validation persists afterwards. I want to know whether that is normal.
-
I want to ask for help on how to modify the current implementation so that validation does not take extra memory.
-
Sorry for the delay; we are working on some heavy development. As a very simple workaround, you can just disable evaluation by setting the interval of the evaluation larger than the total number of training iterations.
As checkpoint saving works anyway, you can do the analysis after training, or offline on another GPU.
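To make the workaround concrete, here is a hedged config sketch; the field names (`total_iters`, `evaluation`, `checkpoint_config`) follow typical mmediting/mmcv-style configs such as realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds.py, but the exact keys may differ between versions:

```python
# Total number of training iterations (150k for the realbasicvsr reds config).
total_iters = 150000

# Set the evaluation interval larger than total_iters so validation never runs
# during training and therefore allocates no extra GPU memory.
evaluation = dict(interval=1000000, save_image=False)

# Checkpoints are still written as usual, so the saved .pth files can be
# evaluated offline (e.g. with the repository's test script) on another GPU
# after training finishes.
checkpoint_config = dict(interval=5000, save_optimizer=True, by_epoch=False)
```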