Hi veRL team,
I have encountered OOM issues and want to reduce memory usage by enabling gradient checkpointing.
However, I find that the parameter critic.model.enable_gradient_checkpointing does not change the memory usage of the critic model in FSDP workers. Here is how I reached this conclusion: I profiled the max reserved memory and max allocated memory before and after update_critic(). Both values increase after calling update_critic(), so I believe they reflect the peak memory usage during update_critic().
However, when critic.model.enable_gradient_checkpointing is turned on, these values remain the same as when it is turned off.
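For reference, this is roughly how I take the measurement (critic_worker and data are placeholders for the actual worker and batch in my script):

```python
import torch

# Sketch of the measurement described above; critic_worker / data are placeholders.
torch.cuda.reset_peak_memory_stats()
before_alloc = torch.cuda.max_memory_allocated()
before_reserved = torch.cuda.max_memory_reserved()

critic_worker.update_critic(data)  # the step whose peak memory I want to observe
torch.cuda.synchronize()

after_alloc = torch.cuda.max_memory_allocated()
after_reserved = torch.cuda.max_memory_reserved()

print(f"max allocated: {before_alloc / 2**30:.2f} GiB -> {after_alloc / 2**30:.2f} GiB")
print(f"max reserved:  {before_reserved / 2**30:.2f} GiB -> {after_reserved / 2**30:.2f} GiB")
```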
I'd like to ask: how can I confirm that gradient checkpointing is actually enabled?
I also find that the parameter enable_gradient_checkpointing is not used by the Megatron workers. How can I enable gradient checkpointing in Megatron workers? Thanks.
Hi, @Vamix
For FSDP, we fixed the critic gradient checkpointing issue in this PR: #27. You can try it.
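In case it helps with verification, gradient checkpointing for an FSDP-wrapped Hugging Face model is typically enabled on the module before wrapping. A minimal sketch (not the exact code in the PR; the model path and config access are illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Load the critic module (placeholder path).
model = AutoModelForCausalLM.from_pretrained("path/to/critic")

# Enable checkpointing on the Hugging Face module *before* FSDP wrapping.
if config.critic.model.enable_gradient_checkpointing:  # illustrative config access
    model.gradient_checkpointing_enable()
    model.config.use_cache = False  # KV cache conflicts with checkpointing during training

fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
```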
For Megatron-LM, the ParallelLlama model does not yet support gradient checkpointing, so we cannot enable it in the Megatron workers. Would you like to add this feature to the ParallelLlama model?
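If anyone wants to pick this up, the usual approach is to wrap each decoder layer's forward in torch.utils.checkpoint during training so activations are recomputed in the backward pass. A rough sketch of what that could look like (layer and argument names are illustrative, not the actual ParallelLlama code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_decoder_layers(layers, hidden_states, attention_mask,
                           use_checkpointing: bool, training: bool):
    """Run decoder layers, optionally recomputing activations in backward.

    `layers`, `hidden_states`, and `attention_mask` stand in for the
    ParallelLlama internals; this is a sketch, not the model's real code.
    """
    for layer in layers:
        if use_checkpointing and training:
            # Do not store this layer's activations; recompute them in backward.
            hidden_states = checkpoint(
                layer, hidden_states, attention_mask, use_reentrant=False
            )
        else:
            hidden_states = layer(hidden_states, attention_mask)
    return hidden_states
```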