Hi veRL team,
I have encountered OOM issues and want to reduce memory usage by enabling gradient checkpointing.
However, I find that the parameter critic.model.enable_gradient_checkpointing does not change the memory usage of the critic model in FSDP workers. Here is how I reached this conclusion: I profiled the max reserved memory and max allocated memory before and after update_critic(). Both values increase after calling update_critic(), so I believe they reflect the peak memory usage during update_critic().
However, when critic.model.enable_gradient_checkpointing is turned on, these values remain the same as when it is turned off.
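For reference, this is roughly how I take the measurement (critic_worker and data are placeholders for the actual worker and batch in my script):

```python
import torch

# Sketch of the measurement described above; critic_worker / data are placeholders.
torch.cuda.reset_peak_memory_stats()
before_alloc = torch.cuda.max_memory_allocated()
before_reserved = torch.cuda.max_memory_reserved()

critic_worker.update_critic(data)  # the step whose peak memory I want to observe
torch.cuda.synchronize()

after_alloc = torch.cuda.max_memory_allocated()
after_reserved = torch.cuda.max_memory_reserved()

print(f"max allocated: {before_alloc / 2**30:.2f} GiB -> {after_alloc / 2**30:.2f} GiB")
print(f"max reserved:  {before_reserved / 2**30:.2f} GiB -> {after_reserved / 2**30:.2f} GiB")
```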
I'd like to ask: how can I confirm that gradient checkpointing is actually enabled?
I also find that the parameter enable_gradient_checkpointing is not used by the Megatron workers. How can I enable gradient checkpointing in Megatron workers? Thanks.
Hi, @Vamix
For FSDP, we fixed the critic gradient checkpointing issue in this PR: #27. You can try it.
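In case it helps with verification, gradient checkpointing for an FSDP-wrapped Hugging Face model is typically enabled on the module before wrapping. A minimal sketch (not the exact code in the PR; the model path and config access are illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Load the critic module (placeholder path).
model = AutoModelForCausalLM.from_pretrained("path/to/critic")

# Enable checkpointing on the Hugging Face module *before* FSDP wrapping.
if config.critic.model.enable_gradient_checkpointing:  # illustrative config access
    model.gradient_checkpointing_enable()
    model.config.use_cache = False  # KV cache conflicts with checkpointing during training

fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
```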
For Megatron-LM, the ParallelLlama model does not yet support gradient checkpointing, so we cannot enable it in the Megatron workers. Would you like to add this feature to the ParallelLlama model?
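If anyone wants to pick this up, the usual approach is to wrap each decoder layer's forward in torch.utils.checkpoint during training so activations are recomputed in the backward pass. A rough sketch of what that could look like (layer and argument names are illustrative, not the actual ParallelLlama code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_decoder_layers(layers, hidden_states, attention_mask,
                           use_checkpointing: bool, training: bool):
    """Run decoder layers, optionally recomputing activations in backward.

    `layers`, `hidden_states`, and `attention_mask` stand in for the
    ParallelLlama internals; this is a sketch, not the model's real code.
    """
    for layer in layers:
        if use_checkpointing and training:
            # Do not store this layer's activations; recompute them in backward.
            hidden_states = checkpoint(
                layer, hidden_states, attention_mask, use_reentrant=False
            )
        else:
            hidden_states = layer(hidden_states, attention_mask)
    return hidden_states
```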