enable_gradient_checkpointing is not working #26

Open
Vamix opened this issue Nov 27, 2024 · 2 comments
Comments

@Vamix

Vamix commented Nov 27, 2024

Hi veRL team,
I have encountered OOM issues and want to reduce memory usage by enabling gradient checkpointing.
However, I find that the parameter critic.model.enable_gradient_checkpointing does not change the memory usage of the critic model in FSDP workers. Here is how I reached that conclusion: I profiled the max reserved memory and max allocated memory before and after update_critic(). Both values increased after calling update_critic(), so I believe they reflect the peak memory usage during update_critic().
But with critic.model.enable_gradient_checkpointing turned on, these values stay exactly the same as with it turned off.
How can I make sure gradient checkpointing is actually enabled?
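For reference, the measurement described above can be sketched roughly as follows. This is a minimal sketch, not veRL code: profile_peak_memory is a hypothetical helper, and it assumes a PyTorch CUDA setup. It uses the real torch.cuda calls reset_peak_memory_stats, max_memory_allocated and max_memory_reserved. Resetting the peak counters right before the call matters: otherwise the reported maximum may come from an earlier phase (e.g. rollout) and hide any saving from checkpointing.

```python
import torch


def profile_peak_memory(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and report peak CUDA memory (GiB) during the call.

    Hypothetical helper for illustration. Returns (result, stats), where stats
    is None when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return fn(*args, **kwargs), None
    torch.cuda.synchronize()
    # Reset peaks so earlier phases do not dominate the measurement.
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()  # make sure all kernels from fn have finished
    gib = 1024 ** 3
    stats = {
        "max_allocated_gib": torch.cuda.max_memory_allocated() / gib,
        "max_reserved_gib": torch.cuda.max_memory_reserved() / gib,
    }
    return result, stats
```

In a worker this might be called as profile_peak_memory(self.update_critic, data) (argument names assumed), once with the flag off and once with it on. With per-layer checkpointing actually active, max_allocated_gib would be expected to drop for long sequences.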

I also noticed that the parameter enable_gradient_checkpointing is not used by the Megatron workers. How can I enable gradient checkpointing in Megatron workers? Thanks.

@PeterSH6
Collaborator

Hi, @Vamix
For FSDP, we fixed the critic gradient checkpointing issue in PR #27. You can give it a try.

For Megatron-LM, since the ParallelLlama model does not support gradient checkpointing, we cannot enable it in the Megatron workers yet. Would you like to add this feature to the ParallelLlama model?
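As an illustration of what adding this could involve, here is a minimal sketch of per-layer activation checkpointing in a decoder stack. ToyDecoderLayer, ToyDecoder and the gradient_checkpointing attribute are hypothetical stand-ins, not the actual ParallelLlama classes, and tensor/sequence parallelism is ignored. The only real API used is torch.utils.checkpoint.checkpoint. use_reentrant=False is chosen so that parameter gradients are still computed when the layer input itself does not require grad.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyDecoderLayer(nn.Module):
    """Hypothetical stand-in for a ParallelLlama decoder layer."""

    def __init__(self, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class ToyDecoder(nn.Module):
    def __init__(self, hidden: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(ToyDecoderLayer(hidden) for _ in range(num_layers))
        self.gradient_checkpointing = False  # hypothetical switch a config flag would set

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.gradient_checkpointing and self.training and torch.is_grad_enabled():
                # Keep only the layer input; intermediate activations are
                # recomputed during backward, trading compute for memory.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```

Checkpointing changes only memory and compute, not the math, so gradients with the switch on should match those with it off. Comparing the two is a cheap sanity check when adding the feature.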

@Vamix
Author

Vamix commented Dec 2, 2024

Hi @PeterSH6, thanks for PR #27; it fixes the gradient checkpointing issue.

As for gradient checkpointing in Megatron-LM, I'll give it a try and let you know once I'm done.
