
Maintain fp32 for optimizer state when offloading is enabled #1223

Merged
merged 1 commit into AI-Hypercomputer:main on Feb 6, 2025

Conversation

zhenying-liu
Contributor

Address issues with optimizer state offloading and data type conversion.

We identified two issues concerning the conversion from fp32 to fp16 for the optimizer state when enabling optimizer state offloading:

1. The comparison between configurations without and with optimizer state offloading was unfair because the data sizes differed: the former used fp32 while the latter used fp16.
2. Two jit_train_step modules were generated, with separate versions for fp32 and fp16, which created inconsistencies.

This commit removes the fp32-to-fp16 conversion, ensuring that the optimizer state retains its original data type.
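For illustration only, here is a minimal JAX sketch (not the actual MaxText code; the `cast_floats` helper and the use of `optax.adam` are assumptions) of the kind of pytree-wide cast this change removes. Previously the optimizer state would be down-cast to fp16 before offloading; after this PR it is left in its original fp32, so the with/without-offloading comparison uses the same data sizes and only one jit_train_step specialization is generated.

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical setup; the real MaxText optimizer and state differ.
params = {"w": jnp.zeros((1024, 1024), dtype=jnp.float32)}
opt = optax.adam(1e-3)
opt_state = opt.init(params)

def cast_floats(tree, dtype):
    """Cast every floating-point leaf of a pytree to `dtype`."""
    return jax.tree_util.tree_map(
        lambda x: x.astype(dtype) if jnp.issubdtype(x.dtype, jnp.floating) else x,
        tree,
    )

# Before this PR (conceptually): the state was down-cast before offloading.
# offloaded_state = cast_floats(opt_state, jnp.float16)

# After this PR: the state keeps its original dtype (fp32 here).
offloaded_state = opt_state
```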

Previously, we observed no memory savings when switching from fp16 to fp32. The root cause is that the GPU memory scheduler does not distinguish between CPU memory and GPU memory. This XLA PR modifies the scheduler to exclude CPU memory and has been merged, so we can re-enable the CL (#1184).
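As a hedged illustration of what optimizer-state offloading amounts to at the JAX level (this is not the MaxText implementation; the array shape and the use of `SingleDeviceSharding` are assumptions, and the `pinned_host` memory kind requires a GPU/TPU backend with a recent JAX/XLA), a leaf of the state can be placed in host memory while keeping its fp32 dtype:

```python
import jax
import jax.numpy as jnp
from jax.sharding import SingleDeviceSharding

# Hypothetical single-device example; MaxText's actual offloading path differs.
dev = jax.devices()[0]
host_sharding = SingleDeviceSharding(dev, memory_kind="pinned_host")

opt_state_leaf = jnp.zeros((4096, 4096), dtype=jnp.float32)  # stays fp32
offloaded_leaf = jax.device_put(opt_state_leaf, host_sharding)

# Before the XLA scheduler fix, buffers residing in host ("pinned_host") memory
# were still counted against the GPU memory budget, so offloading showed no
# GPU memory savings; the fix excludes CPU-resident buffers from that budget.
```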

@zhenying-liu
Contributor Author

I verified that there were GPU memory savings with this PR and the XLA fix when optimizer state offloading was turned on, using llama2-7b. Also, only one jit_train_step is generated.

@zhenying-liu
Contributor Author

Please take a look at the code review. The failure "Error: No task list was present and requireChecklist is turned on" does not appear to be a real failure.

@copybara-service copybara-service bot merged commit 539d282 into AI-Hypercomputer:main Feb 6, 2025
12 of 13 checks passed