Maintain fp32 for optimizer state when offloading is enabled #1223
Address issues with optimizer state offloading and data type conversion.
We identified two issues with the fp32 to fp16 conversion of the optimizer state when optimizer state offloading is enabled:

1. The comparison between configurations without and with optimizer state offloading was unfair because the data sizes differed: the former used fp32 while the latter used fp16.
2. Maintaining two modules with `jit_train_step` (separate fp32 and fp16 versions) created inconsistencies.
This commit removes the fp32 to fp16 conversion, ensuring that the optimizer state retains its original data type.
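For illustration, a minimal sketch (using optax; names and shapes are illustrative, not the actual change) of what keeping the original dtype looks like, and of the kind of blanket downcast this commit removes:

```python
import jax
import jax.numpy as jnp
import optax

# fp32 parameters; the optimizer state inherits their dtype on init.
params = {"w": jnp.zeros((8, 8), dtype=jnp.float32)}
opt = optax.adam(1e-3)
state = opt.init(params)  # Adam moments are fp32, matching params

# Conceptually, what this change removes is a blanket downcast such as:
#   state = jax.tree_util.tree_map(lambda x: x.astype(jnp.float16), state)
# which made memory comparisons against the non-offloaded fp32 baseline unfair.
```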
Previously, we observed no memory savings from offloading when switching the optimizer state from fp16 back to fp32. The root cause is that the GPU memory scheduler did not distinguish between CPU memory and GPU memory. This XLA PR modifies the scheduler to exclude CPU memory; now that it has been merged, we can re-enable the CL (#1184).
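For context, a minimal JAX sketch of placing fp32 optimizer state in host memory via memory kinds, which is the kind of offloading the scheduler fix makes accurately accounted for (illustrative only; the actual offloading path in this repository may differ, and `pinned_host` requires a recent jaxlib on a TPU/GPU backend):

```python
import jax
import jax.numpy as jnp

dev = jax.devices()[0]
device_sharding = jax.sharding.SingleDeviceSharding(dev)
host_sharding = device_sharding.with_memory_kind("pinned_host")

# fp32 optimizer state kept in host memory instead of device HBM;
# with the scheduler fix, this no longer counts toward GPU memory.
state = jnp.zeros((4096, 4096), dtype=jnp.float32)
offloaded = jax.device_put(state, host_sharding)
print(offloaded.sharding.memory_kind)  # "pinned_host"
```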