resume checkpoint w/o deepspeed #1662
Replies: 2 comments
Update:
and set
Where 63056/1971 ≈ 32. In my resumed checkpoint I used 32 cards for training with DeepSpeed. So I infer that the usage above for resuming from a checkpoint file is correct, and what I need to change is how to refactor the saving code for DeepSpeed, right? Going further, that means I need to replace the saver operation in timm with DeepSpeed's save_checkpoint. Please correct me if there is any problem with this analysis!
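The inference above can be checked with a quick calculation. This is a sketch under my reading of the numbers (I'm assuming 63056 and 1971 are per-epoch step counts before and after switching to data-parallel training; the variable names are mine):

```python
# Assumption: 63056 = steps per epoch in the original single-card run,
# 1971 = steps per epoch after resuming with DeepSpeed on multiple cards.
# The ratio should then recover the data-parallel world size.
steps_single = 63056
steps_resumed = 1971
world_size = round(steps_single / steps_resumed)
print(world_size)  # -> 32, matching the 32 cards used in the resumed run
```

Since 1971 × 32 = 63072 ≈ 63056, the ratio is consistent with 32-way data parallelism (the small remainder comes from the last partial batch per rank).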
@tangjiasheng, thanks for your question. I don't fully understand all parts of your question, so I will address just some points.
So if I refactor my code with DeepSpeed, is there any sample code for loading and saving checkpoints with and without DeepSpeed?
To give a specific setting:
Suppose the original PyTorch code is built with the timm trainer. After rewriting the code with DeepSpeed, e.g.
and likewise for the lines inside train_one_epoch, here are some detailed questions I want to ask:
Thanks a lot for answering my questions.
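As a hedged sketch of what such dual-path load/save code could look like: the helper names and structure below are my own invention, not from timm. The only DeepSpeed calls used are the real `engine.save_checkpoint` / `engine.load_checkpoint` API (which handles sharded ZeRO state and an arbitrary `client_state` dict); the plain-PyTorch path uses `torch.save` / `torch.load` as a timm-style trainer would.

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_ckpt(model, optimizer, epoch, path, engine=None):
    """Save a checkpoint; route through DeepSpeed when an engine is given."""
    if engine is not None:
        # DeepSpeed path: the engine writes model/optimizer/scheduler state
        # (sharded under ZeRO) into a directory, tagged per epoch, plus any
        # extra client_state you want restored on resume.
        engine.save_checkpoint(os.path.dirname(path), tag=f"epoch{epoch}",
                               client_state={"epoch": epoch})
    else:
        # Plain-PyTorch path, analogous to what a timm-style saver does.
        torch.save({"epoch": epoch,
                    "state_dict": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)

def load_ckpt(model, optimizer, path, engine=None):
    """Load a checkpoint saved by save_ckpt; returns the saved epoch."""
    if engine is not None:
        # load_checkpoint returns (load_path, client_state).
        _, client_state = engine.load_checkpoint(os.path.dirname(path))
        return client_state["epoch"]
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]

# Usage (plain-PyTorch path only; the DeepSpeed path needs an initialized engine):
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "ckpt.pth")
    save_ckpt(model, opt, 3, p)
    model2 = nn.Linear(4, 2)
    opt2 = torch.optim.SGD(model2.parameters(), lr=0.1)
    resumed_epoch = load_ckpt(model2, opt2, p)
```

One caveat worth noting: a checkpoint written by `engine.save_checkpoint` under ZeRO is sharded across ranks, so it cannot be loaded with plain `torch.load`; DeepSpeed ships a `zero_to_fp32.py` script alongside such checkpoints to consolidate them into a single fp32 state dict for non-DeepSpeed use.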