resume checkpoint w/o deepspeed #1662
Replies: 2 comments
Update:
and set
Where 63056/1971 ≈ 32. In my resumed checkpoint I used 32 cards for training with DeepSpeed. So I infer that the usage above for resuming from a checkpoint file is correct, and what I need to change is how to refactor the saving code for DeepSpeed, right? Going further, that means I need to replace the saver operation in timm with DeepSpeed's save_checkpoint. Please correct me if there is any problem with this analysis!
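The inference above can be checked with a quick calculation. This is a sketch under my reading of the numbers (I'm assuming 63056 and 1971 are per-epoch step counts before and after switching to data-parallel training; the variable names are mine):

```python
# Assumption: 63056 = steps per epoch in the original single-card run,
# 1971 = steps per epoch after resuming with DeepSpeed on multiple cards.
# The ratio should then recover the data-parallel world size.
steps_single = 63056
steps_resumed = 1971
world_size = round(steps_single / steps_resumed)
print(world_size)  # -> 32, matching the 32 cards used in the resumed run
```

Since 1971 × 32 = 63072 ≈ 63056, the ratio is consistent with 32-way data parallelism (the small remainder comes from the last partial batch per rank).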
@tangjiasheng, thanks for your question. I don't fully understand all parts of your question, so I will address just some points.
So if I refactor my code with DeepSpeed, is there any sample code for loading and saving checkpoints with and without DeepSpeed?
To give a specific setting:
Suppose the original PyTorch code is built with the timm trainer. After rewriting the code with DeepSpeed, e.g.
and likewise for the lines inside train_one_epoch, here are some detailed questions I want to ask:
Thanks a lot for answering my questions.
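As a hedged sketch of what such dual-path load/save code could look like: the helper names and structure below are my own invention, not from timm. The only DeepSpeed calls used are the real `engine.save_checkpoint` / `engine.load_checkpoint` API (which handles sharded ZeRO state and an arbitrary `client_state` dict); the plain-PyTorch path uses `torch.save` / `torch.load` as a timm-style trainer would.

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_ckpt(model, optimizer, epoch, path, engine=None):
    """Save a checkpoint; route through DeepSpeed when an engine is given."""
    if engine is not None:
        # DeepSpeed path: the engine writes model/optimizer/scheduler state
        # (sharded under ZeRO) into a directory, tagged per epoch, plus any
        # extra client_state you want restored on resume.
        engine.save_checkpoint(os.path.dirname(path), tag=f"epoch{epoch}",
                               client_state={"epoch": epoch})
    else:
        # Plain-PyTorch path, analogous to what a timm-style saver does.
        torch.save({"epoch": epoch,
                    "state_dict": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)

def load_ckpt(model, optimizer, path, engine=None):
    """Load a checkpoint saved by save_ckpt; returns the saved epoch."""
    if engine is not None:
        # load_checkpoint returns (load_path, client_state).
        _, client_state = engine.load_checkpoint(os.path.dirname(path))
        return client_state["epoch"]
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]

# Usage (plain-PyTorch path only; the DeepSpeed path needs an initialized engine):
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "ckpt.pth")
    save_ckpt(model, opt, 3, p)
    model2 = nn.Linear(4, 2)
    opt2 = torch.optim.SGD(model2.parameters(), lr=0.1)
    resumed_epoch = load_ckpt(model2, opt2, p)
```

One caveat worth noting: a checkpoint written by `engine.save_checkpoint` under ZeRO is sharded across ranks, so it cannot be loaded with plain `torch.load`; DeepSpeed ships a `zero_to_fp32.py` script alongside such checkpoints to consolidate them into a single fp32 state dict for non-DeepSpeed use.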