Why does pp_rank equal torch.distributed.get_rank(group=self.optimizer.dp_process_group)? #4394
Unanswered
CrossEntropyZenk asked this question in Q&A
Replies: 1 comment
This pattern is commonly used in distributed training frameworks such as PyTorch's Distributed Data Parallel (DDP) to determine the rank of a process within a specific process group. The line you quoted returns the rank of the current process *within* the optimizer's data-parallel process group, not its global rank; that group-local rank is then used for things like deciding which shard of the optimizer state a process owns or which collectives it participates in. The returned value is simply the index of the process in that group's rank list, so it coincides with pp_rank only if the group happens to enumerate exactly one rank per pipeline stage, in stage order; whether that holds depends on how the framework constructs the group (see the sketch below).
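Here is a minimal, runnable toy sketch of how group-local ranks behave. It is not DeepSpeed's actual topology code: the 2x2 grid of pipeline stages by data-parallel replicas, and the way the groups are built, are illustrative assumptions.

```python
# Toy sketch (not DeepSpeed's code): 4 processes arranged as a 2 x 2 grid
# of pipeline stages x data-parallel replicas. Run with:
#   torchrun --nproc_per_node=4 group_rank_demo.py
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # gloo so this runs on CPU
    rank = dist.get_rank()                   # global rank, 0..3

    pp_size, dp_size = 2, 2
    pp_rank = rank // dp_size                # assumed pipeline stage of this rank
    dp_rank = rank % dp_size                 # assumed data-parallel index

    # Every process must create every group, then keep the one it belongs to.
    # One data-parallel group per pipeline stage.
    dp_groups = [dist.new_group([s * dp_size + d for d in range(dp_size)])
                 for s in range(pp_size)]
    my_dp_group = dp_groups[pp_rank]

    # get_rank(group=...) returns this process's index within the group's
    # rank list, which in general differs from its global rank.
    print(f"global={rank} pp_rank={pp_rank} dp_rank={dp_rank} "
          f"rank_in_dp_group={dist.get_rank(group=my_dp_group)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In this particular layout, each rank's group-local rank equals its dp_rank. It would instead equal pp_rank only if the group passed in (here, what the code calls dp_process_group) listed one rank per pipeline stage in stage order, which is why the answer to "why are they equal" lies in how that group was constructed.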