Gradient synchronization in ZeRO-1/2/3 #5560

R0n12 · 2024-05-22T21:13:45Z

@tjruwase We are currently profiling communication across ZeRO stages 1, 2, and 3 using the GPT-NeoX 19M config. We enabled comms_logger to see which collectives are actually called, along with their message-size distribution and frequency.

Our understanding is that ZeRO-1 should implement the gradient allreduce as reduce-scatter + allgather. However, our profiling results show a single large allreduce (134.33 MB) instead. Is there a specific reason allreduce is used rather than reduce-scatter + allgather?
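To make the expectation concrete, here is a minimal single-process sketch of why reduce-scatter followed by allgather yields the same result as one allreduce. It simulates ranks as plain Python lists; the helper names (`all_reduce`, `reduce_scatter`, `all_gather`) are illustrative only and are not DeepSpeed or torch.distributed calls.

```python
# Simulated collectives: per_rank[r] is the gradient buffer held by rank r.

def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    n = len(per_rank)
    shard = len(per_rank[0]) // n  # assumes the buffer length is divisible by n
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * shard:(r + 1) * shard] for r in range(n)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Four hypothetical ranks, four gradient elements each.
grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]

via_all_reduce = all_reduce(grads)
via_rs_ag = all_gather(reduce_scatter(grads))
assert via_all_reduce == via_rs_ag
```

With optimizer-state partitioning, each rank only needs the reduced gradient shard it owns, which is why a reduce-scatter on gradients would appear sufficient, with the gather step applied later to the updated values.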

The same thing happens with ZeRO-2: it also calls only allreduce.

ZeRO-1

[2024-05-16 20:30:21,554] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: reduce_ipg_grads] | [Line Number: 1367] | time (ms): 6.14 | msg size: 134.33 MB | algbw (Gbps): 367.05 | busbw (Gbps): 321.17
[2024-05-16 20:30:21,953] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: _take_model_step] | [Line Number: 2075] | time (ms): 23.57 | msg size: 1.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:22,250] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 12.52 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:22,253] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 0.35 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:22,376] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_gather_into_tensor | [Caller Func: step] | [Line Number: 2169] | time (ms): 1.31 | msg size: 16.78 MB | algbw (Gbps): 862.25 | busbw (Gbps): 754.47
[2024-05-16 20:30:22,376] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_gather_into_tensor | [Caller Func: step] | [Line Number: 2169] | time (ms): 0.14 | msg size: 10.0 KB | algbw (Gbps): 4.82 | busbw (Gbps): 4.22
[2024-05-16 20:30:22,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.91 | optimizer_gradients: 0.27 | optimizer_step: 120.55
[2024-05-16 20:30:22,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2433.78 | bwd_microstep: 2228.13 | bwd_inner_microstep: 2218.78 | bwd_allreduce_microstep: 9.20 | step_microstep: 821.16
[2024-05-16 20:30:22,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2433.77 | bwd: 2228.16 | bwd_inner: 2218.77 | bwd_allreduce: 9.22 | step: 821.15
 samples/sec: 18.510 | iteration        1/      20 | elapsed time per iteration (ms): 6915.4 | learning rate: 9.964E-05 | approx flops per GPU: 1.9TFLOPS | lm_loss: 1.100205E+01 | loss scale: 65536.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 235.51904296875 | max allocated: 11768.20751953125 | reserved: 13152.0 | max reserved: 13152.0
time (ms) | forward: 3743.69 | backward: 2348.91 | backward-backward: 2348.87 | backward-allreduce: 0.00 | optimizer: 821.51 | batch generator: 384.37
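As a sanity check on the logged algbw and busbw columns, the sketch below recomputes them from the message size and time. The formulas are an assumption chosen because they reproduce the logged numbers, not a statement of how the logger is implemented. It also assumes a world size of 8 (not stated in this post; inferred from busbw / algbw = 0.875 = 7/8) and that "MB" in the log means 2^20 bytes.

```python
MIB = 1024 ** 2
WORLD_SIZE = 8  # assumption: inferred from the busbw/algbw ratio, not stated in the post

def bandwidth_gbps(op, msg_bytes, time_ms, world_size):
    """Return (algbw, busbw) in Gbps under the assumed formulas."""
    seconds = time_ms / 1e3
    n = world_size
    if op == "all_reduce":
        # Assumed: algorithmic bandwidth counts the payload twice (reduce + broadcast).
        algbw = 2 * msg_bytes / seconds
        busbw = (msg_bytes / seconds) * (2 * (n - 1) / n)
    elif op == "all_gather_into_tensor":
        # Assumed: the logged size is the per-rank input, so the total is n times that.
        total = msg_bytes * n
        algbw = total / seconds
        busbw = algbw * (n - 1) / n
    else:
        raise ValueError(f"unsupported op: {op}")
    # Convert bytes/s to gigabits/s.
    return algbw * 8 / 1e9, busbw * 8 / 1e9

# Values copied from the ZeRO-1 log lines.
ar_algbw, ar_busbw = bandwidth_gbps("all_reduce", 134.33 * MIB, 6.14, WORLD_SIZE)
ag_algbw, ag_busbw = bandwidth_gbps("all_gather_into_tensor", 16.78 * MIB, 1.31, WORLD_SIZE)

print(f"all_reduce: algbw={ar_algbw:.2f} Gbps, busbw={ar_busbw:.2f} Gbps")
print(f"all_gather: algbw={ag_algbw:.2f} Gbps, busbw={ag_busbw:.2f} Gbps")
```

Under these assumptions the all_reduce figures land very close to the logged 367.05 / 321.17 Gbps, and the all_gather figures land within about 1% of the logged 862.25 / 754.47 Gbps. The small gap is plausibly due to rounding of the printed time and size.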

config:

"zero_optimization": {
        "stage": 1, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true, 
        "round_robin_gradients": true
    },

ZeRO-2

[2024-05-16 20:30:56,092] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: reduce_ipg_grads] | [Line Number: 1367] | time (ms): 106.52 | msg size: 134.33 MB | algbw (Gbps): 21.16 | busbw (Gbps): 18.51
[2024-05-16 20:30:56,477] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: _take_model_step] | [Line Number: 2075] | time (ms): 7.05 | msg size: 1.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:56,758] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 1.04 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:56,760] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 0.18 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00
[2024-05-16 20:30:56,879] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_gather_into_tensor | [Caller Func: step] | [Line Number: 2169] | time (ms): 1.44 | msg size: 16.78 MB | algbw (Gbps): 779.63 | busbw (Gbps): 682.18
[2024-05-16 20:30:56,879] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_gather_into_tensor | [Caller Func: step] | [Line Number: 2169] | time (ms): 0.15 | msg size: 10.0 KB | algbw (Gbps): 4.46 | busbw (Gbps): 3.90
[2024-05-16 20:30:56,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.13 | optimizer_gradients: 0.32 | optimizer_step: 115.91
[2024-05-16 20:30:56,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2632.14 | bwd_microstep: 2102.36 | bwd_inner_microstep: 1992.95 | bwd_allreduce_microstep: 109.33 | step_microstep: 786.15
[2024-05-16 20:30:56,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2632.13 | bwd: 2102.36 | bwd_inner: 1992.94 | bwd_allreduce: 109.34 | step: 786.16
 samples/sec: 18.358 | iteration        1/      20 | elapsed time per iteration (ms): 6972.5 | learning rate: 9.964E-05 | approx flops per GPU: 1.9TFLOPS | lm_loss: 1.100205E+01 | loss scale: 65536.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 235.51904296875 | max allocated: 11768.20751953125 | reserved: 13152.0 | max reserved: 13152.0
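For the message-size distribution and frequency, a small parser over these log lines can aggregate calls per collective and caller. This is a sketch that assumes the line format shown in this post; the regex, the `summarize` helper, and the interpretation of KB/MB as powers of 1024 are illustrative assumptions, not part of DeepSpeed.

```python
import re
from collections import defaultdict

# Matches the "comm op" lines in the format pasted in this post.
LINE_RE = re.compile(
    r"comm op: (?P<op>\w+) \| \[Caller Func: (?P<caller>\w+)\] \| "
    r"\[Line Number: \d+\] \| time \(ms\): (?P<ms>[\d.]+) \| "
    r"msg size: (?P<size>[\d.]+) (?P<unit>B|KB|MB|GB)"
)
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}  # assumed binary units

def summarize(lines):
    """Aggregate call count, total bytes, and total time per (op, caller)."""
    stats = defaultdict(lambda: {"calls": 0, "bytes": 0.0, "ms": 0.0})
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip timer, throughput, and memory lines
        entry = stats[(m["op"], m["caller"])]
        entry["calls"] += 1
        entry["bytes"] += float(m["size"]) * UNIT_BYTES[m["unit"]]
        entry["ms"] += float(m["ms"])
    return dict(stats)

# Lines copied from the ZeRO-2 log in this post.
sample = [
    "[2024-05-16 20:30:56,092] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: reduce_ipg_grads] | [Line Number: 1367] | time (ms): 106.52 | msg size: 134.33 MB | algbw (Gbps): 21.16 | busbw (Gbps): 18.51",
    "[2024-05-16 20:30:56,758] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 1.04 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00",
    "[2024-05-16 20:30:56,760] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_reduce | [Caller Func: step] | [Line Number: 2169] | time (ms): 0.18 | msg size: 4.0 B | algbw (Gbps): 0.00 | busbw (Gbps): 0.00",
    "[2024-05-16 20:30:56,879] [INFO] [logging.py:96:log_dist] [Rank 0] comm op: all_gather_into_tensor | [Caller Func: step] | [Line Number: 2169] | time (ms): 1.44 | msg size: 16.78 MB | algbw (Gbps): 779.63 | busbw (Gbps): 682.18",
    "[2024-05-16 20:30:56,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.13 | optimizer_gradients: 0.32 | optimizer_step: 115.91",
]

stats = summarize(sample)
for (op, caller), s in sorted(stats.items()):
    print(f"{op:<24} {caller:<18} calls={s['calls']} bytes={s['bytes']:.0f} ms={s['ms']:.2f}")
```

Keying on both the op and the caller keeps the large gradient all_reduce from `reduce_ipg_grads` separate from the few-byte all_reduce calls issued from `step`.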

config:

"zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true, 
        "round_robin_gradients": true
    },

@Quentin-Anthony @jahatef @BTMichalowicz
