hang
[2025-01-20 09:36:57] iteration 60/ 122 | consumed samples: 480 | elapsed time per iteration (ms): 327.0 | throughput per GPU (TFLOP/s/GPU): 136.7 | learning rate: 5.615866E-06 | global batch size: 8 | lm loss: 2.298809E+00 | loss scale: 1.0 | grad norm: 3.736 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 60 to /data/wyl_data/output_mcore_qwen2_pretrain_fcp/checkpoint/pretrain-mcore-llama3-1-7B-lr-1e-5-minlr-1e-6-bs-1-gbs-8-seqlen-1024-pr-bf16-tp-1-pp-1-cp-1-ac-false-do-true-sp-false-ti-122-wi-0
[2025-01-20 09:36:57,064] [INFO] [engine.py:356:save_state_dict_to_memory] 3-3 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,064] [INFO] [engine.py:356:save_state_dict_to_memory] 5-5 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,065] [INFO] [engine.py:356:save_state_dict_to_memory] 4-4 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,066] [INFO] [engine.py:356:save_state_dict_to_memory] 7-7 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,067] [INFO] [engine.py:356:save_state_dict_to_memory] 1-1 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,067] [INFO] [engine.py:356:save_state_dict_to_memory] 0-0 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,069] [INFO] [engine.py:356:save_state_dict_to_memory] 6-6 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,069] [INFO] [engine.py:356:save_state_dict_to_memory] 2-2 acquired the lock of shared memory: True for step: 60.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 2 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 6 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 4 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:57,147] [INFO] [engine.py:104:wrapper] Local rank 7 execute save_to_memory in 0.084s.
[2025-01-20 09:36:57,148] [INFO] [engine.py:104:wrapper] Local rank 5 execute save_to_memory in 0.085s.
[2025-01-20 09:36:57,148] [INFO] [engine.py:104:wrapper] Local rank 3 execute save_to_memory in 0.085s.
[2025-01-20 09:36:57,149] [INFO] [engine.py:104:wrapper] Local rank 6 execute save_to_memory in 0.086s.
[2025-01-20 09:36:57,150] [INFO] [engine.py:104:wrapper] Local rank 4 execute save_to_memory in 0.086s.
[2025-01-20 09:36:57,150] [INFO] [engine.py:104:wrapper] Local rank 2 execute save_to_memory in 0.087s.
[2025-01-20 09:36:57,151] [INFO] [engine.py:104:wrapper] Local rank 1 execute save_to_memory in 0.087s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 0 execute save_to_memory in 1.1s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 2 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 1 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 3 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 4 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 5 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 6 execute save_to_storage in 1.103s.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 7 execute save_to_storage in 1.103s.
[2025-01-20 09:37:01,219] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:06,225] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:11,227] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:16,233] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:21,238] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:26,244] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:31,249] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:36,254] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:41,260] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:46,265] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:51,271] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:37:56,276] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:38:01,281] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:38:06,287] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:38:11,292] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
[2025-01-20 09:38:16,298] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
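To make the pattern in the log easier to read, here is a minimal sketch of a script that tallies which ranks logged a "skips the save" message and what the last "ready shards" count was. It is not part of the training framework: the `summarize` function, the regexes, and the abbreviated `LOG` excerpt are my own assumptions, written only against the message text shown above.

```python
import re

# Abbreviated excerpt of the log above: one line per message type.
LOG = """\
[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 0 execute save_to_memory in 1.1s.
[2025-01-20 09:37:01,219] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
"""

# Patterns match the message text exactly as it appears in the log.
SKIP_RE = re.compile(r"Rank (\d+) skips the save")
READY_RE = re.compile(r"The number of ready shards is (\d+) != (\d+)")


def summarize(log_text):
    """Return (sorted ranks that skipped the in-memory save, last (ready, expected) pair)."""
    skipped = set()
    ready = None
    for line in log_text.splitlines():
        m = SKIP_RE.search(line)
        if m:
            skipped.add(int(m.group(1)))
            continue
        m = READY_RE.search(line)
        if m:
            ready = (int(m.group(1)), int(m.group(2)))
    return sorted(skipped), ready


if __name__ == "__main__":
    skipped, ready = summarize(LOG)
    print("ranks that skipped the in-memory save:", skipped)
    print("ready shards (ready, expected):", ready)
```

Run against the full log above, this would list ranks 1 through 7 as having skipped the in-memory save and report 1 of 8 shards ready, which suggests that only rank 0 wrote its shard to shared memory at step 60.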