mcore with flash checkpoints error #1446

Open
Hiwyl opened this issue Jan 20, 2025 · 3 comments
Labels
question Further information is requested

Comments


Hiwyl commented Jan 20, 2025

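  • Parallel configuration: tp1pp1dp8 (tensor parallel 1, pipeline parallel 1, data parallel 8)
  • Training hangs when saving the checkpoint at iteration 60; the log is below: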
    [2025-01-20 09:36:57] iteration 60/ 122 | consumed samples: 480 | elapsed time per iteration (ms): 327.0 | throughput per GPU (TFLOP/s/GPU): 136.7 | learning rate: 5.615866E-06 | global batch size: 8 | lm loss: 2.298809E+00 | loss scale: 1.0 | grad norm: 3.736 | number of skipped iterations: 0 | number of nan iterations: 0 |
    saving checkpoint at iteration 60 to /data/wyl_data/output_mcore_qwen2_pretrain_fcp/checkpoint/pretrain-mcore-llama3-1-7B-lr-1e-5-minlr-1e-6-bs-1-gbs-8-seqlen-1024-pr-bf16-tp-1-pp-1-cp-1-ac-false-do-true-sp-false-ti-122-wi-0
    [2025-01-20 09:36:57,064] [INFO] [engine.py:356:save_state_dict_to_memory] 3-3 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,064] [INFO] [engine.py:356:save_state_dict_to_memory] 5-5 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,065] [INFO] [engine.py:356:save_state_dict_to_memory] 4-4 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,066] [INFO] [engine.py:356:save_state_dict_to_memory] 7-7 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,067] [INFO] [engine.py:356:save_state_dict_to_memory] 1-1 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,067] [INFO] [engine.py:356:save_state_dict_to_memory] 0-0 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,069] [INFO] [engine.py:356:save_state_dict_to_memory] 6-6 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,069] [INFO] [engine.py:356:save_state_dict_to_memory] 2-2 acquired the lock of shared memory: True for step: 60.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 2 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 6 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] Rank 4 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
    [2025-01-20 09:36:57,147] [INFO] [engine.py:104:wrapper] Local rank 7 execute save_to_memory in 0.084s.
    [2025-01-20 09:36:57,148] [INFO] [engine.py:104:wrapper] Local rank 5 execute save_to_memory in 0.085s.
    [2025-01-20 09:36:57,148] [INFO] [engine.py:104:wrapper] Local rank 3 execute save_to_memory in 0.085s.
    [2025-01-20 09:36:57,149] [INFO] [engine.py:104:wrapper] Local rank 6 execute save_to_memory in 0.086s.
    [2025-01-20 09:36:57,150] [INFO] [engine.py:104:wrapper] Local rank 4 execute save_to_memory in 0.086s.
    [2025-01-20 09:36:57,150] [INFO] [engine.py:104:wrapper] Local rank 2 execute save_to_memory in 0.087s.
    [2025-01-20 09:36:57,151] [INFO] [engine.py:104:wrapper] Local rank 1 execute save_to_memory in 0.087s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 0 execute save_to_memory in 1.1s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 2 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 1 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 3 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 4 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 5 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 6 execute save_to_storage in 1.103s.
    [2025-01-20 09:36:58,166] [INFO] [engine.py:104:wrapper] Local rank 7 execute save_to_storage in 1.103s.
    [2025-01-20 09:37:01,219] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:06,225] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:11,227] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:16,233] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:21,238] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:26,244] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:31,249] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:36,254] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:41,260] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:46,265] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:51,271] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:37:56,276] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:38:01,281] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:38:06,287] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:38:11,292] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
    [2025-01-20 09:38:16,298] [INFO] [ckpt_saver.py:1055:commit_checkpoint] The number of ready shards is 1 != 8.
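
For anyone triaging a similar hang, the log above can be summarized mechanically. The following is a minimal sketch, not part of DLRover: the function name `summarize_checkpoint_log` and the regular expressions are assumptions written against the message text shown in this report, and the sample uses two lines copied verbatim from it.

```python
import re

# Patterns are assumptions based on the log messages quoted in this issue.
SKIP_RE = re.compile(r"Rank (\d+) skips the save")
READY_RE = re.compile(r"The number of ready shards is (\d+) != (\d+)")


def summarize_checkpoint_log(log_text):
    """Return the ranks that skipped the in-memory save and the last shard count."""
    skipped = sorted({int(m.group(1)) for m in SKIP_RE.finditer(log_text)})
    ready = [(int(a), int(b)) for a, b in READY_RE.findall(log_text)]
    return {
        "skipped_ranks": skipped,
        # (ready, expected) from the most recent commit_checkpoint message, if any.
        "last_ready": ready[-1] if ready else None,
    }


# Two lines copied verbatim from the log in this report.
SAMPLE = (
    "[2025-01-20 09:36:57,147] [INFO] [engine.py:362:save_state_dict_to_memory] "
    "Rank 3 skips the save the checkpoint in CPU memory since it is saving the "
    "latest checkpoint from the CPU memory into the storage.\n"
    "[2025-01-20 09:37:01,219] [INFO] [ckpt_saver.py:1055:commit_checkpoint] "
    "The number of ready shards is 1 != 8.\n"
)

if __name__ == "__main__":
    print(summarize_checkpoint_log(SAMPLE))
```

Run against the full log above, this should list ranks 1 to 7 as having skipped the in-memory save while the committer keeps waiting for 8 ready shards. That pattern is consistent with only rank 0 having written a new shard, but it does not by itself identify the cause.
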
Hiwyl added the "question" (Further information is requested) label Jan 20, 2025

Hiwyl commented Jan 20, 2025

The same problem occurs whether the job is launched with dlrover-run or with torchrun.


Hiwyl commented Jan 21, 2025

  • Using tp2pp1dp4
  • dlrover: 0.4.0
  • mcore : 0.9.0
  • torch : 2.1.0

Image

BalaBalaYi (Collaborator) commented

We need the stack trace for each rank, especially the ranks that did not finish saving the checkpoint.
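
One possible way to collect those per-rank stacks is Python's standard `faulthandler` module. The sketch below is an assumption about how it could be wired in, not something DLRover or Megatron-Core provides: the helper name `enable_stack_dump` is hypothetical, and it would need to be called early in each rank's training process (Unix only).

```python
import faulthandler
import signal
import sys


def enable_stack_dump(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the Python stack of every thread when this process receives `signum`.

    `stream` must be a real file object with a file descriptor, because
    faulthandler writes to the descriptor directly from the signal handler.
    """
    faulthandler.register(signum, file=stream, all_threads=True, chain=False)


if __name__ == "__main__":
    enable_stack_dump()
    # ... start training here; while it hangs, send SIGUSR1 to each rank's PID.
```

With this in place, sending `kill -USR1 <pid>` to each hung rank should print where every thread is blocked. If the training script cannot be modified, an external sampler such as `py-spy dump --pid <pid>` is another option for an already-running process.
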
