tensor([[[nan, nan, nan, nan] appears during training #133

Open

CrazysCodes opened this issue Jan 3, 2025 · 0 comments

@CrazysCodes

I have hit tensor([[[nan, nan, nan, nan] in two consecutive training runs. This time it appeared at epoch 61, and I am not sure whether the cause is gradient explosion or something else. Any insight would be appreciated!

Epoch: [61] [450/529] eta: 0:03:54 lr: 0.000400 loss: 18.9591 (18.8312) loss_vfl: 0.5493 (0.5760) loss_bbox: 0.0470 (0.0610) loss_giou: 0.8212 (0.8973) loss_fgl: 0.8530 (0.8951) loss_vfl_aux_0: 0.6069 (0.6203) loss_bbox_aux_0: 0.0527 (0.0641) loss_giou_aux_0: 0.8506 (0.9305) loss_fgl_aux_0: 0.8965 (0.9097) loss_ddf_aux_0: 0.0302 (0.0336) loss_vfl_aux_1: 0.5820 (0.6043) loss_bbox_aux_1: 0.0469 (0.0611) loss_giou_aux_1: 0.8329 (0.8990) loss_fgl_aux_1: 0.8545 (0.8954) loss_ddf_aux_1: 0.0010 (0.0011) loss_vfl_pre: 0.6011 (0.6175) loss_bbox_pre: 0.0547 (0.0647) loss_giou_pre: 0.8570 (0.9302) loss_vfl_enc_0: 0.5552 (0.5805) loss_bbox_enc_0: 0.0734 (0.0819) loss_giou_enc_0: 1.2562 (1.1298) loss_vfl_dn_0: 0.5015 (0.5109) loss_bbox_dn_0: 0.0260 (0.0276) loss_giou_dn_0: 0.6742 (0.6495) loss_fgl_dn_0: 1.1377 (1.1564) loss_ddf_dn_0: 0.1591 (0.1573) loss_vfl_dn_1: 0.4592 (0.4684) loss_bbox_dn_1: 0.0200 (0.0205) loss_giou_dn_1: 0.5463 (0.5493) loss_fgl_dn_1: 1.0902 (1.1137) loss_ddf_dn_1: 0.0022 (0.0027) loss_vfl_dn_2: 0.4412 (0.4521) loss_bbox_dn_2: 0.0199 (0.0204) loss_giou_dn_2: 0.5470 (0.5468) loss_fgl_dn_2: 1.0903 (1.1132) loss_ddf_dn_2: 0.0000 (0.0000) loss_vfl_dn_pre: 0.5015 (0.5114) loss_bbox_dn_pre: 0.0258 (0.0277) loss_giou_dn_pre: 0.6734 (0.6502) time: 2.9735 data: 1.5957 max mem: 4055
tensor([[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]]], device='cuda:0', grad_fn=)
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/code/DETR/D-FINE/train.py", line 84, in
[rank0]: main(args)
[rank0]: File "/root/code/DETR/D-FINE/train.py", line 54, in main
[rank0]: solver.fit()
[rank0]: File "/root/code/DETR/D-FINE/src/solver/det_solver.py", line 63, in fit
[rank0]: train_stats = train_one_epoch(
[rank0]: File "/root/code/DETR/D-FINE/src/solver/det_engine.py", line 63, in train_one_epoch
[rank0]: loss_dict = criterion(outputs, targets, **metas)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/dfine_criterion.py", line 238, in forward
[rank0]: indices = self.matcher(outputs_without_aux, targets)['indices']
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/matcher.py", line 102, in forward
[rank0]: cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/box_ops.py", line 54, in generalized_box_iou
[rank0]: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
[rank0]: AssertionError
E0103 17:58:21.745000 2829 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2835) of binary: /root/env/miniconda3/envs/yolo/bin/python
Traceback (most recent call last):
File "/root/env/miniconda3/envs/yolo/bin/torchrun", line 8, in
sys.exit(main())
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-03_17:58:21
host : WINDOWS-QDKD0IG.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2835)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
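
For reference, here is a minimal sketch of what I plan to try next to locate where the NaNs first appear. This is not the actual D-FINE training loop; `model`, `criterion`, `optimizer`, `samples`, and `targets` are placeholders for the objects built in train.py / det_engine.py, and only standard PyTorch APIs are used:

```python
import torch

# Report the backward op that first produces NaN gradients (slow, debug only).
torch.autograd.set_detect_anomaly(True)

def train_step(model, criterion, optimizer, samples, targets, max_norm=0.1):
    # Placeholder forward/loss calls; the real signatures live in det_engine.py.
    outputs = model(samples, targets)
    loss_dict = criterion(outputs, targets)
    loss = sum(loss_dict.values())

    # If the loss is already non-finite, skip the update instead of letting
    # the matcher hit the generalized_box_iou assertion on the next step.
    if not torch.isfinite(loss):
        print("non-finite loss, skipping step:",
              {k: float(v) for k, v in loss_dict.items()})
        optimizer.zero_grad(set_to_none=True)
        return

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Clip gradients in case the NaNs come from a sudden gradient explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```

The assertion in box_ops.py fires because the predicted boxes are already NaN (any comparison involving NaN is False), so the real question is what makes the loss or gradients non-finite in the first place; anomaly detection and the finite-loss guard above are only meant to narrow that down.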
