I have now hit tensor([[[nan, nan, nan, nan] in two consecutive training runs. This time it appeared at epoch 61, and I am not sure whether it is caused by exploding gradients or something else. Any pointers would be appreciated!
Epoch: [61] [450/529] eta: 0:03:54 lr: 0.000400 loss: 18.9591 (18.8312) loss_vfl: 0.5493 (0.5760) loss_bbox: 0.0470 (0.0610) loss_giou: 0.8212 (0.8973) loss_fgl: 0.8530 (0.8951) loss_vfl_aux_0: 0.6069 (0.6203) loss_bbox_aux_0: 0.0527 (0.0641) loss_giou_aux_0: 0.8506 (0.9305) loss_fgl_aux_0: 0.8965 (0.9097) loss_ddf_aux_0: 0.0302 (0.0336) loss_vfl_aux_1: 0.5820 (0.6043) loss_bbox_aux_1: 0.0469 (0.0611) loss_giou_aux_1: 0.8329 (0.8990) loss_fgl_aux_1: 0.8545 (0.8954) loss_ddf_aux_1: 0.0010 (0.0011) loss_vfl_pre: 0.6011 (0.6175) loss_bbox_pre: 0.0547 (0.0647) loss_giou_pre: 0.8570 (0.9302) loss_vfl_enc_0: 0.5552 (0.5805) loss_bbox_enc_0: 0.0734 (0.0819) loss_giou_enc_0: 1.2562 (1.1298) loss_vfl_dn_0: 0.5015 (0.5109) loss_bbox_dn_0: 0.0260 (0.0276) loss_giou_dn_0: 0.6742 (0.6495) loss_fgl_dn_0: 1.1377 (1.1564) loss_ddf_dn_0: 0.1591 (0.1573) loss_vfl_dn_1: 0.4592 (0.4684) loss_bbox_dn_1: 0.0200 (0.0205) loss_giou_dn_1: 0.5463 (0.5493) loss_fgl_dn_1: 1.0902 (1.1137) loss_ddf_dn_1: 0.0022 (0.0027) loss_vfl_dn_2: 0.4412 (0.4521) loss_bbox_dn_2: 0.0199 (0.0204) loss_giou_dn_2: 0.5470 (0.5468) loss_fgl_dn_2: 1.0903 (1.1132) loss_ddf_dn_2: 0.0000 (0.0000) loss_vfl_dn_pre: 0.5015 (0.5114) loss_bbox_dn_pre: 0.0258 (0.0277) loss_giou_dn_pre: 0.6734 (0.6502) time: 2.9735 data: 1.5957 max mem: 4055
tensor([[[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]]], device='cuda:0', grad_fn=)
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/code/DETR/D-FINE/train.py", line 84, in
[rank0]: main(args)
[rank0]: File "/root/code/DETR/D-FINE/train.py", line 54, in main
[rank0]: solver.fit()
[rank0]: File "/root/code/DETR/D-FINE/src/solver/det_solver.py", line 63, in fit
[rank0]: train_stats = train_one_epoch(
[rank0]: File "/root/code/DETR/D-FINE/src/solver/det_engine.py", line 63, in train_one_epoch
[rank0]: loss_dict = criterion(outputs, targets, **metas)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/dfine_criterion.py", line 238, in forward
[rank0]: indices = self.matcher(outputs_without_aux, targets)['indices']
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/matcher.py", line 102, in forward
[rank0]: cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
[rank0]: File "/root/code/DETR/D-FINE/src/zoo/dfine/box_ops.py", line 54, in generalized_box_iou
[rank0]: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
[rank0]: AssertionError
E0103 17:58:21.745000 2829 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2835) of binary: /root/env/miniconda3/envs/yolo/bin/python
Traceback (most recent call last):
File "/root/env/miniconda3/envs/yolo/bin/torchrun", line 8, in
sys.exit(main())
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/env/miniconda3/envs/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-03_17:58:21
host : WINDOWS-QDKD0IG.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2835)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
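For context, below is the kind of guard I am thinking about adding around the optimizer step while debugging this: skip the update when the loss is already non-finite, and clip the global gradient norm so one exploding step cannot drive the decoder's box predictions to NaN before they reach generalized_box_iou. This is only a minimal, hypothetical sketch and not part of D-FINE; the names train_step, model, and max_norm are placeholders I made up for illustration.

```python
# Minimal sketch (not D-FINE code): guard one training step against
# non-finite losses and exploding gradients.
import torch
import torch.nn as nn


def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               loss: torch.Tensor,
               max_norm: float = 0.1) -> bool:
    """Apply one optimizer step; return False if the step was skipped."""
    # If the loss is already NaN/Inf, drop this batch entirely so a single
    # bad step cannot poison the weights.
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return False

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # clip_grad_norm_ returns the pre-clipping global norm, which is handy
    # for logging spikes shortly before the NaNs appear.
    total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        # Exploding/NaN gradients: skip the update instead of applying it.
        optimizer.zero_grad(set_to_none=True)
        return False

    optimizer.step()
    return True
```

To find where the first NaN is produced, I also plan to rerun a short stretch of training with torch.autograd.set_detect_anomaly(True), which makes the backward pass raise at the op that generated the non-finite value (at a noticeable speed cost).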