
Cudart error occurs when using YOLOv11 #69

Open
mark4653 opened this issue Jan 27, 2025 · 8 comments
Labels: implementation (Unimplemented feature(s))

Comments

mark4653 commented Jan 27, 2025

Error log:

PS C:\Users\mark4653\Documents\python\zluda> .\zluda.exe -- C:\Users\mark4653\Documents\python\python.exe C:\Users\mark4653\Documents\python\run.py
Ultralytics 8.3.68 🚀 Python-3.12.8 torch-2.5.1+cu118 CUDA:0 (AMD Radeon RX 6700 XT [ZLUDA], 12272MiB)
engine\trainer: task=detect, mode=train, model=yolo11n.pt, data=coco8.yaml, epochs=100, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=None, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\detect\train2

               from  n    params  module                                       arguments

0 -1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
1 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
2 -1 1 6640 ultralytics.nn.modules.block.C3k2 [32, 64, 1, False, 0.25]
3 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
4 -1 1 26080 ultralytics.nn.modules.block.C3k2 [64, 128, 1, False, 0.25]
5 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
6 -1 1 87040 ultralytics.nn.modules.block.C3k2 [128, 128, 1, True]
7 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
8 -1 1 346112 ultralytics.nn.modules.block.C3k2 [256, 256, 1, True]
9 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
10 -1 1 249728 ultralytics.nn.modules.block.C2PSA [256, 256, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 1 111296 ultralytics.nn.modules.block.C3k2 [384, 128, 1, False]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 1 32096 ultralytics.nn.modules.block.C3k2 [256, 64, 1, False]
17 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 1 86720 ultralytics.nn.modules.block.C3k2 [192, 128, 1, False]
20 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 1 378880 ultralytics.nn.modules.block.C3k2 [384, 256, 1, True]
23 [16, 19, 22] 1 464912 ultralytics.nn.modules.head.Detect [80, [64, 128, 256]]
YOLO11n summary: 319 layers, 2,624,080 parameters, 2,624,064 gradients, 6.6 GFLOPs

Transferred 499/499 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
thread '<unnamed>' panicked at zluda_runtime\src\cudart.rs:3233:5:
not implemented
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread '<unnamed>' panicked at core\src\panicking.rs:221:5:
panic in a function that cannot unwind
stack backtrace:
0: 0x7ffec02b7aa1 - _cudaUnregisterFatBinary
1: 0x7ffec02c47aa - _cudaUnregisterFatBinary
2: 0x7ffec02b6287 - _cudaUnregisterFatBinary
3: 0x7ffec02b78e5 - _cudaUnregisterFatBinary
4: 0x7ffec02b8de7 - _cudaUnregisterFatBinary
5: 0x7ffec02b8bc7 - _cudaUnregisterFatBinary
6: 0x7ffec02b9473 - _cudaUnregisterFatBinary
7: 0x7ffec02b92c2 - _cudaUnregisterFatBinary
8: 0x7ffec02b81ef - _cudaUnregisterFatBinary
9: 0x7ffec02b8efe - _cudaUnregisterFatBinary
10: 0x7ffec02ccbb5 - _cudaUnregisterFatBinary
11: 0x7ffec02ccc63 - _cudaUnregisterFatBinary
12: 0x7ffec02ccce1 - _cudaUnregisterFatBinary
13: 0x7ffec02b1c53 - cudaGetErrorString
14: 0x7ffedecdf730 - _CxxFrameHandler3
15: 0x7ffedecd33d8 - is_exception_typeof
16: 0x7ffef2414c96 - RtlCaptureContext2
17: 0x7ffec02b1c3a - cudaGetErrorString
18: 0x7ffeb1a835d3 - c10::cuda::c10_cuda_check_implementation
19: 0x7ffeb1a846cc - c10::cuda::device_synchronize
20: 0x7ffcbc406f01 - torch::lazy::GetPythonFrames
21: 0x7ffe16f4532a - PyObject_Vectorcall
22: 0x7ffe16f44cb5 - PyObject_Vectorcall
23: 0x7ffe16f461ad - PyEval_EvalFrameDefault
24: 0x7ffe16f4432c - PyFunction_Vectorcall
25: 0x7ffe16f3f2f8 - PyArg_CheckPositional
26: 0x7ffe16f24249 - Py_CheckFunctionResult
27: 0x7ffe16f4a0fd - PyEval_EvalFrameDefault
28: 0x7ffe16f98b4a - PyErr_SetNone
29: 0x7ffe16f98a0f - PyErr_SetNone
30: 0x7ffe16f23cb9 - Py_CheckFunctionResult
31: 0x7ffe16f44d4f - PyObject_Vectorcall
32: 0x7ffe16f44cb5 - PyObject_Vectorcall
33: 0x7ffe16f461ad - PyEval_EvalFrameDefault
34: 0x7ffe16f4011e - PyArg_CheckPositional
35: 0x7ffe16f3f6d8 - PyArg_CheckPositional
36: 0x7ffe16f3f538 - PyArg_CheckPositional
37: 0x7ffe16f44d4f - PyObject_Vectorcall
38: 0x7ffe16f44cb5 - PyObject_Vectorcall
39: 0x7ffe16f461ad - PyEval_EvalFrameDefault
40: 0x7ffe16f4432c - PyFunction_Vectorcall
41: 0x7ffe16f27313 - PyObject_FastCallDictTstate
42: 0x7ffe1704fbab - PyObject_Call_Prepend
43: 0x7ffe1704fad6 - PyBytesWriter_Resize
44: 0x7ffe16f45302 - PyObject_Vectorcall
45: 0x7ffe16f44cb5 - PyObject_Vectorcall
46: 0x7ffe16f461ad - PyEval_EvalFrameDefault
47: 0x7ffe16f4432c - PyFunction_Vectorcall
48: 0x7ffe16f3f2a4 - PyArg_CheckPositional
49: 0x7ffe16f253df - PyObject_Call
50: 0x7ffe16f252d3 - PyObject_Call
51: 0x7ffe16f49a7f - PyEval_EvalFrameDefault
52: 0x7ffe16f4432c - PyFunction_Vectorcall
53: 0x7ffe16f27313 - PyObject_FastCallDictTstate
54: 0x7ffe1704fbab - PyObject_Call_Prepend
55: 0x7ffe1704fad6 - PyBytesWriter_Resize
56: 0x7ffe16f45302 - PyObject_Vectorcall
57: 0x7ffe16f44cb5 - PyObject_Vectorcall
58: 0x7ffe16f461ad - PyEval_EvalFrameDefault
59: 0x7ffe16f84d18 - PyEval_EvalCode
60: 0x7ffe16f84689 - PyEval_EvalCode
61: 0x7ffe16f076a4 - PyObject_CallMethodObjArgs
62: 0x7ffe16f07768 - PyObject_CallMethodObjArgs
63: 0x7ffe16f68320 - PyImport_FixupExtensionObject
64: 0x7ffe16f67f1a - PyRun_SimpleFileObject
65: 0x7ffe16f67579 - PyRun_AnyFileObject
66: 0x7ffe16f673be - PyFile_GetLine
67: 0x7ffe16f6729b - PyFile_GetLine
68: 0x7ffe16fbcfcb - Py_RunMain
69: 0x7ffe16fbcd7d - Py_RunMain
70: 0x7ffe16fa9e2d - Py_Main
71: 0x7ff78d891230 -
72: 0x7ffef18a259d - BaseThreadInitThunk
73: 0x7ffef23caf38 - RtlUserThreadStart
thread caused non-unwinding panic. aborting.

I replaced PyTorch's cublas, cudart, cufft, cufftw, cusparse, and nvrtc DLLs with the corresponding ZLUDA DLLs.
I also tried setting the RUST_BACKTRACE=1 environment variable, but it did not make any difference.

import torch

# Disable cuDNN and the flash/memory-efficient SDP backends and keep only the
# math SDP backend (common workarounds when running PyTorch through ZLUDA).
torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

# Workaround: compute torch.topk on the CPU and move the results back to the
# original device.
_topk = torch.topk
def topk(tensor: torch.Tensor, *args, **kwargs):
    device = tensor.device
    values, indices = _topk(tensor.cpu(), *args, **kwargs)
    return torch.return_types.topk((values.to(device), indices.to(device),))
torch.topk = topk

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

The code above is what I ran.

lshqqytiger (Owner) commented Jan 27, 2025

Please try again using this DLL file.
cudart.zip

mark4653 (Author) commented Jan 28, 2025

File "C:\Users\mark4653\Documents\python\Lib\site-packages\ultralytics\utils\ops.py", line 61, in time
torch.cuda.synchronize(self.device)
File "C:\Users\mark4653\Documents\python\Lib\site-packages\torch\cuda_init_.py", line 954, in synchronize
return torch._C._cuda_synchronize()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tested it, but this time it failed with the error above.
PyTorch is installed as the cu118 build.
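If it helps to isolate the failure, I could probably patch around this single call the same way as the topk workaround above; a rough sketch (this only hides the unsupported call, so the Ultralytics timing numbers would become meaningless):

import torch

# Wrap torch.cuda.synchronize so that the "operation not supported" error from
# the ZLUDA runtime does not abort the run. Diagnostic workaround only.
_synchronize = torch.cuda.synchronize
def synchronize(device=None):
    try:
        _synchronize(device)
    except RuntimeError:
        pass
torch.cuda.synchronize = synchronize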

lshqqytiger (Owner) commented Jan 28, 2025

Please try this one.
cudart.zip

mark4653 (Author) commented:

File "C:\Users\mark4653\Documents\python\Lib\site-packages\ultralytics\models\yolo\detect\predict.py", line 25, in postprocess
preds = ops.non_max_suppression(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mark4653\Documents\python\Lib\site-packages\ultralytics\utils\ops.py", line 269, in non_max_suppression
x = x[xc[xi]] # confidence
~^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.97 GiB. GPU 0 has a total capacity of 11.98 GiB of which 11.76 GiB is free. Of the allocated memory 63.71 MiB is allocated by PyTorch, and 30.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The memory handling itself seems to be working now, but I have some questions. The error says "Tried to allocate 15.97 GiB", yet the code I tested only trains yolo11n, the smallest YOLOv11 model, on 4 images from coco8, which is nowhere near a workload that should need 16 GB of VRAM. When I ran the same code in a Colab environment earlier, training completed normally even though that GPU only had 15 GB of VRAM, and I was monitoring the actual VRAM consumption at the time.
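To double-check the numbers, a small probe like the following (purely illustrative) could be run right before training to show what the allocator actually holds:

import torch

# Print the allocator's actual bookkeeping so it can be compared with the
# 15.97 GiB the error message claims was requested.
device = torch.device("cuda:0")
print(f"allocated: {torch.cuda.memory_allocated(device) / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved(device) / 2**20:.1f} MiB")
print(torch.cuda.memory_summary(device, abbreviated=True))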

lshqqytiger (Owner) commented Jan 30, 2025

Try disabling caching

results = model.train(data="coco8.yaml", epochs=100, imgsz=640, cache=False)

mark4653 (Author) commented Jan 30, 2025

The result seems to be the same, since the default value of the cache parameter is already False.

It seems to be related to Automatic Mixed Precision (AMP), because training gets further when the amp parameter is set to False.
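The modified call (only the amp argument is added; the rest of the script above is unchanged):

results = model.train(data="coco8.yaml", epochs=100, imgsz=640, amp=False)

Running with these parameters produces: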

Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

0%| | 0/1 [00:00<?, ?it/s]Exception Code: 0xC0000005
0x00007FF9F2E46DC3, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x886DC3 byte(s), hipMemUnmap() + 0x4D4AD3 byte(s)
0x00007FF9F2A667E7, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x4A67E7 byte(s), hipMemUnmap() + 0xF44F7 byte(s)
0x00007FF9F2A52604, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x492604 byte(s), hipMemUnmap() + 0xE0314 byte(s)
0x00007FF9F2A3FE50, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x47FE50 byte(s), hipMemUnmap() + 0xCDB60 byte(s)
0x00007FF9F29CE452, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x40E452 byte(s), hipMemUnmap() + 0x5C162 byte(s)
0x00007FF9F29CE561, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x40E561 byte(s), hipMemUnmap() + 0x5C271 byte(s)
0x00007FF9F2973EC7, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x3B3EC7 byte(s), hipMemUnmap() + 0x1BD7 byte(s)
0x00007FF9F297644F, C:\Windows\SYSTEM32\amdhip64.dll(0x00007FF9F25C0000) + 0x3B644F byte(s), hipMemUnmap() + 0x415F byte(s)
0x00007FFAF8E7259D, C:\Windows\System32\KERNEL32.DLL(0x00007FFAF8E60000) + 0x1259D byte(s), BaseThreadInitThunk() + 0x1D byte(s)

The problem appears to occur somewhere around the hipMemUnmap() call, but the console did not give any more detail, so I don't know the specifics.

lshqqytiger (Owner) commented Jan 30, 2025

0xC0000005 is a memory access violation, which in some cases means out of memory; I'm not sure whether that is the case here.
If it is not an out-of-memory error, it is probably driver related.
What version of the HIP SDK do you have? I don't have amdhip64.dll in System32, only amdhip64_6.dll in my HIP SDK folder.

mark4653 (Author) commented:

HIP SDK 5.7.0 (Core, Library Development, Library Runtime, Ray Tracing Development, Ray Tracing Runtime, Runtime Compiler Development, Runtime Compiler Runtime) is currently installed.

lshqqytiger added the implementation (Unimplemented feature(s)) label on Jan 31, 2025