CUDA error #4

OpenAskDragon · 2024-12-12T03:03:22Z

Hello, when I finish compiling and run the program, the following error occurs:

Error: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Exception raised from gemm<float> at ../aten/src/ATen/cuda/CUDABlas.cpp:427 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f7d9286e38b in /home/zwl/SLAM_package/libtorch211/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f7d92868f3f in /home/zwl/SLAM_package/libtorch211/lib/libc10.so)
frame #2: <unknown function> + 0x31b158b (0x7f7d961b158b in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x31e3b45 (0x7f7d961e3b45 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x2f6d668 (0x7f7d95f6d668 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #5: at::_ops::addmm_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xa7 (0x7f7de66052c7 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x43d29be (0x7f7de8bd29be in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #7: at::_ops::addmm_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xa7 (0x7f7de66052c7 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x3ba7952 (0x7f7de83a7952 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #9: at::_ops::addmm_::call(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a3 (0x7f7de666b903 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x110b73 (0x5594c33a9b73 in ./LightGlue)
frame #11: <unknown function> + 0x1080c7 (0x5594c33a10c7 in ./LightGlue)
frame #12: <unknown function> + 0x117808 (0x5594c33b0808 in ./LightGlue)
frame #13: <unknown function> + 0x118dbb (0x5594c33b1dbb in ./LightGlue)
frame #14: <unknown function> + 0x118556 (0x5594c33b1556 in ./LightGlue)
frame #15: <unknown function> + 0x117887 (0x5594c33b0887 in ./LightGlue)
frame #16: <unknown function> + 0x466456d (0x7f7de8e6456d in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x466283c (0x7f7de8e6283c in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x1403d79 (0x7f7de5c03d79 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0xceb1f (0x5594c3367b1f in ./LightGlue)
frame #20: <unknown function> + 0xdbaa9 (0x5594c3374aa9 in ./LightGlue)
frame #21: <unknown function> + 0xc889c (0x5594c336189c in ./LightGlue)
frame #22: <unknown function> + 0xbbd46 (0x5594c3354d46 in ./LightGlue)
frame #23: <unknown function> + 0xbe548 (0x5594c3357548 in ./LightGlue)
frame #24: <unknown function> + 0x4648a (0x5594c32df48a in ./LightGlue)
frame #25: <unknown function> + 0x47455 (0x5594c32e0455 in ./LightGlue)
frame #26: <unknown function> + 0x47f07 (0x5594c32e0f07 in ./LightGlue)
frame #27: <unknown function> + 0x1fd98 (0x5594c32b8d98 in ./LightGlue)
frame #28: <unknown function> + 0x29d90 (0x7f7d8a593d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: __libc_start_main + 0x80 (0x7f7d8a593e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #30: <unknown function> + 0x1e3b5 (0x5594c32b73b5 in ./LightGlue)

When I run with aliked-n16.pt and aliked_lightglue.pt, the following warnings appear:

[W TensorShape.cpp:3527] Warning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (function operator())
Warning: confidence_thresholds not found in model parameters or buffers

My CUDA version is 12.1, libtorch version is 2.1.1, and the compiler used is gcc 11.4.0.
When I step through with a debugger, the error is raised while this function executes:

        at::Tensor deform_conv2d(
            const at::Tensor& input,
            const at::Tensor& weight,
            const at::Tensor& offset,
            const at::Tensor& mask,
            const at::Tensor& bias,
            int64_t stride_h,
            int64_t stride_w,
            int64_t pad_h,
            int64_t pad_w,
            int64_t dilation_h,
            int64_t dilation_w,
            int64_t groups,
            int64_t offset_groups,
            bool use_mask) {
            C10_LOG_API_USAGE_ONCE("torchvision.csrc.ops.deform_conv2d.deform_conv2d");
            static auto op = c10::Dispatcher::singleton()
                                 .findSchemaOrThrow("torchvision::deform_conv2d", "")
                                 .typed<decltype(deform_conv2d)>();
            return op.call(
                input,
                weight,
                offset,
                mask,
                bias,
                stride_h,
                stride_w,
                pad_h,
                pad_w,
                dilation_h,
                dilation_w,
                groups,
                offset_groups,
                use_mask);
        }

Could you please tell me where the issue might be?
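
One thing that can be checked without a GPU is whether the tensor shapes passed to the function above are consistent. The sketch below is not taken from this repository and is not claimed to be the cause of the cuBLAS error. It only encodes the shape rules that the torchvision documentation gives for `deform_conv2d`: the usual convolution output-size formula, `2 * offset_groups * kh * kw` channels for `offset`, and `offset_groups * kh * kw` channels for `mask`. The helper names `conv_out_size`, `expected_offset_channels` and `expected_mask_channels` are hypothetical and exist only for illustration.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not part of torchvision or LightGlue): output size of
// a convolution along one spatial dimension, using the standard formula
//   out = (in + 2*pad - dilation*(kernel - 1) - 1) / stride + 1
std::int64_t conv_out_size(std::int64_t in, std::int64_t kernel,
                           std::int64_t stride, std::int64_t pad,
                           std::int64_t dilation) {
    if (in <= 0 || kernel <= 0 || stride <= 0 || dilation <= 0 || pad < 0) {
        throw std::invalid_argument(
            "in, kernel, stride, dilation must be positive; pad must be >= 0");
    }
    const std::int64_t effective_kernel = dilation * (kernel - 1) + 1;
    const std::int64_t padded = in + 2 * pad;
    if (padded < effective_kernel) {
        throw std::invalid_argument("input is smaller than the dilated kernel");
    }
    return (padded - effective_kernel) / stride + 1;
}

// Channel count expected for the `offset` tensor: one (dy, dx) pair per
// kernel tap and per offset group.
std::int64_t expected_offset_channels(std::int64_t offset_groups,
                                      std::int64_t kh, std::int64_t kw) {
    return 2 * offset_groups * kh * kw;
}

// Channel count expected for the `mask` tensor: one scalar per kernel tap
// and per offset group.
std::int64_t expected_mask_channels(std::int64_t offset_groups,
                                    std::int64_t kh, std::int64_t kw) {
    return offset_groups * kh * kw;
}
```

As an illustration with assumed values (a 3x3 kernel, stride 1, padding 1, dilation 1, one offset group), an input height of 376 gives `conv_out_size(376, 3, 1, 1, 1) == 376`, and `offset` and `mask` would need 18 and 9 channels respectively. Comparing these numbers against the actual `offset.sizes()` and `mask.sizes()` just before the call would rule out a shape mismatch as a contributing factor.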

MrNeRF · 2024-12-12T09:59:17Z

I had no issues with this checkpoint.
If you provide the images you used, I can try to debug it.

OpenAskDragon · 2024-12-12T11:24:55Z

Hello, I tried testing with images from the KITTI dataset, but the issue mentioned above still persists. I'm not sure where the problem lies in my code environment.

OpenAskDragon · 2024-12-12T11:31:35Z

Hello, could you please provide the CUDA version and LibTorch version you are using to run the code?

MrNeRF · 2024-12-12T12:30:29Z

I don't have time to test it again right now, but I also use CUDA 12.1 and https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu121.zip.

I will likely look into it tomorrow. It worked the last time I tested, but that may no longer hold for the latest state of the code.

OpenAskDragon · 2024-12-12T14:10:46Z

Hello, my environment is the same as yours (CUDA 12.1, libtorch 2.5.1). Could you tell me if there are any requirements for the image size in this algorithm?

dong-won-shin · 2025-01-07T01:55:09Z

I have the same issue.

Ubuntu 20.04
NVIDIA RTX 3080
CUDA 12.1
libtorch-cxx11-abi-shared-with-deps-2.5.1+cu121
docker environment

Master-cai mentioned this issue Dec 13, 2024

Compilation Error with Opencv #2

Closed
