CUDA error #4

Open
OpenAskDragon opened this issue Dec 12, 2024 · 6 comments

@OpenAskDragon

Hello, when I finish compiling and run the program, the following error occurs:

Error: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Exception raised from gemm<float> at ../aten/src/ATen/cuda/CUDABlas.cpp:427 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f7d9286e38b in /home/zwl/SLAM_package/libtorch211/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f7d92868f3f in /home/zwl/SLAM_package/libtorch211/lib/libc10.so)
frame #2: <unknown function> + 0x31b158b (0x7f7d961b158b in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x31e3b45 (0x7f7d961e3b45 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x2f6d668 (0x7f7d95f6d668 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cuda.so)
frame #5: at::_ops::addmm_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xa7 (0x7f7de66052c7 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x43d29be (0x7f7de8bd29be in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #7: at::_ops::addmm_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xa7 (0x7f7de66052c7 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x3ba7952 (0x7f7de83a7952 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #9: at::_ops::addmm_::call(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a3 (0x7f7de666b903 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x110b73 (0x5594c33a9b73 in ./LightGlue)
frame #11: <unknown function> + 0x1080c7 (0x5594c33a10c7 in ./LightGlue)
frame #12: <unknown function> + 0x117808 (0x5594c33b0808 in ./LightGlue)
frame #13: <unknown function> + 0x118dbb (0x5594c33b1dbb in ./LightGlue)
frame #14: <unknown function> + 0x118556 (0x5594c33b1556 in ./LightGlue)
frame #15: <unknown function> + 0x117887 (0x5594c33b0887 in ./LightGlue)
frame #16: <unknown function> + 0x466456d (0x7f7de8e6456d in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x466283c (0x7f7de8e6283c in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x1403d79 (0x7f7de5c03d79 in /home/zwl/SLAM_package/libtorch211/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0xceb1f (0x5594c3367b1f in ./LightGlue)
frame #20: <unknown function> + 0xdbaa9 (0x5594c3374aa9 in ./LightGlue)
frame #21: <unknown function> + 0xc889c (0x5594c336189c in ./LightGlue)
frame #22: <unknown function> + 0xbbd46 (0x5594c3354d46 in ./LightGlue)
frame #23: <unknown function> + 0xbe548 (0x5594c3357548 in ./LightGlue)
frame #24: <unknown function> + 0x4648a (0x5594c32df48a in ./LightGlue)
frame #25: <unknown function> + 0x47455 (0x5594c32e0455 in ./LightGlue)
frame #26: <unknown function> + 0x47f07 (0x5594c32e0f07 in ./LightGlue)
frame #27: <unknown function> + 0x1fd98 (0x5594c32b8d98 in ./LightGlue)
frame #28: <unknown function> + 0x29d90 (0x7f7d8a593d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: __libc_start_main + 0x80 (0x7f7d8a593e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #30: <unknown function> + 0x1e3b5 (0x5594c32b73b5 in ./LightGlue)
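
One thing the trace itself can confirm is which binaries and shared libraries were actually loaded at run time, which helps when checking that the intended LibTorch install is the one in use. The following is a minimal sketch, assuming every frame line ends with ` in <path>)` as in the trace above; `loaded_libraries` is a hypothetical helper written for illustration and is not part of LibTorch.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: collect the distinct binary/library paths named in a
// c10-style backtrace. Assumes each frame line starts with "frame #" and
// ends with " in <path>)".
std::set<std::string> loaded_libraries(const std::string& trace) {
    std::set<std::string> paths;
    std::istringstream lines(trace);
    std::string line;
    while (std::getline(lines, line)) {
        // Skip anything that is not a stack frame (e.g. the error message).
        if (line.rfind("frame #", 0) != 0) continue;
        // Search from the end: symbol names earlier in the line may contain
        // parentheses, but the path is always the last " in ...)" group.
        const std::size_t in_pos = line.rfind(" in ");
        const std::size_t close = line.rfind(')');
        if (in_pos == std::string::npos || close == std::string::npos ||
            close < in_pos + 4) {
            continue;
        }
        paths.insert(line.substr(in_pos + 4, close - (in_pos + 4)));
    }
    return paths;
}
```

Applied to the trace above, this should list `./LightGlue`, the system `libc.so.6`, and three LibTorch libraries (`libc10.so`, `libtorch_cuda.so`, `libtorch_cpu.so`), all of them under `/home/zwl/SLAM_package/libtorch211/lib`.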

When I run with aliked-n16.pt and aliked_lightglue.pt, the following warning appears:

[W TensorShape.cpp:3527] Warning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (function operator())
Warning: confidence_thresholds not found in model parameters or buffers

My CUDA version is 12.1, libtorch version is 2.1.1, and the compiler used is gcc 11.4.0.
While debugging, I found that the error is raised when the following function is executed:

        at::Tensor deform_conv2d(
            const at::Tensor& input,
            const at::Tensor& weight,
            const at::Tensor& offset,
            const at::Tensor& mask,
            const at::Tensor& bias,
            int64_t stride_h,
            int64_t stride_w,
            int64_t pad_h,
            int64_t pad_w,
            int64_t dilation_h,
            int64_t dilation_w,
            int64_t groups,
            int64_t offset_groups,
            bool use_mask) {
            C10_LOG_API_USAGE_ONCE("torchvision.csrc.ops.deform_conv2d.deform_conv2d");
            static auto op = c10::Dispatcher::singleton()
                                 .findSchemaOrThrow("torchvision::deform_conv2d", "")
                                 .typed<decltype(deform_conv2d)>();
            return op.call(
                input,
                weight,
                offset,
                mask,
                bias,
                stride_h,
                stride_w,
                pad_h,
                pad_w,
                dilation_h,
                dilation_w,
                groups,
                offset_groups,
                use_mask);
        }

Could you please tell me where the issue might be?

@MrNeRF
Owner

MrNeRF commented Dec 12, 2024

I had no issues with this checkpoint.
If you provide the images you used, I can try to debug it.

@OpenAskDragon
Author

Hello, I tried testing with images from the KITTI dataset, but the issue mentioned above still persists. I'm not sure where the problem lies in my code environment.

@OpenAskDragon
Author

Hello, could you please provide the CUDA version and LibTorch version you are using to run the code?

@MrNeRF
Owner

MrNeRF commented Dec 12, 2024

I don't have time to test it again right now, but I also use CUDA 12.1 and https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu121.zip.

I will likely look into it tomorrow. It worked the last time I tested it, but that may no longer hold for the latest state of the code.

@OpenAskDragon
Author

Hello, my environment is the same as yours (CUDA 12.1, libtorch 2.5.1). Could you tell me whether this algorithm has any requirements on image size?

@dong-won-shin

dong-won-shin commented Jan 7, 2025

I have the same issue.

  • Ubuntu 20.04
  • NVIDIA RTX 3080
  • CUDA 12.1
  • libtorch-cxx11-abi-shared-with-deps-2.5.1+cu121
  • docker environment

image
