
batched_mul launches one kernel per matrix when eltypes differ rather than promoting #621

Open

THargreaves opened this issue Jan 7, 2025 · 2 comments

@THargreaves
Small bug that can lead to massive slowdowns.

Minimal example:

using CUDA, NNlib

A = CUDA.rand(TA, 2, 2, 100)  # TA, TB are element types, e.g. Float32 or Float64
B = CUDA.rand(TB, 2, 2, 100)
CUDA.@profile NNlib.batched_mul(A, B)

For TA = TB = Float32, we get the expected result: a single strided-batched CUBLAS kernel launch covering the whole batch.


Profiler ran for 86.78 µs, capturing 32 events.

Host-side activity: calling CUDA APIs took 27.18 µs (31.32% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                   │
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────┤
│   20.33% │   17.64 µs │     1 │                                      │ cudaLaunchKernel                                       │
│    6.04% │    5.25 µs │     3 │   1.75 µs ± 1.85   (  0.24 ‥ 3.81)   │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    3.57% │     3.1 µs │     1 │                                      │ cuMemAllocFromPoolAsync                                │
│    0.27% │  238.42 ns │     2 │ 119.21 ns ± 168.59 (   0.0 ‥ 238.42) │ cudaGetLastError                                       │
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 1.91 µs (2.20% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Name                                                                                                                                                                                                                                                        ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.20% │    1.91 µs │     1 │ void gemmSN_NN_kernel<double, 256, 4, 2, 8, 2, 4, false, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double>>(cublasGemmSmallNParams<cublasGemvTensorStridedBatched<double c ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         

When the element types differ, rather than promoting to a common type, the generic fallback method is called, which launches one kernel per matrix multiplication and is incredibly slow (a caller-side workaround is sketched after the profile below).

Profiler ran for 6.41 ms, capturing 3310 events.

Host-side activity: calling CUDA APIs took 354.53 µs (5.53% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│    3.64% │  233.65 µs │   100 │   2.34 µs ± 2.66   (  1.67 ‥ 28.13) │ cuLaunchKernel          │
│    0.11% │    6.91 µs │     1 │                                     │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 162.6 µs (2.54% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                                                                                                   ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.54% │   162.6 µs │   100 │   1.63 µs ± 0.17   (  1.43 ‥ 1.91) │ gpu_matmatmul_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
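As a stop-gap on the caller's side, one can promote both arguments to a common element type before calling batched_mul, so that the fast strided-batched path is used. A rough sketch (the function name here is illustrative, not part of NNlib):

using CUDA, NNlib

# Promote to a shared eltype so batched_mul dispatches to the fast batched
# CUBLAS kernel instead of the generic one-kernel-per-matrix fallback.
function batched_mul_promoted(A::AbstractArray{TA,3}, B::AbstractArray{TB,3}) where {TA,TB}
    T = promote_type(TA, TB)             # e.g. Float32 and Float64 promote to Float64
    A2 = eltype(A) === T ? A : T.(A)     # convert only when the eltype actually differs
    B2 = eltype(B) === T ? B : T.(B)
    NNlib.batched_mul(A2, B2)
end

With the A and B from the example above, batched_mul_promoted(A, B) should launch a single batched kernel even when TA and TB differ, at the cost of one extra copy of the narrower array.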
    
@mcabbott (Member) commented Jan 7, 2025

What are the types in your second example?

One quick solution is to simply make the slow path an error, which #612 allows. If you accidentally have Float64 showing up, you really want to fix it, not for batched_mul to choose the least-awful path.
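For illustration only (this is not the actual code in #612), such a guard could look roughly like:

using CUDA, NNlib

# Hypothetical sketch: reject mixed eltypes on the GPU instead of silently
# falling back to one generic kernel per matrix.
function strict_batched_mul(A::CuArray{TA,3}, B::CuArray{TB,3}) where {TA,TB}
    TA === TB || throw(ArgumentError(
        "batched_mul with mixed eltypes $TA and $TB would hit the slow generic path; convert one argument first"))
    NNlib.batched_mul(A, B)
end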

@THargreaves (Author)

> What are the types in your second example?

Ah sorry. They were Float32 and Float64.

> If you accidentally have Float64 showing up, you really want to fix it, not for batched_mul to choose the least-awful path.

I actually discovered this due to the reverse: code that I wanted to run in double precision had Float32s introduced, since that is the default element type used by CUDA.rand when none is specified.
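For context, a quick way to see that default (names and sizes here are just for illustration):

using CUDA

x = CUDA.rand(2, 2, 100)            # no eltype given: defaults to Float32
y = CUDA.rand(Float64, 2, 2, 100)   # Float64 must be requested explicitly
eltype(x), eltype(y)                # (Float32, Float64)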

But yes, I agree with the sentiment, and that seems like a sensible approach.
