
batched_mul launches one kernel per matrix when eltypes differ rather than promoting #621

Open

THargreaves opened this issue Jan 7, 2025 · 2 comments

@THargreaves
Small bug that can lead to massive slowdowns.

Minimal example:

using CUDA, NNlib

A = CUDA.rand(TA, 2, 2, 100)  # TA, TB are element types, e.g. Float32 or Float64
B = CUDA.rand(TB, 2, 2, 100)
CUDA.@profile NNlib.batched_mul(A, B)

For TA = TB = Float32, we get the expected result: a single strided-batched CUBLAS kernel launch covering the whole batch.


Profiler ran for 86.78 µs, capturing 32 events.

Host-side activity: calling CUDA APIs took 27.18 µs (31.32% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                   │
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────┤
│   20.33% │   17.64 µs │     1 │                                      │ cudaLaunchKernel                                       │
│    6.04% │    5.25 µs │     3 │   1.75 µs ± 1.85   (  0.24 ‥ 3.81)   │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    3.57% │     3.1 µs │     1 │                                      │ cuMemAllocFromPoolAsync                                │
│    0.27% │  238.42 ns │     2 │ 119.21 ns ± 168.59 (   0.0 ‥ 238.42) │ cudaGetLastError                                       │
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 1.91 µs (2.20% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Name                                                                                                                                                                                                                                                        ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.20% │    1.91 µs │     1 │ void gemmSN_NN_kernel<double, 256, 4, 2, 8, 2, 4, false, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double const>, cublasGemvTensorStridedBatched<double>>(cublasGemmSmallNParams<cublasGemvTensorStridedBatched<double c ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         

When the element types differ, rather than promoting to a common type, the generic fallback method is called, which launches one kernel per matrix multiplication and is incredibly slow (a caller-side workaround is sketched after the profile below).

Profiler ran for 6.41 ms, capturing 3310 events.

Host-side activity: calling CUDA APIs took 354.53 µs (5.53% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│    3.64% │  233.65 µs │   100 │   2.34 µs ± 2.66   (  1.67 ‥ 28.13) │ cuLaunchKernel          │
│    0.11% │    6.91 µs │     1 │                                     │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 162.6 µs (2.54% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                                                                                                   ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.54% │   162.6 µs │   100 │   1.63 µs ± 0.17   (  1.43 ‥ 1.91) │ gpu_matmatmul_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
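As a stop-gap on the caller's side, one can promote both arguments to a common element type before calling batched_mul, so that the fast strided-batched path is used. A rough sketch (the function name here is illustrative, not part of NNlib):

using CUDA, NNlib

# Promote to a shared eltype so batched_mul dispatches to the fast batched
# CUBLAS kernel instead of the generic one-kernel-per-matrix fallback.
function batched_mul_promoted(A::AbstractArray{TA,3}, B::AbstractArray{TB,3}) where {TA,TB}
    T = promote_type(TA, TB)             # e.g. Float32 and Float64 promote to Float64
    A2 = eltype(A) === T ? A : T.(A)     # convert only when the eltype actually differs
    B2 = eltype(B) === T ? B : T.(B)
    NNlib.batched_mul(A2, B2)
end

With the A and B from the example above, batched_mul_promoted(A, B) should launch a single batched kernel even when TA and TB differ, at the cost of one extra copy of the narrower array.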
    
@mcabbott (Member) commented Jan 7, 2025

What are the types in your second example?

One quick solution is to simply make the slow path an error, which #612 allows. If you accidentally have Float64 showing up, you really want to fix it, not for batched_mul to choose the least-awful path.
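For illustration only (this is not the actual code in #612), such a guard could look roughly like:

using CUDA, NNlib

# Hypothetical sketch: reject mixed eltypes on the GPU instead of silently
# falling back to one generic kernel per matrix.
function strict_batched_mul(A::CuArray{TA,3}, B::CuArray{TB,3}) where {TA,TB}
    TA === TB || throw(ArgumentError(
        "batched_mul with mixed eltypes $TA and $TB would hit the slow generic path; convert one argument first"))
    NNlib.batched_mul(A, B)
end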

@THargreaves (Author)

> What are the types in your second example?

Ah sorry. They were Float32 and Float64.

> If you accidentally have Float64 showing up, you really want to fix it, not for batched_mul to choose the least-awful path.

I actually discovered this due to the reverse: code that I wanted to run in double precision had Float32s introduced, since that is the default element type used by CUDA.rand when none is specified.
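For context, a quick way to see that default (names and sizes here are just for illustration):

using CUDA

x = CUDA.rand(2, 2, 100)            # no eltype given: defaults to Float32
y = CUDA.rand(Float64, 2, 2, 100)   # Float64 must be requested explicitly
eltype(x), eltype(y)                # (Float32, Float64)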

But yes, I agree with the sentiment, and that seems like a sensible approach.
