This repository is a starting point for learning CUDA basics. CUDA-specific syntax is explained in code comments. It currently covers vector addition and the 2D convolution operation.
- Vector Addition - Baseline, Pinned Memory, Unified Memory
- Convolution - Baseline, Tiled, cuDNN
- Requires CUDA to be installed on the machine. We use CUDA v11.7.
- To compile a program (program.cu), use
nvcc program.cu -o program
- To compile a program that uses the cuDNN library, use
nvcc program.cu -lcudnn -o program
- To profile the compiled program with the NVIDIA profiler (nvprof), use
nvprof ./program [arg1] [arg2] [arg3]
The number of arguments is program-specific. See the source of each program for details.
The vector addition code was tested with the following launch configurations:
- 1 block, 1 thread: slightly slower than the CPU, because a GPU thread runs at a lower clock speed.
- 1 block, 256 threads: speedup > 10x
- n blocks, 256 threads: speedup > 100x (n is chosen such that n * 256 = size of the vector)
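
The three launch configurations above can be tried with one kernel. What follows is a minimal sketch, not the code from this repository: the kernel name `vector_add`, the helper `run_vector_add`, and the vector size are hypothetical, and unified memory (`cudaMallocManaged`) is used only to keep the example short. The grid-stride loop is an assumption that lets the same kernel stay correct when fewer threads than elements are launched.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2 * stride, ...
// so the kernel is correct for any <<<blocks, threads>>> configuration.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs the kernel with the given configuration and returns
// the largest absolute error, or -1 if a CUDA call fails.
float run_vector_add(int n, int blocks, int threads) {
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = n * sizeof(float);
    // Unified memory is accessible from both host and device.
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    vector_add<<<blocks, threads>>>(a, b, c, n);
    // Wait for the kernel before reading the result on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    float max_err = 0.0f;
    for (int i = 0; ok && i < n; ++i) {
        max_err = fmaxf(max_err, fabsf(c[i] - 3.0f));
    }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok ? max_err : -1.0f;
}

int main() {
    const int n = 1 << 20;      // assumed vector size, a multiple of 256
    const int threads = 256;
    printf("1 block, 1 thread     : max error %f\n", run_vector_add(n, 1, 1));
    printf("1 block, 256 threads  : max error %f\n", run_vector_add(n, 1, threads));
    printf("n blocks, 256 threads : max error %f\n",
           run_vector_add(n, (n + threads - 1) / threads, threads));
    return 0;
}
```

The sketch only checks correctness. To compare runtimes across the three configurations, run the compiled binary under nvprof as described above.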
Hardware: Intel Xeon, NVIDIA Tesla P4
| Algorithm | Runtime (ms) | Speedup |
|---|---|---|
| Baseline | 88.626 | - |
| Tiled | 74.520 | 1.1892 |
| cuDNN library | 35.914 | 2.4677 |
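
To illustrate what the tiled variant changes relative to the baseline, here is a minimal sketch of a shared-memory tiled 2D convolution. It is not the implementation from this repository: the names `conv2d_tiled` and `run_conv2d`, the 16x16 tile, the 3x3 all-ones filter, and zero padding at the borders are all assumptions made for illustration. Like most deep learning libraries, it computes cross-correlation (the filter is not flipped).

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 16       // output tile width and height (assumed)
#define K 3           // filter width, must be odd (assumed)
#define R (K / 2)     // filter radius

__constant__ float d_filter[K * K];

// Each block stages its input tile plus a halo of R pixels in shared memory,
// so neighbouring threads reuse pixels instead of re-reading global memory.
__global__ void conv2d_tiled(const float *in, float *out, int w, int h) {
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    // Cooperative load; pixels outside the image are treated as zero.
    for (int ty = threadIdx.y; ty < TILE + 2 * R; ty += TILE) {
        for (int tx = threadIdx.x; tx < TILE + 2 * R; tx += TILE) {
            int ix = blockIdx.x * TILE + tx - R;
            int iy = blockIdx.y * TILE + ty - R;
            bool inside = ix >= 0 && ix < w && iy >= 0 && iy < h;
            tile[ty][tx] = inside ? in[iy * w + ix] : 0.0f;
        }
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int ox = blockIdx.x * TILE + threadIdx.x;
    int oy = blockIdx.y * TILE + threadIdx.y;
    if (ox < w && oy < h) {
        float acc = 0.0f;
        for (int fy = 0; fy < K; ++fy)
            for (int fx = 0; fx < K; ++fx)
                acc += d_filter[fy * K + fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
        out[oy * w + ox] = acc;
    }
}

// Hypothetical helper: runs the kernel and returns the largest absolute
// difference from a CPU reference, or -1 if a CUDA call fails.
float run_conv2d(int w, int h) {
    float h_filter[K * K];
    for (int i = 0; i < K * K; ++i) h_filter[i] = 1.0f;  // box-sum filter
    float *in = nullptr, *out = nullptr;
    if (cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter)) != cudaSuccess ||
        cudaMallocManaged(&in, w * h * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, w * h * sizeof(float)) != cudaSuccess) {
        cudaFree(in); cudaFree(out);
        return -1.0f;
    }
    for (int i = 0; i < w * h; ++i) in[i] = (float)(i % 7);

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    conv2d_tiled<<<grid, block>>>(in, out, w, h);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    float max_err = 0.0f;
    for (int y = 0; ok && y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float ref = 0.0f;  // CPU reference with the same zero padding
            for (int fy = 0; fy < K; ++fy)
                for (int fx = 0; fx < K; ++fx) {
                    int ix = x + fx - R, iy = y + fy - R;
                    if (ix >= 0 && ix < w && iy >= 0 && iy < h)
                        ref += h_filter[fy * K + fx] * in[iy * w + ix];
                }
            max_err = fmaxf(max_err, fabsf(out[y * w + x] - ref));
        }
    }
    cudaFree(in); cudaFree(out);
    return ok ? max_err : -1.0f;
}

int main() {
    // Image size deliberately not a multiple of TILE to exercise the edges.
    printf("max error vs CPU reference: %f\n", run_conv2d(100, 70));
    return 0;
}
```

The block size must be TILE x TILE for the shared-memory indexing to line up. The runtimes in the table come from the repository's own programs, not from this sketch.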