Optimized CUDA Kernels for Fast MobileNetV2 Inference
- ① Implement MobileNetV2 with PyTorch, and parse the given ONNX model with Python to analyze the network structure.
- ② Implement MobileNetV2 with C++ (only sequential layer structures and weights, no forward computation), and parse the given ONNX model with Python to extract the weights.
- ③ Implement wrappers and tests for cuDNN/cuBLAS primitives: Conv, Gemm, and Pool.
  - Here, Gemm can be implemented with cuBLAS or treated as a 1x1 Conv2d with cuDNN; we take the former approach.
- ④ Implement cuDNN-accelerated MobileNetV2 with the wrappers and the C++ network implemented above.
- ⑤ Implement and optimize CUDA kernels: Conv, Gemm, and Pool.
  - Here, Conv can be implemented with Im2Col + Gemm or with the Winograd algorithm; we implemented only the former.
- ⑥ Implement our Fast-MobileNetV2 as a whole.
- ⑦ Compare and optimize: e.g. parameter tuning, model-specific / hardware-specific optimization, ...
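The Im2Col + Gemm route from step ⑤ can be illustrated on the CPU. The following is a minimal sketch, not the repository's kernel: the names `im2col`, `gemm`, and `conv2d_im2col` are hypothetical, and it assumes a single image in CHW layout, row-major weights of shape (OC, C, KH, KW), and no groups or dilation, so depthwise convolutions are out of scope here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unroll one CHW image into a column matrix of shape (C*KH*KW) x (OH*OW).
// Positions that fall into the zero padding stay 0.
std::vector<float> im2col(const std::vector<float>& input,
                          int C, int H, int W,
                          int KH, int KW, int stride, int pad,
                          int& OH, int& OW) {
    assert(input.size() == static_cast<std::size_t>(C) * H * W);
    OH = (H + 2 * pad - KH) / stride + 1;
    OW = (W + 2 * pad - KW) / stride + 1;
    std::vector<float> col(static_cast<std::size_t>(C) * KH * KW * OH * OW, 0.0f);
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < KH; ++kh)
            for (int kw = 0; kw < KW; ++kw) {
                const int row = (c * KH + kh) * KW + kw;  // one row per kernel tap
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow) {
                        const int ih = oh * stride - pad + kh;
                        const int iw = ow * stride - pad + kw;
                        if (ih >= 0 && ih < H && iw >= 0 && iw < W)
                            col[(row * OH + oh) * OW + ow] = input[(c * H + ih) * W + iw];
                    }
            }
    return col;
}

// Naive row-major GEMM: out[M x N] = A[M x K] * B[K x N].
std::vector<float> gemm(const std::vector<float>& A, const std::vector<float>& B,
                        int M, int K, int N) {
    std::vector<float> out(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < N; ++n)
                out[m * N + n] += A[m * K + k] * B[k * N + n];
    return out;
}

// Convolution as weight[OC x (C*KH*KW)] * col[(C*KH*KW) x (OH*OW)].
// The result is the output feature map in CHW layout (OC x OH x OW).
std::vector<float> conv2d_im2col(const std::vector<float>& input,
                                 const std::vector<float>& weight,
                                 int C, int H, int W, int OC,
                                 int KH, int KW, int stride, int pad) {
    assert(weight.size() == static_cast<std::size_t>(OC) * C * KH * KW);
    int OH = 0, OW = 0;
    const std::vector<float> col = im2col(input, C, H, W, KH, KW, stride, pad, OH, OW);
    return gemm(weight, col, OC, C * KH * KW, OH * OW);
}
```

With `KH = KW = 1`, `stride = 1`, and `pad = 0`, the column matrix equals the input itself, so the convolution reduces to a single Gemm; this is the equivalence between Gemm and 1x1 Conv2d mentioned under step ③. A reference like this can be useful for checking a CUDA kernel on small inputs.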
Re-implement MobileNetV2 ONNX model with PyTorch and test inference:
(conda) >> cd mobilenet_v2/nn/onnx/
(conda) >> python pytorchMobileNetV2.py
Save weights in MobileNetV2 ONNX model to plain-text files:
(conda) >> cd mobilenet_v2/nn/weights/
(conda) >> python save_weights.py
Show MobileNetV2 topology in C++ and check loaded weights:
>> cd mobilenet_v2/nn/examples/
>> make show
>> ./show.out
>> make check
>> ./check.out
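Loading the saved weights on the C++ side could look roughly like the sketch below. The exact layout of the plain-text files is not described here, so this assumes whitespace-separated decimal floats and a caller that knows how many values a layer expects; `read_weights` is a hypothetical helper, not a function from this repo.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper: read whitespace-separated floats from a stream and
// verify that the count matches what the layer expects.
// Assumption: the weight files hold plain decimal numbers and nothing else.
std::vector<float> read_weights(std::istream& in, std::size_t expected) {
    std::vector<float> weights;
    weights.reserve(expected);
    float value = 0.0f;
    while (in >> value) {
        weights.push_back(value);
    }
    if (weights.size() != expected) {
        throw std::runtime_error("weight count mismatch");
    }
    return weights;
}
```

Taking a `std::istream&` keeps the helper usable both with a `std::ifstream` opened on a weight file and with a `std::istringstream` in tests. Checking the count at load time catches a mismatch between the layer shape and the file before any kernel runs.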
Show the CUDA and cuDNN versions:
>> cd mobilenet_v2/cudnn/
>> bash version.sh
Operator tests (cuDNN/cuBLAS wrappers):
>> cd mobilenet_v2/cudnn/tests/test_op/
>> make
>> ./testConv.o
>> ./testGemm.o
>> ./testPool.o
>> ./testAdd.o
Network test (cuDNN-accelerated MobileNetV2):
(conda) >> cd mobilenet_v2/cudnn/tests/test_net/
(conda) >> python generate_data.py
(conda) >> conda deactivate
>> make
>> ./testCudnnMobileNetV2.o
>> source ~/.bashrc
(conda) >> python compare_cudnn_onnx.py
Operator tests (Fast-MobileNetV2 CUDA kernels):
>> cd mobilenet_v2/fast_mobilenet/tests/test_op/
>> make
>> ./testConv.o
>> ./testGemm.o
>> ./testPool.o
>> ./testAdd.o
>> ./testIm2Col.o
Network test (Fast-MobileNetV2):
(conda) >> cd mobilenet_v2/fast_mobilenet/tests/test_net/
(conda) >> python generate_data.py
(conda) >> conda deactivate
>> make
>> ./testFastMobileNetV2.o
>> source ~/.bashrc
(conda) >> python compare_fast_onnx.py
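The network tests compare the GPU output against the ONNX reference. The comparison scripts in this repo are Python and their exact criteria are not shown here; the C++ sketch below only illustrates one common way to make such a check, using a maximum absolute difference and an absolute tolerance. The names `max_abs_diff` and `all_close` and the choice of tolerance are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two equally sized outputs.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// True if every element differs by at most atol (a hypothetical tolerance).
bool all_close(const std::vector<float>& a, const std::vector<float>& b, float atol) {
    return max_abs_diff(a, b) <= atol;
}
```

Exact equality is usually too strict for this kind of test, because a different summation order in a GPU kernel changes floating-point rounding; a small tolerance accounts for that.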
- NVIDIA Tesla V100 GPU
- CUDA version 10.2.89
- cuDNN version 8.2.4
- Run the Python sources of this repo in an Anaconda environment (we used Python 3.9.7).
- Do NOT run the CUDA sources of this repo in an Anaconda environment.
- MobileNetV2: Inverted Residuals and Linear Bottlenecks
- ONNX Python API
- cuDNN and cuBLAS API
- CUDA C++ Programming
- GPU Architecture and Compiler Optimization
