zhliuworks/Fast-MobileNetV2

Fast-MobileNetV2

Optimized CUDA Kernels for Fast MobileNetV2 Inference

Development Steps

  • ① Implement MobileNetV2 with PyTorch, and parse the given ONNX model with Python to analyze the network structure. --- mobilenet_v2/nn/onnx/
  • ② Implement MobileNetV2 in C++ (sequential layer structure and weights only, no forward computation), and parse the given ONNX model with Python to extract the weights. --- mobilenet_v2/nn/
  • ③ Implement wrappers and tests for cuDNN/cuBLAS primitives: Conv, Gemm, and Pool. --- mobilenet_v2/cudnn/
    • Gemm can be implemented with cuBLAS or treated as a 1x1 Conv2d in cuDNN; we take the former approach.
  • ④ Implement a cuDNN-accelerated MobileNetV2 using the wrappers and the C++ network implemented above. --- mobilenet_v2/cudnn/
  • ⑤ Implement and optimize CUDA kernels: Conv, Gemm, and Pool. --- mobilenet_v2/fast_mobilenet/
    • Conv can be implemented with Im2Col + Gemm or with the Winograd algorithm; we implemented only the former.
  • ⑥ Implement our Fast-MobileNetV2 as a whole. --- mobilenet_v2/fast_mobilenet/
  • ⑦ Compare and optimize: e.g. parameter tuning and model-specific or hardware-specific optimization.
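
The Im2Col + Gemm approach named in step ⑤ can be illustrated with a small CPU reference. The sketch below is pure Python on nested lists and is not the repository's CUDA kernel; the function names (`im2col`, `gemm`, `conv2d_im2col`) and the C x H x W / K x C x kh x kw layouts are assumptions chosen for illustration.

```python
def im2col(x, kh, kw, stride=1, pad=0):
    """Unfold a C x H x W input into a (C*kh*kw) x (out_h*out_w) matrix."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h = (h + 2 * pad - kh) // stride + 1
    out_w = (w + 2 * pad - kw) // stride + 1
    cols = []
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel row, kernel column) triple.
                row = []
                for oy in range(out_h):
                    for ox in range(out_w):
                        y = oy * stride + i - pad
                        xx = ox * stride + j - pad
                        inside = 0 <= y < h and 0 <= xx < w
                        # Positions that fall in the zero padding contribute 0.
                        row.append(x[ch][y][xx] if inside else 0.0)
                cols.append(row)
    return cols, out_h, out_w


def gemm(a, b):
    """Naive product of an M x K matrix and a K x N matrix."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][q] for p in range(k)) for q in range(n)]
            for row in a]


def conv2d_im2col(x, weight, stride=1, pad=0):
    """Convolution as one Gemm; weight is K x C x kh x kw (assumed layout)."""
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    cols, out_h, out_w = im2col(x, kh, kw, stride, pad)
    # Flatten each filter in (channel, row, column) order to match im2col rows.
    w_mat = [[v for ch in f for r in ch for v in r] for f in weight]
    out = gemm(w_mat, cols)
    # Reshape each K-row of length out_h*out_w back to out_h x out_w.
    return [[row[oy * out_w:(oy + 1) * out_w] for oy in range(out_h)]
            for row in out]
```

For a single-channel 3x3 input holding 1 to 9 and one 2x2 filter `[[1, 0], [0, 1]]`, `conv2d_im2col` returns `[[[6, 8], [12, 14]]]`. With a 1x1 kernel, stride 1, and no padding, `im2col` returns the input reshaped to C x (H*W), which is why a 1x1 convolution and a Gemm are interchangeable as noted in step ③.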

Test Steps

nn

  • Re-implement the MobileNetV2 ONNX model with PyTorch and test inference:

    (conda) >> cd mobilenet_v2/nn/onnx/
    (conda) >> python pytorchMobileNetV2.py
  • Save the weights of the MobileNetV2 ONNX model to plain-text files:

    (conda) >> cd mobilenet_v2/nn/weights/
    (conda) >> python save_weights.py
  • Show MobileNetV2 topology in C++ and check loaded weights:

    >> cd mobilenet_v2/nn/examples/
    >> make show
    >> ./show.out
    >> make check
    >> ./check.out

cudnn

  • Show the CUDA and cuDNN versions:

    >> cd mobilenet_v2/cudnn/
    >> bash version.sh
  • Operator tests:

    >> cd mobilenet_v2/cudnn/tests/test_op/
    >> make
    >> ./testConv.o
    >> ./testGemm.o
    >> ./testPool.o
    >> ./testAdd.o
  • Network test:

    (conda) >> cd mobilenet_v2/cudnn/tests/test_net/
    (conda) >> python generate_data.py
    (conda) >> conda deactivate
    >> make
    >> ./testCudnnMobileNetV2.o
    >> source ~/.bashrc
    (conda) >> python compare_cudnn_onnx.py

our kernels

  • Operator tests:

    >> cd mobilenet_v2/fast_mobilenet/tests/test_op/
    >> make
    >> ./testConv.o
    >> ./testGemm.o
    >> ./testPool.o
    >> ./testAdd.o
    >> ./testIm2Col.o
  • Network test:

    (conda) >> cd mobilenet_v2/fast_mobilenet/tests/test_net/
    (conda) >> python generate_data.py
    (conda) >> conda deactivate
    >> make
    >> ./testFastMobileNetV2.o
    >> source ~/.bashrc
    (conda) >> python compare_fast_onnx.py
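
The network tests end by comparing the C++/CUDA output against the ONNX reference. The contents of `compare_cudnn_onnx.py` and `compare_fast_onnx.py` are not shown here, so the following is only a sketch of what such a check typically involves, assuming both outputs are already loaded as flat lists of floats. The helper names and the tolerances are hypothetical.

```python
def argmax(values):
    """Index of the largest element (the predicted class for a logit vector)."""
    return max(range(len(values)), key=lambda i: values[i])


def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between the two outputs."""
    return max(abs(c - r) for r, c in zip(reference, candidate))


def outputs_match(reference, candidate, atol=1e-4, rtol=1e-3):
    """True when |candidate - reference| <= atol + rtol * |reference| everywhere.

    The tolerances are assumed defaults, not values taken from the repository.
    """
    if len(reference) != len(candidate):
        raise ValueError("output vectors differ in length")
    return all(abs(c - r) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))
```

Checking both the element-wise tolerance and the `argmax` agreement is a common choice: small floating-point differences between kernels are expected, while a changed top-1 prediction usually indicates a real bug.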

Test Environment

  • NVIDIA Tesla V100 GPU
  • CUDA version 10.2.89
  • cuDNN version 8.2.4
  • Run the Python sources of this repo in an Anaconda environment (we used Python 3.9.7)
  • Do NOT run the CUDA sources of this repo in an Anaconda environment

Reference

[1] Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2018.

[2] NVIDIA Corporation. "NVIDIA cuDNN Documentation." available at: https://docs.nvidia.com/deeplearning/cudnn/api/index.html

[3] NVIDIA Corporation. "NVIDIA cuBLAS Documentation." available at: https://docs.nvidia.com/cuda/cublas/index.html

[4] Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2016.

[5] Mark Harris. "CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops." available at: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

[6] Mark Harris. "Optimizing Parallel Reduction in CUDA." available at: https://vuduc.org/teaching/cse6230-hpcta-fa12/slides/cse6230-fa12--05b-reduction-notes.pdf
