onnx-benchmark

NPU Benchmark Tool with ONNXRT for Windows OS


RYZENAI-ONNX-CNNs-BENCHMARK

1 Introduction

The NPU benchmark tool is designed to measure the performance of ONNX-based Convolutional Neural Network (CNN) inference models. Ryzen AI processors combine Ryzen processor cores (CPU) with an AMD Radeon™ graphics engine (GPU) and a dedicated AI engine (NPU).

The 'performance_benchmark.py' tool is capable of reporting:

  1. Throughput, measured in frames per second
  2. Latency, measured in milliseconds
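As an illustration of what these two metrics mean (a minimal sketch, not the tool's actual implementation), they can be derived from wall-clock timing of an inference loop:

import time

def measure(run_inference, num_runs=100):
    # run_inference is a hypothetical callable that performs one inference
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - start)
    throughput_fps = num_runs / sum(latencies)           # frames per second
    latency_ms = 1000 * sum(latencies) / len(latencies)  # average latency in milliseconds
    return throughput_fps, latency_ms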

1.1 Notes for this release

Windows release supporting PHOENIX and STRIX devices.

2 Setup

2.1 Install Procedure

Prerequisite: Anaconda or Miniconda must be installed. It is advisable to create two separate Conda environments, one for the external GPU EP (if you have one) and another for the VitisAI EP. The following instructions create a Conda environment with the CPU and VitisAI Execution Providers only.

Follow the instructions in the official Ryzen AI 1.3.0 installation guide.

In the commands below, change the Conda environment name and installation path if you have altered them from the default settings.

When the installation has completed, assume for example that:

  • the generated Conda environment is named ryzen-ai-1.3.0,
  • the installation is placed in "C:\ProgramFiles\RyzenAI\1.3.0" (note: the space in "Program Files" has been removed, as it generates errors in some cases).

Set the corresponding variables:

set RYZEN_AI_CONDA_ENV_NAME=ryzen-ai-1.3.0
set RYZEN_AI_INSTALLATION_PATH=C:\ProgramFiles\RyzenAI\1.3.0

Add the required packages:

conda activate %RYZEN_AI_CONDA_ENV_NAME%
python -m pip install -r requirements-win.txt

Set some permanent environment variables.

For a STRIX device:

conda env config vars set XLNX_VART_FIRMWARE="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64\xclbins\strix\AMD_AIE2P_Nx4_Overlay.xclbin
conda env config vars set XLNX_TARGET_NAME=AMD_AIE2P_Nx4_Overlay
conda env config vars set XCLBINHOME="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64\xclbins\strix
conda env config vars set VAIP_CONFIG_HOME="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64

conda deactivate
conda activate %RYZEN_AI_CONDA_ENV_NAME%
conda env config vars list

For a PHOENIX device:

conda env config vars set XLNX_VART_FIRMWARE="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64\xclbins\phoenix\1x4.xclbin
conda env config vars set XLNX_TARGET_NAME=AMD_AIE2_Nx4_Overlay
conda env config vars set XCLBINHOME="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64\xclbins\phoenix
conda env config vars set VAIP_CONFIG_HOME="%RYZEN_AI_INSTALLATION_PATH%"\voe-4.0-win_amd64

conda deactivate
conda activate %RYZEN_AI_CONDA_ENV_NAME%
conda env config vars list

3 ONNX Benchmark

The performance_benchmark.py script measures the throughput and latency of ONNX-based CNN inference models running on the Ryzen AI device.

3.1 Parameters

The syntax to call the performance_benchmark.py script is the following:

usage: performance_benchmark.py
[-h] [--batchsize BATCHSIZE] [--calib CALIB] [--config CONFIG] [--core {STX_1x4,STX_4x4}]
[--execution_provider {CPU,VitisAIEP,iGPU,dGPU}] [--infinite {0,1}] [--instance_count INSTANCE_COUNT]
[--intra_op_num_threads INTRA_OP_NUM_THREADS] [--json JSON] [--log_csv LOG_CSV]
[--min_interval MIN_INTERVAL] [--model MODEL] [--no_inference {0,1}] [--nhwc {0,1}]
[--num NUM] [--num_calib NUM_CALIB] [--profile {0,1}] [--renew {0,1}]
[--timelimit TIMELIMIT] [--threads THREADS] [--update_opset {0,1}] [--verbose {0,1,2}]
[--warmup WARMUP]

All parameters are validated in the utilities.py script:

-h, --help            show this help message and exit
--batchsize BATCHSIZE, -b BATCHSIZE
  batch size: number of images processed at the same time by the model. VitisAIEP supports batchsize = 1. Default is 1
--calib CALIB         path to the ImageNet database, used for quantization with calibration. Default= .\Imagenet
--config CONFIG, -c CONFIG
  path to config json file. Default= <release>\vaip_config.json
--core {STX_1x4,STX_4x4}
  Which core to use with STRIX silicon. Default=STX_1x4
--execution_provider {CPU,VitisAIEP,iGPU,dGPU}, -e {CPU,VitisAIEP,iGPU,dGPU}
  Execution Provider selection. Default=CPU
--infinite {0,1}      If set to 1, the benchmark executes an infinite loop; combined with a time limit, this enables the test to run for a specified duration. Default=1
--instance_count INSTANCE_COUNT, -i INSTANCE_COUNT
  This parameter governs the parallelism of job execution. When the VitisAI EP is selected, it controls the number of DPU runners. The workload is always divided equally across the instances. Default=1
--intra_op_num_threads INTRA_OP_NUM_THREADS
  Controls the total number of INTRA threads used to run the model (INTRA = parallelize computation inside each operator). Specific to the VitisAI EP: the number of CPU threads enabled when an operator is resolved by the CPU. Affects performance but also CPU power consumption. For best performance, set intra_op_num_threads to 0 (total INTRA threads = number of physical CPU cores). For best power efficiency, set intra_op_num_threads to the smallest number (>0) that sustains close-to-optimal performance (likely 1 for most models running on the DPU). Default=1
--json JSON           Path to the file of parameters.
--log_csv LOG_CSV, -k LOG_CSV
  If this option is set to 1, measurement data will be appended to a CSV file. Default=0
--min_interval MIN_INTERVAL
  Minimum time interval (s) for running inferences. Default=0
--model MODEL, -m MODEL
  Path to the ONNX model
--no_inference {0,1}  When set to 1, the benchmark runs without inference, to establish a baseline for power measurements. Default=0
--nhwc {0,1}          When set to 1 enables the input tensor conversion from NCHW to NHWC. Default=0
--num NUM, -n NUM     The number of images loaded into memory and subsequently sent to the model. Default=100
--num_calib NUM_CALIB
  The number of images for calibration. Default=10
--profile {0,1}       if this option is set to 1 the ONNX profiler is enabled. Default = 0
--renew {0,1}, -r {0,1}
  If set to 1, clear the cache and recompile the model. Set to 0 to reuse the previously compiled model. Default=1
--timelimit TIMELIMIT, -l TIMELIMIT
  When used in conjunction with the --infinite option, it represents the maximum duration of the experiment. The default value is set to 10 seconds.
--threads THREADS, -t THREADS
  CPU threads. Default=1
--update_opset {0,1}  if set to 1 and the model has opset < 11, automatically update the model opset to 17. Default=0
--verbose {0,1,2}, -v {0,1,2}
  0 (default): no debug messages, 1: few debug messages, 2: all debug messages
--warmup WARMUP, -w WARMUP
  Number of warmup runs. Default=40

3.2 Measurement Report

The report consolidates performance measurements, together with system, resource, and environment snapshots, into either a JSON or CSV file. Each measurement is automatically saved in the file report_performance.json, along with the testing configuration. Each measurement can also be appended to a CSV file with the option --log_csv (or -k).
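For post-processing, the JSON report can be inspected with a few lines of Python (a minimal sketch; the exact key names inside report_performance.json are not listed here, so the whole structure is simply pretty-printed):

import json

# Load the report produced by the benchmark and pretty-print it,
# measurements together with the testing configuration.
with open("report_performance.json") as f:
    report = json.load(f)
print(json.dumps(report, indent=2))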

3.3 Example Usage

3.3.1 GUI for performance benchmarking

This user-friendly graphical interface assists users in composing command lines and launching the benchmark with all parameters preset. All option defaults are preloaded. The GUI adapts and automatically updates the available options based on the detected processor (PHOENIX or STRIX).

python test.py

(GUI screenshots)

3.3.2 Performance with a single instance using a CPU

This experiment repeatedly runs inference on a batch of 100 images (-n 100), sending one image at a time to the CPU (-e CPU), using an FP32 ResNet50 model (-m .\models\resnet50\resnet50_fp32.onnx).

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32.onnx -n 100 -e CPU

3.3.3 Performance with a single instance utilizing the NPU and model quantization

To ensure smooth execution of this example, it is advisable to download a set of ImageNet pictures in advance; around 100 images should suffice. When the VitisAI EP is chosen and the model is in FP32 format, the tool attempts to quantize the model before running the tests. The path to a folder containing images must be specified with the --calib parameter. Additionally, --num_calib 10 randomly selects and copies ten images from the dataset into a calibration folder.

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32.onnx -n 100 -e VitisAIEP --calib .\<images folder> --num_calib 10

Please be aware that a new model with the suffix _int8 will be generated in the same folder as the original FP32 model.

3.3.4 Performance with a single instance on the NPU

This experiment repeatedly runs inference on a batch of 100 images (-n 100), sending one image at a time to the NPU (-e VitisAIEP): most operators are processed by the NPU. When the NPU is used, a configuration file is loaded from the installation folder by default, or a custom configuration can be assigned if available: --config .\models\resnet50\vaip_config_custom.json. The compiler loads the quantized model (-m .\models\resnet50\resnet50_fp32_qdq.onnx) and stores the compiled model in a cache folder. Finally, the images are sent to the compiled model to be processed. The reported information includes the ratio of operators assigned to the CPU versus those assigned to the NPU.

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32_qdq.onnx -n 100 -e VitisAIEP

3.3.5 Best Throughput: performance with eight instances of 1x4 core

The NPU four rows x four columns core is optimized for latency. The one row x four columns core is optimized for throughput and flexibility. The best throughput is achieved when eight 1x4 instances are used; it is recommended to use more than eight threads for optimal performance. For PHOENIX, no more than four instances can be used. In this STRIX experiment we employ eight models operating in parallel, as opposed to just one as in the previous examples. This enhances the throughput, allowing a higher number of images to be processed per unit of time. The eight instances of the model (specified by -i 8) must also be accompanied by an increased number of CPU threads (specified by -t 12). Force recompiling: if the model has already been compiled and the cache is present, the compiler reuses the cached model to save time. However, if the configuration files, the xclbin, or the Anaconda environment have changed, the cache must be cleared so the model is recompiled. Use the option -r 1 to clear the cache.

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32_qdq.onnx -n 100 -e VitisAIEP -i 8 -t 12 -r 1

3.3.6 Best Latency: performance with one instance of 4x4 core

The NPU four rows x four columns core is optimized for latency. The one row x four columns core is optimized for throughput and flexibility. In this STRIX example, the 4x4 core is used --core STX_4x4. If you have been following the sequence of experiments, previously the model was compiled for the default STX_1x4 core configuration. Therefore, the cache needs to be cleared using -r 1, and the model needs to be recompiled for the new core configuration.

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32_qdq.onnx -n 100 -e VitisAIEP -i 1 -t 1 -r 1 --core STX_4x4

3.3.7 Performance with a single instance on the CPU at a fixed rate of 30 fps

In certain scenarios it may be necessary to enforce a minimum time between inferences, and this can be accomplished with the --min_interval option. For instance, a delay of 0.033 seconds is used to simulate live video at a constant 30 frames per second, so the interval is set to 1/30 s (about 0.033 s). In this context, the option is used together with multithreading to ensure the CPU can maintain the desired frame rate even if the latency exceeds 33 milliseconds.

python performance_benchmark.py -m .\models\resnet50\resnet50_fp32.onnx -n 100 -e CPU --min_interval 0.033
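The interval value is simply the reciprocal of the target frame rate; as a small sketch (plain arithmetic, not taken from the tool's source):

# A 30 fps target corresponds to the 0.033 s interval used above.
target_fps = 30
min_interval = 1.0 / target_fps
print(f"--min_interval {min_interval:.3f}")  # prints: --min_interval 0.033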

3.3.8 File of parameters

As the number of input parameters increases, it can become convenient to supply a file containing all of them. This method aids in consistently reproducing the same experiment. For instance, the two cases below are optimized for either low latency or high throughput. The file of parameters overrides all other parameters.

python performance_benchmark.py --json .\test\STX_resnet50_high_throughput.json
python performance_benchmark.py --json .\test\STX_resnet50_low_latency.json
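The exact schema of these files is defined by the files shipped under .\test\; as a purely hypothetical sketch, assuming the keys mirror the command-line flag names, such a file could be generated like this:

import json

# Hypothetical parameter set assuming the keys mirror the CLI flag names;
# the files under .\test\ are the authoritative reference for the accepted keys.
params = {
    "model": ".\\models\\resnet50\\resnet50_fp32_qdq.onnx",
    "execution_provider": "VitisAIEP",
    "num": 100,
    "instance_count": 8,
    "threads": 12,
    "renew": 1,
}
with open("my_params.json", "w") as f:
    json.dump(params, f, indent=2)

The generated file (here named my_params.json for illustration) would then be passed with --json my_params.json.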

4 Additional features

4.1 Auto quantization

If the model benchmarked with the VitisAI EP is not quantized, the benchmarking process automatically applies quantization using the NHWC order. A notification in magenta is displayed during the quantization process. Prior to quantization, the user must download some images, for example from the ImageNet database, and specify the path to their location. Once quantization is completed, a new model with the _int8 suffix is saved in the current directory and tested.

python performance_benchmark.py -m ".\models\resnet50\resnet50_fp32.onnx" -e VitisAIEP --calib <images folder> --num_calib 10
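To anticipate whether auto quantization will kick in, you can check the model for existing QuantizeLinear/DequantizeLinear nodes (an assumption-based sketch using the onnx package, not the tool's own detection logic):

import onnx

# Check whether the model already contains Q/DQ nodes; if not, the FP32
# model would be auto-quantized by the benchmark (per the note above).
model = onnx.load(r".\models\resnet50\resnet50_fp32.onnx")
ops = {node.op_type for node in model.graph.node}
if {"QuantizeLinear", "DequantizeLinear"} & ops:
    print("Model already contains Q/DQ nodes")
else:
    print("FP32 model: auto quantization would be applied")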

4.2 Tricks

Once a command is executed in the GUI, the equivalent command line is copied to the clipboard, making it easy to paste into the terminal.