This guide assumes that the reader has successfully run the instructions in the AWS SDAccel README and is therefore accustomed to the general flow.
This document provides a detailed reference to the SDAccel Development Environment and its use with AWS F1 FPGA instances. The SDAccel environment allows kernels expressed in OpenCL or C/C++ to be accelerated by implementing them in custom FPGA hardware. The flexible SDAccel Development Environment also allows acceleration using pre-existing RTL designs. This guide covers the following concepts and workflows:
- Gain an understanding of the SDAccel Design Flow
- A complete Methodology for using the SDAccel Development Environment effectively
- How to work with examples
- Run a sample design on GUI
- Frequently Asked Questions (FAQ)
SDAccel uses a compiler named xocc, which can be thought of as similar to the GNU gcc compiler: it compiles source code into Xilinx object (.xo) files and can then link those .xo files together to create an executable program. The .xo files contain an RTL representation of the accelerated kernels, and the executable program is the design to be programmed onto the AWS F1 FPGA.
When the source code is OpenCL or C/C++ the Vivado High-Level Synthesis (HLS) tool is used under the hood to create the RTL that aims to match the required performance and then an .xo file is created using the Vivado toolchain.
When the source code is RTL, then the Vivado tool creates the .xo file directly without using Vivado HLS.
SDAccel also uses a platform, which contains the AWS F1 Shell and a set of IPs needed for SDAccel to interface with the kernels.
This document describes the above in more detail and links to the documentation for concepts discussed in the AWS SDAccel README and/or in the Xilinx SDAccel documentation.
Note: the initial version of this document used the 2017.1 documentation version.
The figure below shows:
- The design flow overview on the left-hand side, which uses the xocc option names
- The methodology flow on the right-hand side
Figure: SDAccel Design Flow for Amazon F1
As described in the AWS SDAccel README, the SDAccel Development Environment enables the integration of accelerator kernels into a design to be programmed on the AWS F1 FPGA instances. This section details the xocc command line options that are necessary to create the design to be programmed onto the AWS F1 FPGA.
(A) First and foremost, xocc needs information about the platform it is targeting. On AWS, you must always select the target hardware using --platform $AWS_PLATFORM. The alternative forms are --platform /PATH/TO/xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0 or --platform xilinx:aws-vu9p-f1:4ddr-xpr-2pr:4.0 if the platform is installed on-premise.
Furthermore, the --target option allows software emulation (sw_emu), hardware emulation (hw_emu) or hardware FPGA (hw, default) targets to be created. Further details are provided in Chapter 4 of the latest UG1023.
- Software Emulation target: Verifies the functional behavior of the host code and kernel operation via pure software execution
- Hardware Emulation target: Generates custom hardware and confirms kernel performance values via RTL simulation
- Hardware target: Implements the RTL hardware for the AWS F1 FPGA and allows confirmation of real-time operation on the FPGA
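The platform and target options described above can be sketched as a small dry-run helper. This is only an illustration: it prints the xocc command line rather than running it, the function name xocc_cmd and the source file vadd.cl are hypothetical, and the fallback platform string is simply the one quoted in this guide. A real build will usually need further options; see UG1023.

```shell
#!/bin/sh
# Dry-run sketch: prints the xocc command line for a given target instead of
# running it, so it also works on a machine without SDAccel installed.
# AWS_PLATFORM is assumed to be set by the AWS SDAccel setup; the fallback
# below is the platform name quoted in this guide.
AWS_PLATFORM="${AWS_PLATFORM:-xilinx:aws-vu9p-f1:4ddr-xpr-2pr:4.0}"

# xocc_cmd TARGET SOURCE...   (hypothetical helper, not part of SDAccel)
xocc_cmd() {
    target="$1"
    shift
    case "$target" in
        sw_emu|hw_emu|hw) ;;
        *) echo "unknown target: $target (expected sw_emu, hw_emu or hw)" >&2
           return 1 ;;
    esac
    echo "xocc --platform $AWS_PLATFORM --target $target $*"
}

# One command line per target; vadd.cl is a hypothetical kernel source.
xocc_cmd sw_emu vadd.cl
xocc_cmd hw_emu vadd.cl
xocc_cmd hw vadd.cl
```

Rejecting unknown target names early is a design choice of this sketch; it keeps a typo from silently falling back to the long-running default hw target.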
(B) Without the --compile option, .xclbin files are generated: they contain the FPGA bitstream and metadata to be programmed by the host code with the OpenCL API clCreateProgramWithBinary(). This is the default mode. With the --compile option, kernels or accelerator functions are compiled into .xo files independently, in preparation for linking. The default output filename is a.xclbin or a.xo, depending on whether --compile is used; the default name can be changed with the --output option. 'xo' stands for Xilinx object file.
(C) The RTL Kernel Wizard may be used to create template RTL files that can be used and/or modified to help create .xo files. Existing RTL designs must expose specific interfaces before they can be packaged into .xo files. These include ARM's AXI standard interfaces: AXI Master for accessing global memory in DDR, and AXI-Lite for control by the host program. RTL interfaces are explained in the latest UG1023, in the section "Expressing a kernel in RTL".
(D) If --compile was used to create .xo files, the --link flag allows multiple compiled kernel .xo files to be linked into a single xclbin file. At this stage, you may optionally link an RTL kernel (packaged .xo file).
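As a rough illustration of the compile-then-link flow in (B) and (D), the sketch below prints the two kinds of xocc command line without running them. The helper names compile_cmd and link_cmd, the TARGET variable, and the file names (vadd.cl, vmul.cl, design.xclbin) are hypothetical; only the --platform, --target, --compile, --link and --output options come from this guide, and a real invocation may need additional options described in UG1023.

```shell
#!/bin/sh
# Dry-run sketch of the two-step flow: compile each kernel source to a .xo
# file, then link the .xo files into a single .xclbin. Commands are printed,
# not executed.
AWS_PLATFORM="${AWS_PLATFORM:-xilinx:aws-vu9p-f1:4ddr-xpr-2pr:4.0}"
TARGET="${TARGET:-hw_emu}"

# compile_cmd SOURCE: prints the --compile step; output name is SOURCE with
# its extension replaced by .xo (hypothetical helper).
compile_cmd() {
    src="$1"
    xo="${src%.*}.xo"
    echo "xocc --platform $AWS_PLATFORM --target $TARGET --compile --output $xo $src"
}

# link_cmd XCLBIN XO...: prints the --link step (hypothetical helper).
link_cmd() {
    out="$1"
    shift
    echo "xocc --platform $AWS_PLATFORM --target $TARGET --link --output $out $*"
}

compile_cmd vadd.cl
compile_cmd vmul.cl
link_cmd design.xclbin vadd.xo vmul.xo
```

The same target is passed to both steps in this sketch, on the assumption that objects compiled for one target are linked for that same target.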
(E) For emulation compilation targets, we need a way of describing the hardware platform; this is achieved with the emconfigutil utility, which generates a file named emconfig.json that the Xilinx runtime uses to look up the platform target during host code execution. This file must be in the same directory as the host executable. See the chapter "Running Software and Hardware Emulation in XOCC Flow" in the latest UG1023.
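The requirement in (E) that emconfig.json sits next to the host executable is easy to check before launching an emulation run. The following is a minimal sketch; the function name has_emconfig and the path ./host are hypothetical, and the check only tests for the file's presence, not its contents.

```shell
#!/bin/sh
# has_emconfig HOST_EXE   (hypothetical helper)
# Succeeds when a file named emconfig.json exists in the same directory as
# the given host executable path.
has_emconfig() {
    exe_dir=$(dirname "$1")
    [ -f "$exe_dir/emconfig.json" ]
}

# Usage: ./host is a hypothetical host executable path.
if has_emconfig ./host; then
    echo "emconfig.json found next to ./host"
else
    echo "emconfig.json missing: run emconfigutil and place the file next to ./host"
fi
```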
(F) The GUI automates the flow through the use of target selection. For example, simply selecting the Hardware Emulation target creates all performance reports and performs hardware emulation.
(1) Software emulation confirms the functional correctness of the design and algorithm.
- Software emulation has short compile times (does not create FPGA programming files).
- Confirms the functionality is correct
- Software emulation is platform independent.
- Review performance and implementation estimates
- You can review the API calls in profile view or summary.
- Source code may be optimized for FPGA similarly to other source code optimized for CPU/GPU in advance of running on the target hardware.
- Most design refinements and code transformations may be done here.
- Software emulation does not support kernels expressed in RTL.
(2) Iterate around the Software emulation flow to create and verify the correct functionality
- Verify changes such as NDRange, work-group size or multiple compute units (CUs), and adapt the host code accordingly
- Adapt the code for the FPGA; this can include, but is not limited to, caching of data, bursting for optimal memory transfers, wider bus bit widths for data transfers, and exploration of alternative micro-architectures. See details in the optimization guide, latest UG1207
- Verify that the "host:kernel co-optimization" above still makes sense and was not broken by an incompatible interface or by behavior that differs from the original.
(3) Hardware emulation will produce an RTL hardware model for FPGA which accurately reflects the hardware with size, implementation and performance estimates in terms of latency.
- Medium compile times
- Confirm the hardware performance estimates through profiling
- Review detailed performance traces to determine bottlenecks
- Use OpenCL attributes and SDAccel optimization directives to improve performance
- This is where users get hardware estimation numbers: review the reports to confirm assumptions from software emulation about data port bit widths on the interface, and verify that optimization attributes are used and behave as expected (for example, review pipelining or add pipeline attributes as needed). See the kernel synthesis report section in the latest UG1023
- Hardware emulation provides a profiling report: review it to confirm data transfers, runtime estimates, kernel performance, etc. See the profiling summary report section in the latest UG1023
- Verify waveforms, or debug hangs or stalls caused by overly optimized buffers. See the application timeline section in the latest UG1023
(4) Perform different Hardware emulation runs to optimize the hardware or further explore tradeoffs.
- Run kernel compilation only to refine the hardware estimates (compile-time, static behavior); reiterate to tune or tailor the hardware to reduce area (size of pipes or FIFOs), decrease or increase hardware parallelization, etc.
- Run hw_emu to refine the dynamic behavior: changes of parallelism, optimization attributes, different work-group sizes and compute units, interaction of kernels and memory, etc.
- Change data transfers and further adjust transfer lengths or port data widths
(5) Hardware compilation target will create a bitstream to program the hardware target with fully accurate hardware implementation details
- Takes the longest time to compile
- Running on hardware (i.e. the FPGA device and board) gives the most accurate profiling
- Running on hardware should have the fastest execution time
- Can highlight issues: meeting timing, kernel clock changes, system failures, routing failures
(6) Iterate hw runs to perform
- Hardware debug/correctness
- Fully accurate transfer and performance numbers
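The progression through steps (1) to (6) can be sketched as a loop that visits the targets in order and stops at the first one that fails, so that a problem is fixed in the cheapest target before a longer build is started. This is only an illustration of the methodology: run_flow and build_and_run are hypothetical names, and build_and_run stands in for whatever actually builds and runs the design (for example a makefile target).

```shell
#!/bin/sh
# run_flow COMMAND   (hypothetical helper)
# Calls COMMAND once per target, in the order used by the methodology above
# (sw_emu, then hw_emu, then hw), and stops at the first target that fails.
run_flow() {
    for target in sw_emu hw_emu hw; do
        echo "== $target =="
        if ! "$@" "$target"; then
            echo "stopped: fix $target before moving to the next target" >&2
            return 1
        fi
    done
}

# Hypothetical stand-in for a real build-and-run step.
build_and_run() {
    echo "would build and run for target $1"
}

run_flow build_and_run
```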
This section demonstrates how to get started using SDAccel Development Environment using the Onboarding examples. For the purpose of this demonstration, we choose a simple vector addition example from the Onboarding examples.
As part of the script sdk_setup.sh, the Xilinx Onboarding examples are pulled as a submodule into the aws git checkout area, in the directory sdk/SDAccel/examples (https://github.com/aws/aws-fpga/tree/master/sdk/SDAccel/examples). Note that they can also be pulled from the Xilinx GitHub page or the SDAccel GUI. The examples are set up to run from a makefile flow, and no other changes are needed for command line interaction.
Figure: SDAccel onboarding examples submodules link on the AWS EC2 SDAccel examples
The GitHub examples use a common library, which needs to be copied into the local project source folder to make it easier to use. Type the command make local-files to copy all necessary files into the local directory.
Secondly, when creating the project, the custom hardware platform for AWS needs to be selected via the “add custom platform...” button on the hardware platforms wizard page. This is because no platforms are provided by default as part of the AWS installation of SDAccel. Select “add custom platform...” and browse to the xpfm file located inside the xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0 directory; this should be $SDK_DIR/SDAccel/aws_platform/xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0/
Before using an onboarding example in the GUI, it must be downloaded as an example template. The steps to do this are described below.
- Open the SDx GUI by running the following command in the terminal window:
sdx
- Select a workspace,
- Select SDx Example Store... from the Xilinx menu (please refer to figure below)
Figure: Accessing SDx Example store through GUI
- The above step will open a dialog where you can install the desired examples (please refer to the figure below).
Figure: Installing Getting Started examples
The directory named "getting_started" contains the onboarding examples for SDAccel. The examples are grouped into categories, each focusing on a particular aspect of the coding style, for example host code, kernel code, data movement, etc. There are different ways to run these examples.
- GUI flow using design files from AWS github
- GUI flow not using design files from AWS github: the SDAccel GUI can download/install examples from Xilinx SDAccel GitHub repository
A project needs to be created by importing the source files, either from outside or from inside the AWS GitHub sources, and by using the custom AWS platform DSA. Once the project is created, we can compile and run in Software Emulation, Hardware Emulation, or Hardware mode by selecting the Active build configuration inside the SDx Project Settings as shown below:
Figure: Selection of Software Emulation, Hardware Emulation or Hardware run through GUI
After selecting the build configuration, we can compile and run by selecting the appropriate icon from the GUI taskbar.
Figure: Compilation and execution.
You can navigate through some additional report files through the GUI
Figure: Various report files from Software and Hardware Emulation run
All the target modes generate a Profile Summary. The Application Timeline view is available in the hw_emu and hw targets.
The Profile Summary provides profiling data on host execution and gives runtime performance estimates.
The Application Timeline view collects and displays host and device events on a common timeline to help you understand and visualize the overall health and performance of your system.
Hardware Emulation mode generates a couple of additional report files as well.
The System Performance Estimate report provides timing, latency and area information for each kernel.
The High-Level Synthesis report of the kernel code provides the performance and logic utilization of the custom hardware logic generated from the kernel.
This section lists issues or perceived issues and their associated investigations or solutions.
What can we investigate when xocc fails with a path not meeting timing - e.g. WARNING: [XOCC 60-732] Link warning: One or more timing paths failed timing targeting <ORIGINAL_FREQ> MHz for <CLOCK_NAME>. The frequency is being automatically changed to <NEW_SCALED_FREQ> MHz to enable proper functionality.
- Generally speaking, lowering the clock will make the design functional in terms of operations (since there will not be timing failures), but the design might not operate at the performance needed due to this clock frequency change. We can review what can be done.
- If CLOCK_NAME is kernel clock 'DATA_CLK', then this is the clock that drives the kernels. Try reducing the kernel clock frequency; see the --kernel_frequency option to xocc in the latest UG1023
- If CLOCK_NAME is system clock 'clk_main_a0', then this is the clock clk_main_a0, which drives the AXI interconnect between the AWS Shell and the rest of the platform (SDAccel peripherals and user kernels). Using --kernel_frequency as above does not have any direct effect, but it might, as a side effect, change the topology/placement of the design and improve this issue.
- If OCL/C/C++ kernels were also used, investigate the VHLS reports and correlate them with the kernel source code to see if there are functions with a large number of statements in a basic block; for example, they might have unrolled loops with a large loop count, or a latency of more than 100. The VHLS runs and log files are located in the directory named _xocc*compile*
- Try xocc -O3 to run the bitstream creation process with higher effort.
- Open the Vivado implementation project with vivado `find -name ipiimpl.xpr` to analyze the design; this needs Vivado knowledge; see the UltraFast Design Methodology Guide for the Vivado Design Suite
- Look into utilization report
- If OCL/C/C++ kernels were also used, look into the source code for excessive unroll happening.
- There are 2 Vivado projects:
- The first one is for the design in the CL; open it from the command line with vivado `find -name ipiprj.xpr` to see the connectivity of the created design
- The second is the implementation project; open it with vivado `find -name ipiimpl.xpr` to analyze the design in the place and route phases; this needs Vivado knowledge; see the UltraFast Design Methodology Guide for the Vivado Design Suite
- Verify hw_emu works as expected; use less data if needed in hw_emu
- Assert where board run fails and check same conditions for hw_emu
- See "Chapter 8 - Debugging Applications in the SDAccel Environment" in latest UG1023
- Double-check that the FPGA or platform provided matches xilinx:aws-vu9p-f1:4ddr-xpr-2pr:4.0: are you using the same board and version as the DSA set for the xclbin?
- Board already in use: run xbsak with query option to check status.
- Is the code linking to *.so libraries, and are they set up correctly in the compiler command line arguments?
- Note: issues have been reported where -ldl or -lxilinxopencl needed to be the last argument on the compiler command line; try linking on the command line and see if moving the -l options corrects the issue.
- Is LD_LIBRARY_PATH set up correctly?
- emconfigutil needs to be run once to create the description of the HW device; the makefile should automate this.
- Is XCL_EMULATION_MODE=true in the env or subshell?
- Narrow down the failure: what mismatches? Are only the LSB bits different?
- Differences due to floating point?
- Run valgrind on executable to assert no seg faults or out of bounds accesses
- Have a reduced testcase data size if hw_emu takes too long
- Have sdaccel.ini configured with [Emulation] section using launch_waveform=gui or batch to generate waveform for analysis; see "Application Timeline" in latest UG1023
- The SDAccel flow does not allow a kernel clock of less than 60 MHz
- Raw xclbin (.xcp file) from xocc is not usable
- Directly using the .xcp file without conversion to .xclbin file will result in an error - Error: ... invalid binary
- Use the package_dcp.sh file (AWS script) to convert the .xcp file to .xclbin
- Look inside the status file to determine if the bitstream generation is complete
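Several of the emulation questions above (XCL_EMULATION_MODE, LD_LIBRARY_PATH, and an sdaccel.ini with an [Emulation] section) lend themselves to a quick scripted check. The sketch below is illustrative only: check_emulation_env and write_sdaccel_ini are hypothetical helper names, and the directory for sdaccel.ini is left as a parameter because this guide does not state where the runtime reads it from; check UG1023 for the expected location.

```shell
#!/bin/sh
# check_emulation_env   (hypothetical helper)
# Reports the emulation-related environment settings asked about above and
# fails if any of them looks wrong.
check_emulation_env() {
    status=0
    if [ "${XCL_EMULATION_MODE:-}" != "true" ]; then
        echo "XCL_EMULATION_MODE is not set to true"
        status=1
    fi
    if [ -z "${LD_LIBRARY_PATH:-}" ]; then
        echo "LD_LIBRARY_PATH is empty"
        status=1
    fi
    return "$status"
}

# write_sdaccel_ini DIR [MODE]   (hypothetical helper)
# Writes DIR/sdaccel.ini with an [Emulation] section; MODE is gui or batch
# and defaults to batch.
write_sdaccel_ini() {
    mode="${2:-batch}"
    case "$mode" in
        gui|batch) ;;
        *) echo "launch_waveform must be gui or batch" >&2
           return 1 ;;
    esac
    printf '[Emulation]\nlaunch_waveform=%s\n' "$mode" > "$1/sdaccel.ini"
}

# Usage: report problems without aborting the calling script.
check_emulation_env || echo "fix the settings listed above before running emulation"
# write_sdaccel_ini . batch   # uncomment to create ./sdaccel.ini
```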
The AWS SDAccel README.
Xilinx web portal for Xilinx SDAccel documentation and for Xilinx SDAccel GitHub repository
Links pointing to latest version of the user guides
- UG1023: SDAccel Environment User Guide
- UG1021: SDAccel Environment Tutorial: Getting Started Guide (including emulation/build/running on H/W flow)
- UG1207: SDAccel Environment Optimization Guide
- UG949: UltraFast Design Methodology Guide for the Vivado Design Suite
Links pointing to 2017.1 version of the user guides