-
Notifications
You must be signed in to change notification settings - Fork 1
The Unified Accumulator Architecture: A Configurable, Portable, and Extensible Floating-Point Accumulator
License
ARC-Lab-UF/UAA
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
For: The Unified Accumulator Architecture: A Configurable, Portable, and Extensible Floating-Point Accumulator By: David Wilson, Greg Stitt, University of Florida License Statement: GPL Version 3 --------------------------------- Copyright (c) 2015 University of Florida This file is part of uaa. uaa is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. uaa is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. For the details of the GNU General Public License, see the included gpl.txt file, or go to http://www.gnu.org/licenses/. Required Tools and Version Numbers: ----------------------------------- ISE Design Suite (Tested in 13.0+) Quartus II (Tested in 12.0+) ieee_proposed library for simulation Overview of Files: ------------------ build.tcl - ise script that implements design isim.tcl - isim script that runs the testbench simulation sim.prj - ise project file used for simulation purposes makefile - make files with the build and verify targets gpl.txt - contains the gpl 3.0 license src/ - folder containing all of the source files build/ - folder created by makefile to hold compilation files sim/ - folder created by makefile to hold simulation files Overview of VHDL Source Files: ----------------------------- Top Level Entity uaa.vhd - Top level entity for the UAA uaa_pkg.vhd - Package file that defines all accumulator architecture options Common Entities add_flt.vhd - Wrapper file that provides common interface for different adder cores flt_pkg.vhd - Package file containing characteristics about different adder cores add_tree_flt.vhd - Source file for adder tree add_tree_flt_pkg.vhd - Package file for adder tree add_wrapper.vhd - Wrapper file for add_flt providing additional pipeline-related entities delay.vhd - Source file for shift register entity fifo.vhd - Source file for fifo entity fifo_core.vhd - Source file for additional fifo logic ram.vhd - Source file for ram entity math_custom.vhd - Package file containing commonly used math functions DSA Specific dsa.vhd - Main file for DSA architecture dsa_ctrl.vhd - Source file for DSA controllers dsa_pkg.vhd - Package file for DSA FCBT Specific fcbt.vhd - Main file for FCBT architecture fifo2.vhd - Source file for FCBT buffers SGA Specific sga.vhd - Main file for SGA architecture Testbench Specifc uaa_tb.vhd - Testbench uaa_tb_pkg.vhd - Testbench package file uaa_tb_all.vhd - Testbench testing all three architectures using uaa_tb Build Instructions: ------------------- Setup the ISE environment and run the command "make build" in the project folder. This should invoke build.tcl from Xilinx's xtclsh, which should compile the core using XST, with the build outputs in the "build" folder. If you run into problems with this command, create a new project and add the source files. Set the uaa entity as the top-level and run synthesis. Verification Instructions: -------------------------- Setup the ISE environment and run the command "make verify" in the project folder. This should invoke Xilinx's fuse using source VHDL indicated in the provided sim.prj to produce a simulation executable x.exe. The executable is then launched and run through commands provided in the isim.tcl. Wait until the output indicates that "TEST DSA", "TEST FCBT", and "TEST SGA" have completed. Associated simulation files should be generated in the "sim" folder. If you run into problems with this command, open any simulator, add the files in the src/ folder to the project, and select uaa_tb_all.vhd as the testbench. For simulators other than ISIM, be sure to add the simulation libraries for Xilinx's floating-point core or change add_flt.vhd to include an adder core you can simulate. Engineering specification with functionality and all interfaces documented (intended for a technical user): ----------------------------------------------------------------------------------------------------------- The top level file uaa has a common port used to interface with all of the accumulation architectures. See Section 3.2 of the accompanying paper for more details. Port Description: clk : clock input rst : reset input (asynchronous, active high) hold_output : The user can assert this signal to prevent the entity from continuing past a valid output. When hold_output is not asserted, the output is valid for only a single cycle. This signal makes it possible to stall the entity if downstream components can't receive data (active high) ready : The entity asserts this signal when it is ready to accept new inputs. If not asserted, external components must hold the current input or it will be lost (active high) end_of_group : User should be assert on the same cycle as the last input in a group (active high) input : Provides parallel_inputs number of width-bit inputs output : Accumulated output valid_in : User should assert when input is valid and ready is asserted (active high) valid_out : Entity asserts when the output from the dsa is valid. Unless the user asserts hold_output, the entity only asserts valid_out is only asserted for one cycle. (active high) Customization Options and Instructions: --------------------------------------- The top level file uaa has a variety of generics that can be changed to meet one's application. See Section 3.1 of the accompanying paper for more details. Generics Description: arch : Specifies the type of architecture to use. All possible options are included in uaa_pkg.vhd width : The width of the input and output in bits (e.g., 32 bits for single precision, 64 for double) parallel_inputs : The number of inputs provided in parallel as input. For example, if a design can provide 4 inputs every cycle, this should be 4 to maximize throughput. add_core_name : A string representing different optimizations for the actual adder core. See add_flt.vhd and flt_pkg.vhd for more information. use_bram : attempts to use bram when true, uses LUTs or FFs when false FCBT_max_inputs : Specifies the maximum number of inputs for the FCBT architecture. This generic is ignored for other architectures, which support any number of inputs. FCBT_obuf_size : Specifies the size of the output buffer for the FCBT architecture. This generic is ignored for other architectures, which support any number of inputs. Customization Instructions: -------------------------- To make changes to these generics, either manually set the generic values in the ise project or change the constants used in build.tcl, which is used in the make build flow. Adder Core Addition Instructions: -------------------------------- 1) Select a floating-point adder core for the targeted FPGA. This project uses several Xilinx and Altera cores as examples. You may want to create multiple versions of the core with different optimization goals (e.g., speed, area). Make note of the pipeline latencies of each core. Note that any core can work as long as it has a pipeline latency greater than 1 cycle and can be stalled. 2) Open src/flt_pkg.vhd. Modify the add_flt_latency function to define the pipeline latency of each optimization option your are using. e.g. An area-optimized core might have a latency of 7 cycles, whereas a speed-optimized core might have a latency of 14 cycles. The "core_name" string represents a user-defined name for the optimization, which can be used as an input parameter to the top-level entity uaa. If you only have one adder core, simply return the latency of that core while ignoring the core_name string. 3) Open src/add_flt.vhd. Modify the architecture to instantiate a corresponding adder core for each relevant value of the "core_name" string. If you only have one adder core, you can ignore the core_name string. 4) You can now instantiate the floating-point accumulator in uaa.vhd. You can select between different adder core options by specifying the names you selected in the "core_name" generic. If you did not use the "core_name" string in add_flt_pkg or add_flt, the "core_name" generic is ignored. COMMON PROBLEMS: The add_tree_flt entity use a recursive structural architecture, which is not supported by all simulators. We have tested the code with Modelsim and Xilinx's ISIM, which comes with ISE. Simulations with the provided testbench require installation of fphdl from the ieee_proposed library: http://www.eda-stds.org/fphdl/ Most new tools already include this library, but if not, we have included the library in the ieee_proposed folder. When targeting older Xilinx FPGAs, you may need to use a flag to force ISE to use the new parser. To do this, Put "-use_new_parser yes" in the synthesis properties. ISE also sometimes has problems setting enum generics, so creating a top_level file that uses the uaa entity may be an alternative if this is a problem. If you run into problems with "make build", create a new project and add the source files. Set the uaa entity as the top-level and run synthesis. It has been reported some users run into segmentation faults when using "make verify" which appears to be an issue for some versions of ISE on certain systems. If this does occur, try running ISIM in GUI mode. If you run into other problems with "make verify", open any simulator, add the files in the src/ folder to the project, and select uaa_tb_all.vhd as the testbench. For simulators other than ISIM, be sure to add the simulation libraries for Xilinx's floating-point core or change add_flt.vhd to include an adder core you can simulate. As mentioned before, you may need to setup the ieee_proposed library. If you are having trouble running make sure you run the following steps: 1. Clone the GitHub repository git clone https://github.com/ARC-Lab-UF/UAA 2. Setup the Xilinx environment. The makefile uses scripts that run different Xilinx executables from the terminal which requires the Xilinx environment. export LM_LICENSE_FILE=<floating license location> -- if needed and not already done source <ISE_PATH>/settings64.sh -- setup env if not already on path 3. Navigate to main directory cd <PROJECT_PATH>/uaa 3. Make with build or verify option make build make verify Description of Verification Suite: ---------------------------------- The verification suite consists of top level testbench (uaa_tb_all.vhd) that instantiates a highly customizable testbench (uaa_tb.vhd) for each of the architectures and verifies that each architectures produces a correct output. The uaa_tb.vhd testbench provides numerous generics for controlling the behavior of the simulation. To test multiple architectures, or different configurations of the same architecture, instantiate this testbench multiple times within another testbench. For output, the testbench will report the number of groups that had outputs that exceeded both the acceptable percent difference of an estimated accurate result calculated through double precision and the sequentially accumulated result. Each individual input group will report these differences in a double precision hex string. A user can also test for architectural correctness by asserting test_for_arch, which feeds the architectures '1's for inputs, and checks that the output has zero groups with output differences from the estimated accurate result calculated with double precision floating point values and from the sequential accumulated result. This test works because using '1's avoids floating point rounding issues with vastly different exponents and because the same input values negate differences between the sequential order of additions and the architecture's order of additions. For more information, about each parameter, see the descriptions below: Port Descriptions: test_name : Specifies the name of the current test, which the testbench prints during the simulation. test_for_arch : Specifies whether the testbench is testing for architectural correctness by feeding the architectures '1's for inputs. Using '1's ensures the output is not affected by rounding and keeps the order of additions equivalent to the sequential order. This test should not be asserted with test_for_error. test_for_error : Specifies whether the testbench is testing for error in the final result by feeding the architectures input values with randomized exponent values. Use min_input_exp and max_input_exp to limit the range of the input's exponents. This test should not be asserted with test_for_arch. arch : Specifies the type of architecture to use. All possible options are included in uaa_pkg.vhd parallel_inputs : The number of inputs provided in parallel as input. For example, if a design can provide 4 inputs every cycle, this should be 4 to maximize throughput. add_core_name : A string representing different optimizations for the actual adder core. See add_flt.vhd and flt_pkg.vhd for more information. use_bram : Attempts to use bram when true, uses LUTs when false FCBT_max_inputs : Specifies the maximum number of inputs for the FCBT architecture. This generic is ignored for other architectures, which support any number of inputs. total_groups : Specifies the total number of groups to test. min_group_size : Specifies the minimum size of any group. If this value is not a multiple of parallel_inputs, the testbench will increase it to the nearest multiple. max_group_size : Specifies the maximum size of any group. If this value is not a multiple of parallel_inputs, the testbench will increase it to the nearest multiple. min_input_exp : The minimum random value used for an input's exponent. This parameter is only relevant when test_for_error is asserted max_input_exp : The maximum random values used for an input's exponent. This parameter is only relevant when test_for_error is asserted input_delay_prob : The probability that and input will be delayed (i.e. will not appear one cycle after the previous input. min_input_delay : If an input is delayed, this specifies the minimum random cycle delay. max_input_delay : If an input is delayed, this specifies the maximum random cycle delay. group_delay_prob : The probability that there will be a delay in between consecutive input groups. min_group_delay : If a group is delayed, this specifies the minimum random cycle delay. max_group_delay : If a group is delayed, this specifies the maximum random cycle delay. hold_output_prob : The probability during any cycle that hold_ouput will be asserted. min_hold_output_delay : If hold_output is asserted, this specifies the minimum random cycle delay before deassertion. max_hold_output_delay : If hold_output is asserted, this specifies the maximum random cycle delay before deassertion. junk_input_prob : The probability that a junk value will be used as input while accumulator is not ready (i.e. ready = '0'). acceptable_error_percent : Specifies acceptable percentage difference between the actual output and an estimated correct output due to floating rounding. Also, specifies acceptable percentage difference between the actual output and a sequentially accumulated output due to assumption of associativity. print_status : When asserted, simulation prints information that specifies the current status. Useful for long simulations. start_group : Use its falling_edge to indicate the beginning of a new group, and apply changes to parameter port signals Known Limitations: ------------------ - There are a variety of changes that could be applied to each architecture implementation that could better optimize the architecture toward a specific goal, but have not been implemented. For example, the FCBT has a lower clock rate that could mitigated through further pipelining of the control and the datapath. Similarly, the FCBT uses large multiplixers that greatly reduce the clock speed as the number of inputs increases, which could be replaced by an alternative implementation to make the design more scalable. - The testbench's method of verification relies on estimating the accurate result with double precision values. This method is therefore incapable of accurately determining error when tested against double precision adder cores.
About
The Unified Accumulator Architecture: A Configurable, Portable, and Extensible Floating-Point Accumulator
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published