Full documentation for rocFFT is available at rocm.docs.amd.com.
- Multi-device FFTs now allow batch greater than 1.
- Multi-device real-complex FFTs are now supported.
- Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices in a single process.
  - rocfft_field is a new type that can be added to a plan description, to describe layout of FFT input or output.
  - rocfft_field_add_brick can be called one or more times to describe a brick decomposition of an FFT field, where each brick can be assigned a different device.

These interfaces are still experimental and subject to change. We are interested to hear feedback on them. Questions and concerns may be raised by opening issues on the rocFFT issue tracker.
Note that at this time, multi-device FFTs have several limitations:
- Real-complex (forward or inverse) FFTs are not currently supported.
- Planar format fields are not currently supported.
- Batch (i.e. number_of_transforms provided to rocfft_plan_create) must be 1.
- The FFT input is gathered to the current device at execute time, so all of the FFT data must fit on that device.
We expect these limitations to be removed in future releases.
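
To make the brick idea above concrete, here is a minimal Python sketch that splits a field along its first dimension into one brick per device. It does not call rocFFT. The Brick class, the split_into_bricks helper, and the inclusive-lower/exclusive-upper coordinate convention are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Brick:
    """Hypothetical stand-in for one brick of a distributed field."""
    lower: Tuple[int, ...]  # inclusive lower corner (assumed convention)
    upper: Tuple[int, ...]  # exclusive upper corner (assumed convention)
    device: int             # device that owns this brick


def split_into_bricks(shape: Tuple[int, ...], num_devices: int) -> List[Brick]:
    """Split `shape` along its first dimension into one brick per device."""
    if num_devices < 1 or num_devices > shape[0]:
        raise ValueError("need between 1 and shape[0] devices")
    base, extra = divmod(shape[0], num_devices)
    bricks = []
    start = 0
    for device in range(num_devices):
        # The first `extra` bricks get one additional row each.
        rows = base + (1 if device < extra else 0)
        lower = (start,) + (0,) * (len(shape) - 1)
        upper = (start + rows,) + tuple(shape[1:])
        bricks.append(Brick(lower, upper, device))
        start += rows
    return bricks


# Example: a 256 x 128 field shared between two devices.
for brick in split_into_bricks((256, 128), num_devices=2):
    print(brick)
```

In a real program, each such brick would presumably be described to the library through rocfft_field_add_brick; the exact calls are left out here because the interfaces are experimental.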
- Improved performance of some small 2D/3D real FFTs supported by 2D_SINGLE kernel. gfx90a gets more optimization by offline tuning.
- Removed an extra kernel launch from even-length real-complex FFTs that use callbacks.
- Built kernels in solution-map to library kernel cache.
- Real forward transforms (real-to-complex) no longer overwrite input. rocFFT still may overwrite real inverse (complex-to-real) input, as this allows for faster performance.
- rocfft-rider and dyna-rocfft-rider have been renamed to rocfft-bench and dyna-rocfft-bench, controlled by the BUILD_CLIENTS_BENCH CMake option. Links for the old file names are installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for compatibility but both will be removed in a future release.
- Binaries in debug builds no longer have a "-d" suffix.
- rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g. 16-bit integers -> 32-bit float).
- Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
- Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1. Version 1 removes some incorrect kernels (sbrc/sbcr using half_lds).
- Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.
- Moved library kernel cache to lib/rocFFT directory.
- Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
- Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
- Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
- Replaced std::complex with hipComplex data types for data generator.
- FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
- Improved performance test suite statistical framework.
- Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
- Improved performance of 1D lengths < 2048 that use Bluestein's algorithm.
- Reduced time for generating code during plan creation.
- Optimized 3D R2C/C2R lengths 32, 84, 128.
- Optimized batched small 1D R2C/C2R cases.
- Added gfx1101 to default AMDGPU_TARGETS.
- Moved client programs to C++17.
- Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled.
- Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache.
- Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged.
- Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present.
- Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location.
- Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride.
- Optimized some strided large 1D plans.
- Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of an FFT by a given scaling factor.
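
A common reason to scale FFT output is normalization: an unnormalized forward transform followed by an unnormalized inverse returns N times the input. The Python sketch below illustrates that arithmetic with a naive DFT. It does not call rocFFT, and the dft helper and its scale parameter are hypothetical names used only for this illustration.

```python
import cmath
from typing import List


def dft(x: List[complex], inverse: bool = False, scale: float = 1.0) -> List[complex]:
    """Naive O(N^2) unnormalized DFT; each output element is multiplied by `scale`."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for m, value in enumerate(x):
            acc += value * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(acc * scale)
    return out


signal = [1 + 0j, 2 - 1j, 0 + 3j, -1 + 0j]
spectrum = dft(signal)

# Without scaling, the round trip yields N times the original values.
unscaled = dft(spectrum, inverse=True)
# A scale factor of 1/N on the inverse restores the original values.
roundtrip = dft(spectrum, inverse=True, scale=1.0 / len(signal))
```

Folding the multiplication into the transform, as the scale factor API is described as doing, avoids a separate pass over the output.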
- Created a rocfft_kernel_cache.db file next to the installed library. SBCC/CR/RC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures.
- Added gfx1100 and gfx1102 to default AMDGPU_TARGETS.
- Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the ROCFFT_RTC_CACHE_PATH environment variable.
- Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on Windows) before falling back to HOME.
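
For the on-disk cache option mentioned above, the following Python sketch builds an environment for a child process with ROCFFT_RTC_CACHE_PATH set. It assumes the variable holds the full path of a cache file; the env_with_rtc_cache helper and the per-host file name are hypothetical and chosen only to keep nodes from sharing one file.

```python
import os
import socket
import tempfile
from typing import Dict, Optional


def env_with_rtc_cache(base_env: Dict[str, str],
                       cache_dir: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of base_env with ROCFFT_RTC_CACHE_PATH set to a per-host file."""
    cache_dir = cache_dir or tempfile.gettempdir()
    # Hypothetical file name: including the host name keeps nodes apart
    # even if cache_dir turns out to be on a shared filesystem.
    cache_file = os.path.join(cache_dir, f"rocfft_rtc_{socket.gethostname()}.db")
    env = dict(base_env)
    env["ROCFFT_RTC_CACHE_PATH"] = cache_file
    return env


env = env_with_rtc_cache(dict(os.environ))
print(env["ROCFFT_RTC_CACHE_PATH"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching a program that uses rocFFT.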
- Moved computation of the twiddle table from host to the device.
- Optimized 2D R2C/C2R to use 2-kernel plans where possible.
- Improved performance of the Bluestein algorithm.
- Optimized sbcc-168 and 100 by using half-lds.
- Optimized length-280 2D/3D transforms.
- Added kernels for factorizable 1D lengths < 128.
- Fixed occasional failures to parallelize runtime compilation of kernels. Failures would be retried serially and ultimately succeed, but this would take extra time.
- Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALIGNED SBRC kernels. An example is 98^3 R2C out-of-place.
- Fixed bugs in SBRC_ERC type.
- Packages for test and benchmark executables on all supported OSes using CPack.
- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
- Improved reuse of twiddle memory between plans.
- Set a default load/store callback when only one callback type is set via the API for improved performance.
- Updated googletest dependency to version 1.11.
- Introduced a new access pattern of lds (non-linear) and applied it on sbcc kernels len 64 and 81 to get performance improvement.
- Applied lds-non-linear and direct-load-to-register on sbcr kernels to get performance improvement.
- Applied lds-non-linear and direct-store-from-register on sbrc kernels to get performance improvement.
- Fixed correctness of certain transforms with unusual strides.
- Fixed incorrect handling of user-specified stream for runtime-compiled kernels.
- Fixed incorrect buffer allocation in rocfft-test on in-place transforms with different input and output sizes.
- Supported unaligned tile dimension for SBRC_2D kernels.
- Improved (more RAII) test and benchmark infrastructure.
- Enabled runtime compilation of length-2304 FFT kernel during plan creation.
- Added tokenizer for test suite.
- Reduced twiddle memory requirements for even-length real-complex transforms.
- Clients can now be built separately from the main library.
- Optimized more large 1D cases by using L1D_CC plan.
- Optimized 3D 200^3 C2R case.
- Optimized 1D 2^30 double precision on MI200.
- Added padding to work buffer sizes to improve performance in many cases.
- Fixed correctness of some R2C transforms with unusual strides.
- The hipFFT API (header) has been removed from rocFFT after a long deprecation period. Please use the hipFFT package/repository to obtain the hipFFT API.
- Enabled runtime compilation of single FFT kernels > length 1024.
- Re-aligned split device library into 4 roughly equal libraries.
- Implemented the FuseShim framework to replace the original OptimizePlan.
- Implemented the generic buffer-assignment framework. Buffer assignment is no longer performed by each node; a generic algorithm tests the candidate assignment paths and picks the best one. With the help of FuseShim, as many kernel fusions as possible can be achieved.
- Do not read the imaginary part of the DC and Nyquist modes for even-length complex-to-real transforms.
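
The reason those imaginary parts can be skipped is a property of the DFT itself: for a real sequence of even length N, the DC mode (index 0) and the Nyquist mode (index N/2) are purely real. The Python sketch below checks this numerically with a naive DFT. It does not call rocFFT, and the real_dft helper is a hypothetical name used only for this illustration.

```python
import cmath
from typing import List


def real_dft(x: List[float]) -> List[complex]:
    """Naive unnormalized forward DFT of a real sequence (all N outputs)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


samples = [0.5, -1.25, 3.0, 2.0, -0.75, 1.0]  # even length, N = 6
spectrum = real_dft(samples)
dc = spectrum[0]
nyquist = spectrum[len(samples) // 2]

# Both imaginary parts vanish up to floating-point rounding error.
print(abs(dc.imag), abs(nyquist.imag))
```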
- Optimized twiddle-conjugation; complex-to-complex inverse transforms should have similar performance to forward transforms now.
- Improved performance of single-kernel small 2D transforms.
- Optimized SBCC kernels of length 52, 60, 72, 80, 84, 96, 104, 108, 112, 160, 168, 208, 216, 224, 240 with new kernel generator.
- Added support for Windows 10 as a build target.
- Packaging split into a runtime package called rocfft and a development package called rocfft-devel. The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future ROCm release.
- Fixed a few validation failures of even-length in-place R2C transforms for 2D and 3D cubic sizes such as 100^2 (or ^3), 200^2 (or ^3), and 256^2 (or ^3). These cases no longer combine three kernels (stockham-r2c-transpose); only two kernels (r2c-transpose) are combined instead.
- Split 2D device code into separate libraries.
- Improved many plans by removing unnecessary transpose steps.
- Optimized scheme selection for 3D problems.
- Imposed fewer restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and gain some performance.
- Enabled 3D_RC. Some 3D problems with an SBCC-supported z-dimension can use fewer kernels and benefit from that.
- Forced --length 336 336 56 (double precision) to use the faster 3D_RC scheme, so that it is no longer skipped by a conservative threshold test.
- Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels.
- Added radix-17.
- Added new kernel generator for select fused-2D transforms.
- Improved large 1D transform decompositions.
- Re-split device code into single-precision, double-precision, and miscellaneous kernels.
- Fixed potential crashes in double-precision planar->planar transpose.
- Fixed potential crashes in 3D transforms with unusual strides, for SBCC-optimized sizes.
- Improved buffer placement logic.
- Added new kernel generator for select lengths. New kernels have improved performance.
- Added public rocfft_execution_info_set_load_callback and rocfft_execution_info_set_store_callback API functions to allow executing extra logic when loading/storing data from/to global memory during a transform.
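
As a rough host-side model of what load and store callbacks do, the Python sketch below routes every input read and every output write of a naive DFT through user-supplied functions. It does not call rocFFT, where callbacks run on the device. The function names and callback signatures here are hypothetical; the load callback shown converts 16-bit integer samples to floating point on the fly.

```python
import cmath
from typing import Callable, List, Sequence


def dft_with_callbacks(
    data: Sequence[int],
    load: Callable[[Sequence[int], int], complex],
    store: Callable[[List[complex], int, complex], None],
) -> List[complex]:
    """Naive forward DFT whose input reads and output writes go through callbacks."""
    n = len(data)
    # Every read of the input goes through the load callback.
    x = [load(data, i) for i in range(n)]
    out: List[complex] = [0j] * n
    for k in range(n):
        value = sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        # Every write of the output goes through the store callback.
        store(out, k, value)
    return out


def load_int16(data: Sequence[int], offset: int) -> complex:
    """Convert a signed 16-bit sample to a complex value in [-1, 1)."""
    return complex(data[offset] / 32768.0, 0.0)


def store_plain(out: List[complex], offset: int, value: complex) -> None:
    """Write the value unchanged; a real callback could rescale or convert it."""
    out[offset] = value


raw = [16384, 0, 0, 0]  # int16 samples; only the first is non-zero
result = dft_with_callbacks(raw, load_int16, store_plain)
```

The conversion happens as part of the load, so no separate pass is needed to turn the integer input into floating point before the transform.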
- Removed R2C pair schemes and kernels.
- Optimized 2D/3D R2C 100 and 1D Z2Z 2500.
- Reduced number of kernels for 2D/3D sizes where higher dimension is 64, 128, 256.
- Fixed potential crashes in 3D transforms with unusual strides, for SBCC-optimized sizes.
- Moved device code into main library.
- Improved performance for single precision kernels exercising all except radix-2/7 butterfly ops.
- Minor optimization for C2R 3D 100, 200 cube sizes.
- Optimized some C2C/R2C 3D 64, 81, 100, 128, 200, 256 rectangular sizes.
- When factoring, test to see if remaining length is explicitly supported.
- Explicitly add radix-7 lengths 14, 21, and 224 to list of supported lengths.
- Optimized R2C 2D/3D 128, 200, 256 cube sizes.
- Fixed potential crashes in small 3D transforms with unusual strides. (ROCm#311)
- Fixed potential crashes when executing transforms on multiple devices. (ROCm#310)
- Explicitly specify MAX_THREADS_PER_BLOCK through __launch_bounds__ for all kernels.
- Switch to new syntax for specifying AMD GPU architecture names and features.
- Optimized C2C/R2C 3D 64, 81, 100, 128, 200, 256 cube sizes.
- Improved performance of the standalone out-of-place transpose kernel.
- Optimized 1D length 40000 C2C case.
- Enabled radix-7 for size 336.
- New radix-11 and radix-13 kernels; used in length 11 and 13 (and some of their multiples) transforms.
- rocFFT now automatically allocates a work buffer if the plan requires one but none is provided.
- An explicit rocfft_status_invalid_work_buffer error is now returned when a work buffer of insufficient size is provided.
- Updated online documentation.
- Updated Debian package naming so that the name and version are separated by '_'.
- Adjusted accuracy test tolerances and how they are compared.
- Fixed 4x4x8192 accuracy failure.
- Optimized 1D length 10000 C2C case.
- Added BUILD_CLIENTS_ALL CMake option.
- Fixed correctness of SBCC/SBRC kernels with non-unit strides.
- Fixed fused C2R kernel when a Bluestein transform follows it.
- New R2C and C2R fused kernels to combine pre/post processing steps with transpose.
- Enabled diagonal transpose for 1D and 2D power-of-2 cases.
- New single kernels for small power-of-2, 3, 5 sizes.
- Added more radix-7 kernels.
- Explicitly disable XNACK and SRAM-ECC features on AMDGPU hardware.
- Fixed 2D C2R transform with length 1 on one dimension.
- Fixed potential thread unsafety in logging.
- Improved performance of 1D batch-paired R2C transforms of odd length.
- Added some radix-7 kernels.
- Improved performance for 1D length 6561, 10000.
- Improved performance for certain 2D transform sizes.
- Allow static library build with BUILD_SHARED_LIBS=OFF CMake option.
- Updated googletest dependency to version 1.10.
- Fixed correctness of certain large 2D sizes.
- Optimized C2C power-of-2 middle sizes.
- Parallelized work in unit tests and eliminated duplicate cases.
- Fixed correctness of certain large 1D, and 2D power-of-3, 5 sizes.
- Fixed incorrect buffer assignment for some even-length R2C transforms.
- Fixed <cstddef> inclusion on C compilers.
- Fixed incorrect results on non-unit strides with SBCC/SBRC kernels.