Commit ca7101d: WIP

neon60 committed May 25, 2024 (1 parent: 854237c)
Showing 3 changed files with 25 additions and 24 deletions.
1 change: 1 addition & 0 deletions .wordlist.txt
@@ -39,6 +39,7 @@ hipother
HIPRTC
hcBLAS
icc
+inplace
Interoperation
interoperate
Intrinsics
36 changes: 18 additions & 18 deletions docs/how-to/hip_porting_guide.md
@@ -48,7 +48,7 @@ info: hipify ./kmeans_cuda_kernel.cu =====>
  * Interesting information in `kmeans_cuda_kernel.cu`:
* How many CUDA calls were converted to HIP (40)
-  * Breakdown of the CUDA functionality used (dev:0 mem:0 etc). This file uses many CUDA builtins (37) and texture functions (3).
+  * Breakdown of the CUDA functionality used (`dev:0 mem:0` etc). This file uses many CUDA builtins (37) and texture functions (3).
* Warning for code that looks like CUDA API but was not converted (0 in this file).
* Count Lines-of-Code (LOC) - 185 for this file.
@@ -97,7 +97,7 @@ Most CUDA libraries have a corresponding ROCm library with similar functionality
| cuFFT | hipFFT | rocFFT | Fast Fourier Transform Library
| cuSPARSE | hipSPARSE | rocSPARSE | Sparse BLAS + SPMV
| cuSOLVER | hipSOLVER | rocSOLVER | LAPACK library
-| AmgX | N/A | rocALUTION | Sparse iterative solvers and preconditioners with geometric and algebraic multiGrid
+| AmgX | N/A | rocALUTION | Sparse iterative solvers and preconditioners with algebraic multigrid
| Thrust | N/A | rocThrust | C++ parallel algorithms library
| CUB | hipCUB | rocPRIM | Low Level Optimized Parallel Primitives
| cuDNN | N/A | MIOpen | Deep learning Solver Library
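
The call sites usually translate as mechanically as the names in this table. Below is a minimal sketch, assuming the hipBLAS API (`hipblasCreate`, `hipblasSaxpy`, `hipblasDestroy`; the header path may vary between ROCm versions):

```cpp
#include <hipblas/hipblas.h>

// d_x and d_y are device pointers; hipblasSaxpy stands in for cublasSaxpy.
void axpy_example(int n, float alpha, const float* d_x, float* d_y) {
    hipblasHandle_t handle;
    hipblasCreate(&handle);                           // replaces cublasCreate
    hipblasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // replaces cublasSaxpy
    hipblasDestroy(handle);                           // replaces cublasDestroy
}
```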
@@ -143,9 +143,9 @@ Compiler directly generates the host code (using the Clang x86 target) and passe
### Identifying Current Compilation Pass: Host or Device
-nvcc makes two passes over the code: one for host code and one for device code.
+NVCC makes two passes over the code: one for host code and one for device code.
HIP-Clang makes multiple passes over the code: one for the host code, and one for each device architecture.
-`__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or nvcc) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace `#ifdef` checks on the `__CUDA_ARCH__` define.
+`__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or NVCC) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace `#ifdef` checks on the `__CUDA_ARCH__` define.
```cpp
// #ifdef __CUDA_ARCH__
#ifdef __HIP_DEVICE_COMPILE__  // portable replacement; value is 1 or undefined
#endif
```
@@ -164,8 +164,8 @@ Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, an
|`__HIP_DEVICE_COMPILE__` | 1 if compiling for device; undefined if compiling for host | 1 if compiling for device; undefined if compiling for host | Undefined
|`__HIPCC__` | Defined | Defined | Undefined
|`__HIP_ARCH_*` | 0 or 1 depending on feature support (see below) | 0 or 1 depending on feature support (see below) | 0
-|nvcc-related defines:|
-|`__CUDACC__` | Defined if source code is compiled by nvcc; undefined otherwise | Undefined
+|NVCC-related defines:|
+|`__CUDACC__` | Defined if source code is compiled by NVCC; undefined otherwise | Undefined
|`__NVCC__` | Undefined | Defined | Undefined
|`__CUDA_ARCH__` | Undefined | Unsigned representing compute capability (e.g., "130") if in device code; 0 if in host code | Undefined
|hip-clang-related defines:|
@@ -266,7 +266,7 @@ It can replace <<< >>>, if the user so desires.
## Compiler Options
-hipcc is a portable compiler driver that will call nvcc or HIP-Clang (depending on the target system) and attach all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler.
+hipcc is a portable compiler driver that will call NVCC or HIP-Clang (depending on the target system) and attach all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler.
The `hipconfig` script may be helpful in identifying the target platform, compiler, and runtime. It can also help set options appropriately.
### Compiler options supported on AMD platforms
@@ -288,18 +288,18 @@ Here are the main compiler options supported on AMD platforms by HIP-Clang.
### Linking With hipcc
-hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (nvcc or AMD compiler). We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also has knowledge on how to link and to manage the GPU objects.
+hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (NVCC or AMD compiler). We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also has knowledge on how to link and to manage the GPU objects.
### `-lm` Option
hipcc adds `-lm` by default to the link command.
## Linking Code With Other Compilers
-CUDA code often uses nvcc for accelerator code (defining and launching kernels, typically defined in `.cu` or `.cuh` files).
-It also uses a standard compiler (g++) for the rest of the application. nvcc is a preprocessor that employs a standard host compiler (gcc) to generate the host code.
-Code compiled using this tool can employ only the intersection of language features supported by both nvcc and the host compiler.
-In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported---for example, recent nvcc versions lack Clang host-compiler capability.
+CUDA code often uses NVCC for accelerator code (defining and launching kernels, typically defined in `.cu` or `.cuh` files).
+It also uses a standard compiler (g++) for the rest of the application. NVCC is a preprocessor that employs a standard host compiler (gcc) to generate the host code.
+Code compiled using this tool can employ only the intersection of language features supported by both NVCC and the host compiler.
+In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported---for example, recent NVCC versions lack Clang host-compiler capability.
HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same API as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
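
A minimal sketch of that pattern (file and function names here are hypothetical): the HIP translation unit exposes a plain C++ entry point, g++-compiled code calls it through an ordinary header, and hipcc performs the final link.

```cpp
// kernels.hip.cpp: compiled with hipcc; callers compiled with g++ only see
// the plain C++ signature of scale_on_gpu().
#include <hip/hip_runtime.h>

__global__ void scale_kernel(float* d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void scale_on_gpu(float* host_data, int n, float factor) {
    float* d = nullptr;
    hipMalloc(&d, n * sizeof(float));
    hipMemcpy(d, host_data, n * sizeof(float), hipMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n, factor);
    hipMemcpy(host_data, d, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d);
}
```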
@@ -316,7 +316,7 @@ When cross-linking C++ code, any C++ functions that use types from the C++ stand
Applications with these interfaces should use the default libstdc++ linking.
-Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to nvcc, may choose to use libc++.
+Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to NVCC, may choose to use libc++.
### HIP Headers (hip_runtime.h, hip_runtime_api.h)
@@ -342,7 +342,7 @@ You can capture the `hipconfig` output and pass it to the standard compiler; b
CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
```
-nvcc includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included.
+NVCC includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included.
Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers.
If the compilation process reports that it cannot find necessary APIs (for example, `error: identifier hipSetDevice is undefined`),
ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate).
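
A minimal sketch of the explicit include:

```cpp
// HIP requires this include explicitly; NVCC would pull in its runtime
// header implicitly.
#include <hip/hip_runtime.h>

int main() {
    hipSetDevice(0);  // without the include above, this identifier is undefined
    return 0;
}
```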
@@ -354,7 +354,7 @@ The HIP-Clang path provides an empty `cuda.h` file. Some existing CUDA programs
### Choosing HIP File Extensions
-Many existing CUDA projects use the `.cu` and `.cuh` file extensions to indicate code that should be run through the nvcc compiler.
+Many existing CUDA projects use the `.cu` and `.cuh` file extensions to indicate code that should be run through the NVCC compiler.
For quick HIP ports, leaving these file extensions unchanged is often easier, as it minimizes the work required to change file names in the directory and #include statements in the files.
For new projects or ports which can be re-factored, we recommend the use of the extension `.hip.cpp` for source files, and
@@ -364,7 +364,7 @@ run hipcc when appropriate.
## Workarounds
-### warpSize
+### ``warpSize``
Code should not assume a warp size of 32 or 64. See [Warp Cross-Lane Functions](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html#warp-cross-lane-functions) for information on how to write portable wave-aware code.
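
For example, a small sketch of wave-aware host code that queries the warp size at run time instead of assuming 32 or 64:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);
    // 32 on NVIDIA GPUs; 32 or 64 on AMD GPUs depending on the architecture
    std::printf("warp size: %d\n", props.warpSize);
    return 0;
}
```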
@@ -496,7 +496,7 @@ HIP/HIP-Clang does not provide this functionality. As a workaround, users can s
### Textures and Cache Control
-Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the nvcc path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the `__ldg` instruction for this purpose.
+Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the NVCC path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the `__ldg` instruction for this purpose.
AMD compilers currently load all data into both the L1 and L2 caches, so `__ldg` is treated as a no-op.
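
An illustrative kernel (a sketch, not from the guide): on the CUDA path `__ldg` can route the load through the read-only cache, while on AMD it simply compiles to an ordinary load, as described above.

```cpp
__global__ void copy_read_only(float* __restrict__ out,
                               const float* __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // cache hint on CUDA; no-op on AMD
}
```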
@@ -553,7 +553,7 @@ enum LogMask {
### Debugging hipcc
-To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to stderr the HIP-clang (or NVCC) commands that hipcc generates.
+To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to ``stderr`` the HIP-clang (or NVCC) commands that hipcc generates.
```bash
export HIPCC_VERBOSE=1
```
12 changes: 6 additions & 6 deletions docs/reference/kernel_language.rst
@@ -1323,10 +1323,10 @@ Following is the list of supported integer intrinsics. Note that intrinsics are
* - | ``unsigned long long int __umul64hi(unsigned long long int x, unsigned long long int y)``
  | Returns the most significant 64 bits of the product of the two 64-bit unsigned integers.
-The HIP-Clang implementation of ``__ffs()`` and ``__ffsll()`` contains code to add a constant +1 to produce the ffs result format.
+The HIP-Clang implementation of ``__ffs()`` and ``__ffsll()`` contains code to add a constant +1 to produce the ``ffs`` result format.
For the cases where this overhead is not acceptable and the programmer is willing to specialize for the platform,
HIP-Clang provides ``__lastbit_u32_u32(unsigned int input)`` and ``__lastbit_u32_u64(unsigned long long int input)``.
-The index returned by ``__lastbit_`` instructions starts at -1, while for ffs the index starts at 0.
+The index returned by ``__lastbit_`` instructions starts at -1, while for ``ffs`` the index starts at 0.
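
A short sketch contrasting the two (``__lastbit_u32_u32`` is the HIP-Clang-specific intrinsic named above):

.. code-block:: cpp

   __global__ void first_set_bit(const unsigned int* in, int* out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           out[i] = __ffs(in[i]);                // +1-adjusted ffs result format
           // out[i] = __lastbit_u32_u32(in[i]); // platform-specialized variant
       }
   }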

Floating-point Intrinsics
--------------------------------------------------------------------------------------------
@@ -2121,10 +2121,10 @@ This function produces a similar effect of using ``asm("trap")`` in the CUDA cod
In HIP, the function terminates the entire application, while in CUDA, ``asm("trap")`` only terminates the dispatch and the application continues to run.
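
As a sketch (assuming the device-side ``abort()`` this section describes):

.. code-block:: cpp

   __global__ void validate(const int* flag)
   {
       if (*flag != 0)
           abort(); // on HIP this ends the whole application, not just the dispatch
   }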


-Printf
+``printf``
============================================================

-Printf function is supported in HIP.
+``printf`` function is supported in HIP.
The following is a simple example to print information in the kernel.

.. code-block:: cpp
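
   // A minimal sketch; the guide's original snippet is collapsed in this diff.
   __global__ void say_hello()
   {
       printf("hello from thread %u of block %u\n", threadIdx.x, blockIdx.x);
   }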
@@ -2159,7 +2159,7 @@ GPU multiprocessors have a fixed pool of resources (primarily registers and shar
``__launch_bounds__`` supports two parameters:
- MAX_THREADS_PER_BLOCK - The programmer guarantees that the kernel will be launched with no more than MAX_THREADS_PER_BLOCK threads. (On NVCC this maps to the ``.maxntid`` PTX directive). If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time.
-The threads-per-block is the product of (blockDim.x * blockDim.y * blockDim.z).
+The threads-per-block is the product of (``blockDim.x * blockDim.y * blockDim.z``).
- MIN_WARPS_PER_EXECUTION_UNIT - directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multi-processor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler's resource usage.

When launching a kernel with HIP APIs, for example ``hipModuleLaunchKernel()``, HIP validates that the input kernel dimension size is not larger than the specified launch_bounds.
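
A sketch of the qualifier in use (the kernel body is hypothetical):

.. code-block:: cpp

   // At most 256 threads per block; aim for at least 2 active warps per
   // execution unit.
   __global__ void __launch_bounds__(256, 2) scale(float* data, float factor, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           data[i] *= factor;
   }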
@@ -2281,7 +2281,7 @@ We insert the GCN isa into the kernel using ``asm()`` Assembler statement.
``v_mac_f32_e32`` is the GCN instruction; for more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/)
The index of each operand, in order, is given by `%` followed by its position in the list of operands
`"v"` is the AMDGPU-specific constraint code for a 32-bit VGPR register; for more information, see the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list)
-Output Constraints are specified by an `"="` prefix as shown above ("=v"). This indicate that assembly will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix - just the constraint code. The constraint string of `"0"` says to use the assigned register for output as an input as well (it being the 0'th constraint).
+Output Constraints are specified by an `"="` prefix as shown above ("=v"). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the ``asm`` expression. Input constraints do not have a prefix - just the constraint code. The constraint string of `"0"` says to use the assigned register for output as an input as well (it being the 0'th constraint).
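
Putting these pieces together, a sketch of such a statement (the kernel wrapper is hypothetical):

.. code-block:: cpp

   __global__ void mac_kernel(float* out, const float* in, float a, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           // %0: output ("=v"); %1: tied to %0 via "0"; %2, %3: VGPR inputs ("v")
           asm volatile("v_mac_f32_e32 %0, %2, %3"
                        : "=v"(out[i])
                        : "0"(out[i]), "v"(in[i]), "v"(a));
       }
   }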

## C++ Support
The following C++ features are not supported:
