From 4b94e62fe5523a296239b2762904284208708199 Mon Sep 17 00:00:00 2001
From: Istvan Kiss
Date: Tue, 28 May 2024 12:02:12 +0200
Subject: [PATCH] WIP

---
 .wordlist.txt                |  9 +++++++++
 docs/tutorials/reduction.rst |  2 +-
 docs/tutorials/saxpy.rst     |  8 ++++----
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index 45af247c0d..c0fee3d594 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -3,6 +3,7 @@ ALUs
 AmgX
 APU
 AQL
+AXPY
 Asynchrony
 backtrace
 Bitcode
@@ -23,6 +24,8 @@ EIGEN
 EIGEN's
 enqueue
 enqueues
+entrypoint
+entrypoints
 enum
 embeded
 extern
@@ -40,6 +43,7 @@ hipother
 HIPRTC
 hcBLAS
 icc
+IILE
 inplace
 Interoperation
 interoperate
@@ -67,6 +71,8 @@ NDRange
 nonnegative
 Numa
 Nsight
+overindex
+overindexing
 oversubscription
 preconditioners
 prefetched
@@ -80,13 +86,16 @@ ROCm's
 rocTX
 RTC
 RTTI
+SAXPY
 scalarizing
 sceneries
+shaders
 SIMT
 SPMV
 structs
 SYCL
 syntaxes
+tradeoffs
 typedefs
 WinGDB
 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
\ No newline at end of file
diff --git a/docs/tutorials/reduction.rst b/docs/tutorials/reduction.rst
index e8521e2a1f..8dd155237c 100644
--- a/docs/tutorials/reduction.rst
+++ b/docs/tutorials/reduction.rst
@@ -948,7 +948,7 @@ which all Multi Processors can access and is also on-chip memory.
    Processor for longer than necessary.
 
 Without launching a second kernel, have the last block collect the results of
-all other blocks from GDS (either implicitly exploiting the sceduling behavior
+all other blocks from GDS (either implicitly exploiting the scheduling behavior
 or relying on Global Wave Sync, yet another AMD-specific feature) to merge them
 for a final tree-like reduction.
 
diff --git a/docs/tutorials/saxpy.rst b/docs/tutorials/saxpy.rst
index 4ecefffd2d..8da7e95a50 100644
--- a/docs/tutorials/saxpy.rst
+++ b/docs/tutorials/saxpy.rst
@@ -26,7 +26,7 @@ Heterogenous Programming
 
 Heterogenous programming and offloading APIs are often mentioned together.
 Heterogenous programming deals with devices of varying capabilities at once
-while the term offloading focuses on the "remote" and asnychronous aspect of
+while the term offloading focuses on the "remote" and asynchronous aspect of
 the computation. HIP encompasses both: it exposes GPGPU (General Purpose GPU)
 programming much like ordinary host-side CPU programming and let's us move data
 to and from device as need be.
@@ -71,7 +71,7 @@ work, then issue:
 
   git clone https://github.com/amd/rocm-examples.git
 
-Inside the repo, you should find ``HIP-Basic\saxpy\main.hip`` which is a
+Inside the repository, you should find ``HIP-Basic\saxpy\main.hip``, which is a
 sufficiently simple implementation of SAXPY. It was already mentioned that HIP
 code will mostly deal with where and when data has to be and how devices will
 transform it. The very first HIP calls deal with
@@ -120,8 +120,8 @@ First let's discuss the signature of the offloaded function:
   entrypoint to a device program, such that it can be launched from the host.
 - The function does not return anything, because there is no trivial way to
   construct a return channel of a parallel invocation. Device-side entrypoints
-  may not return a value, their results should be communicated using out
-  params.
+  may not return a value, their results should be communicated using output
+  parameters.
 - Device-side functions are typically called compute kernels, or just kernels
   for short. This is to distinguish them from non-graphics-related graphics
   shaders, or just shaders for short.
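
For reference, the saxpy.rst hunk above describes the constraints on a device-side
entrypoint: it is marked __global__, it returns nothing, and results are passed back
through output parameters. The following is a minimal, self-contained sketch of such a
kernel and its launch; the names (saxpy_kernel), sizes, and the omission of error
checking are illustrative assumptions and are not taken from HIP-Basic\saxpy\main.hip.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Device-side entrypoint: marked __global__, returns void, and writes its
// result through the output pointer y instead of a return value.
__global__ void saxpy_kernel(const float a, const float* x, float* y,
                             const unsigned int size)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) // guard against overindexing the arrays
        y[tid] = a * x[tid] + y[tid];
}

int main()
{
    constexpr unsigned int size = 1u << 20;
    constexpr float a = 2.0f;
    std::vector<float> x(size, 1.0f), y(size, 3.0f);

    // Allocate device memory and copy the inputs over.
    float *d_x = nullptr, *d_y = nullptr;
    hipMalloc(&d_x, size * sizeof(float));
    hipMalloc(&d_y, size * sizeof(float));
    hipMemcpy(d_x, x.data(), size * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // Launch enough blocks to cover all elements.
    constexpr unsigned int block_size = 256;
    const unsigned int grid_size = (size + block_size - 1) / block_size;
    saxpy_kernel<<<grid_size, block_size>>>(a, d_x, d_y, size);

    // Copy the result back and release device memory.
    hipMemcpy(y.data(), d_y, size * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d_x);
    hipFree(d_y);

    std::printf("y[0] = %f\n", static_cast<double>(y[0])); // expected 2*1 + 3 = 5
    return 0;
}

Compiled with hipcc, the tid < size guard keeps the last block from running past the end
of the arrays when size is not a multiple of the block size.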