From 4b94e62fe5523a296239b2762904284208708199 Mon Sep 17 00:00:00 2001
From: Istvan Kiss
Date: Tue, 28 May 2024 12:02:12 +0200
Subject: [PATCH] WIP

---
 .wordlist.txt                |  9 +++++++++
 docs/tutorials/reduction.rst |  2 +-
 docs/tutorials/saxpy.rst     |  8 ++++----
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index 45af247c0d..c0fee3d594 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -3,6 +3,7 @@ ALUs
 AmgX
 APU
 AQL
+AXPY
 Asynchrony
 backtrace
 Bitcode
@@ -23,6 +24,8 @@ EIGEN
 EIGEN's
 enqueue
 enqueues
+entrypoint
+entrypoints
 enum
 embeded
 extern
@@ -40,6 +43,7 @@ hipother
 HIPRTC
 hcBLAS
 icc
+IILE
 inplace
 Interoperation
 interoperate
@@ -67,6 +71,8 @@ NDRange
 nonnegative
 Numa
 Nsight
+overindex
+overindexing
 oversubscription
 preconditioners
 prefetched
@@ -80,13 +86,16 @@ ROCm's
 rocTX
 RTC
 RTTI
+SAXPY
 scalarizing
 sceneries
+shaders
 SIMT
 SPMV
 structs
 SYCL
 syntaxes
+tradeoffs
 typedefs
 WinGDB
 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
\ No newline at end of file
diff --git a/docs/tutorials/reduction.rst b/docs/tutorials/reduction.rst
index e8521e2a1f..8dd155237c 100644
--- a/docs/tutorials/reduction.rst
+++ b/docs/tutorials/reduction.rst
@@ -948,7 +948,7 @@ which all Multi Processors can access and is also on-chip memory.
    Processor for longer than necessary.
 
 Without launching a second kernel, have the last block collect the results of
-all other blocks from GDS (either implicitly exploiting the sceduling behavior
+all other blocks from GDS (either implicitly exploiting the scheduling behavior
 or relying on Global Wave Sync, yet another AMD-specific feature) to merge them
 for a final tree-like reduction.
 
diff --git a/docs/tutorials/saxpy.rst b/docs/tutorials/saxpy.rst
index 4ecefffd2d..8da7e95a50 100644
--- a/docs/tutorials/saxpy.rst
+++ b/docs/tutorials/saxpy.rst
@@ -26,7 +26,7 @@ Heterogenous Programming
 
 Heterogenous programming and offloading APIs are often mentioned together.
 Heterogenous programming deals with devices of varying capabilities at once
-while the term offloading focuses on the "remote" and asnychronous aspect of
+while the term offloading focuses on the "remote" and asynchronous aspect of
 the computation. HIP encompasses both: it exposes GPGPU (General Purpose GPU)
 programming much like ordinary host-side CPU programming and let's us move data
 to and from device as need be.
@@ -71,7 +71,7 @@ work, then issue:
 
   git clone https://github.com/amd/rocm-examples.git
 
-Inside the repo, you should find ``HIP-Basic\saxpy\main.hip`` which is a
+Inside the repository, you should find ``HIP-Basic\saxpy\main.hip``, which is a
 sufficiently simple implementation of SAXPY. It was already mentioned that HIP
 code will mostly deal with where and when data has to be and how devices will
 transform it. The very first HIP calls deal with
@@ -120,8 +120,8 @@ First let's discuss the signature of the offloaded function:
   entrypoint to a device program, such that it can be launched from the host.
 - The function does not return anything, because there is no trivial way to
   construct a return channel of a parallel invocation. Device-side entrypoints
-  may not return a value, their results should be communicated using out
-  params.
+  may not return a value, their results should be communicated using output
+  parameters.
 - Device-side functions are typically called compute kernels, or just kernels
   for short. This is to distinguish them from non-graphics-related graphics
   shaders, or just shaders for short.
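
For reference, the saxpy.rst hunk above describes the constraints on a device-side
entrypoint: it is marked __global__, it returns nothing, and results are passed back
through output parameters. The following is a minimal, self-contained sketch of such a
kernel and its launch; the names (saxpy_kernel), sizes, and the omission of error
checking are illustrative assumptions and are not taken from HIP-Basic\saxpy\main.hip.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Device-side entrypoint: marked __global__, returns void, and writes its
// result through the output pointer y instead of a return value.
__global__ void saxpy_kernel(const float a, const float* x, float* y,
                             const unsigned int size)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) // guard against overindexing the arrays
        y[tid] = a * x[tid] + y[tid];
}

int main()
{
    constexpr unsigned int size = 1u << 20;
    constexpr float a = 2.0f;
    std::vector<float> x(size, 1.0f), y(size, 3.0f);

    // Allocate device memory and copy the inputs over.
    float *d_x = nullptr, *d_y = nullptr;
    hipMalloc(&d_x, size * sizeof(float));
    hipMalloc(&d_y, size * sizeof(float));
    hipMemcpy(d_x, x.data(), size * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // Launch enough blocks to cover all elements.
    constexpr unsigned int block_size = 256;
    const unsigned int grid_size = (size + block_size - 1) / block_size;
    saxpy_kernel<<<grid_size, block_size>>>(a, d_x, d_y, size);

    // Copy the result back and release device memory.
    hipMemcpy(y.data(), d_y, size * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d_x);
    hipFree(d_y);

    std::printf("y[0] = %f\n", static_cast<double>(y[0])); // expected 2*1 + 3 = 5
    return 0;
}

Compiled with hipcc, the tid < size guard keeps the last block from running past the end
of the arrays when size is not a multiple of the block size.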