-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlearn-cuda-nd.html
254 lines (216 loc) · 10.2 KB
/
learn-cuda-nd.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- HTML Meta Tags -->
<title>Malaysia-AI blog learn CUDA N Dimensional</title>
<meta name="description" content="Malaysia-AI blog learn CUDA N Dimensional">
<!-- Facebook Meta Tags -->
<meta property="og:url" content="https://malaysia-ai.org/learn-cuda-nd">
<meta property="og:type" content="website">
<meta property="og:title" content="Malaysia-AI blog learn CUDA N Dimensional">
<meta property="og:description" content="Malaysia-AI blog learn CUDA N Dimensional">
<meta property="og:image" content="https://malaysia-ai.org/images/learn-cuda-nd.png">
<!-- Twitter Meta Tags -->
<meta name="twitter:card" content="summary_large_image">
<meta property="twitter:domain" content="malaysia-ai.org">
<meta property="twitter:url" content="https://malaysia-ai.org/learn-cuda-nd">
<meta name="twitter:title" content="Malaysia-AI blog learn CUDA N Dimensional">
<meta name="twitter:description" content="Malaysia-AI blog learn CUDA N Dimensional">
<meta name="twitter:image" content="https://malaysia-ai.org/images/learn-cuda-nd.png">
<!-- Meta Tags Generated via https://www.opengraph.xyz -->
<style>
body {
line-height: 1.4;
font-size: 16px;
padding: 0 10px;
margin: 50px auto;
max-width: 1000px;
}
#maincontent {
max-width: 62em;
margin: 15 auto;
}
pre {
margin-top: 0px;
white-space: break-spaces;
}
</style>
</head>
<body>
<div id="maincontent" style="margin-top: 70px">
<h2>learn CUDA N Dimension</h2>
<p>As we know, CUDA threads and blocks support up to three dimensional space only (x, y, z), but a tensor can go up
to N dimensional space! For an example, Multi-head Attention is 4D, (batch size, head dimension, sequence length,
dimension), or batch of videos is 5D (batch size, length video, height, width, depth), So how?</p>
<p>Let us look at 3D with shape(10, 10, 10),</p>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 700 500">
<!-- 3D Grid -->
<g id="grid">
<!-- Back face -->
<polygon points="100,100 400,100 400,400 100,400" fill="#f0f0f0" stroke="#000" stroke-width="2" />
<!-- Right face -->
<polygon points="400,100 475,175 475,475 400,400" fill="#e0e0e0" stroke="#000" stroke-width="2" />
<!-- Bottom face -->
<polygon points="100,400 400,400 475,475 175,475" fill="#d0d0d0" stroke="#000" stroke-width="2" />
</g>
<!-- 3D Blocks -->
<g id="blocks">
<!-- Back layer blocks -->
<rect x="120" y="120" width="80" height="80" fill="#b3d9ff" stroke="#0066cc" stroke-width="2" />
<rect x="220" y="120" width="80" height="80" fill="#b3d9ff" stroke="#0066cc" stroke-width="2" />
<rect x="120" y="220" width="80" height="80" fill="#b3d9ff" stroke="#0066cc" stroke-width="2" />
<rect x="220" y="220" width="80" height="80" fill="#b3d9ff" stroke="#0066cc" stroke-width="2" />
</g>
<!-- Threads (showing only in one block for clarity) -->
<g id="threads">
<circle cx="140" cy="140" r="5" fill="#ff9999" stroke="#cc0000" stroke-width="1" />
<circle cx="160" cy="140" r="5" fill="#ff9999" stroke="#cc0000" stroke-width="1" />
<circle cx="140" cy="160" r="5" fill="#ff9999" stroke="#cc0000" stroke-width="1" />
<circle cx="160" cy="160" r="5" fill="#ff9999" stroke="#cc0000" stroke-width="1" />
</g>
<!-- Labels and Annotations -->
<g id="labels" font-family="Arial" font-size="14">
<text x="400" y="50" text-anchor="middle" font-size="18" font-weight="bold">3D with shape(10, 10, 10)</text>
<!-- Axis labels -->
<text x="200" y="420" text-anchor="middle">blockIdx.x, 10</text>
<text x="80" y="250" text-anchor="middle" transform="rotate(-90 80,250)">blockIdx.y, 10</text>
<text x="560" y="420" text-anchor="middle" transform="rotate(-45 470,570)">blockIdx.z, 10</text>
<!-- Block coordinates -->
<text x="160" y="115" text-anchor="middle">(0,0,0)</text>
<text x="260" y="115" text-anchor="middle">(1,0,0)</text>
<text x="160" y="215" text-anchor="middle">(0,1,0)</text>
</g>
</svg>
<p>Actually you can reshape to become 2D by combining second and third dimensions to become (10, 100), yeah you can
say, of course second dimension is 100 because 10 x 10 is equal to 100, straight forward, but how about
geometrically? The intuitive of reshaping.</p>
<p>- We know that 3D got 1000 elements, 10 x 10 x 10.</p>
<p>- We know that 2D got 1000 elements, 10 x 100.</p>
<p>- To reshape 3D to become 2D, we basically take slices of 2D and put at the right side of 10 x 10, to
become
a rectangle.</p>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 600 150">
<!-- 3D cube -->
<g transform="translate(50,50) scale(0.8)">
<path d="M0,0 L60,0 L60,60 L0,60 Z" fill="none" stroke="black" />
<path d="M0,0 L20,-20 L80,-20 L60,0 Z" fill="none" stroke="black" />
<path d="M60,60 L80,40 L80,-20 L60,0 Z" fill="none" stroke="black" />
<text x="30" y="35" font-size="12" text-anchor="middle">10x10x10</text>
<!-- Correct slicing -->
<path d="M20,-20 L80,-20 L80,40 L20,40 Z" fill="rgba(255,0,0,0.2)" stroke="red" />
<path d="M16,-16 L76,-16 L76,44 L16,44 Z" fill="rgba(0,255,0,0.2)" stroke="green" />
<path d="M12,-12 L72,-12 L72,48 L12,48 Z" fill="rgba(0,0,255,0.2)" stroke="blue" />
<text x="50" y="-25" font-size="10">Back</text>
<text x="-10" y="75" font-size="10">Front</text>
</g>
<!-- Arrow -->
<path d="M130,80 L230,80" stroke="black" stroke-width="2" marker-end="url(#arrowhead)" />
<text x="200" y="70" font-size="12" text-anchor="middle">Reshape</text>
<!-- 2D rectangle -->
<rect x="270" y="65" width="300" height="30" fill="none" stroke="black" />
<text x="420" y="60" font-size="12" text-anchor="middle">10x100 2D Array</text>
<!-- Slices in 2D array -->
<rect x="540" y="65" width="30" height="30" fill="rgba(255,0,0,0.2)" stroke="red" />
<rect x="510" y="65" width="30" height="30" fill="rgba(0,255,0,0.2)" stroke="green" />
<rect x="480" y="65" width="30" height="30" fill="rgba(0,0,255,0.2)" stroke="blue" />
<rect x="270" y="65" width="210" height="30" fill="rgba(200,200,200,0.2)" stroke="gray" />
<!-- Arrowhead definition -->
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" />
</marker>
</defs>
</svg>
<p>Now you see the column is wider, become 100.</p>
<p>Because we combined from third dimension into second dimension, the slicing happened on third dimension, because
as we know, 3D is just an array of 2D, if we slice at first dimension, it is an array of height x depth, if we
slice at second dimension, it is an array of width x depth, and we slice at third dimension, it is an array of
height x width.</p>
<p>Same goes if you want to reshape 4D to become 3D, you are slicing cubes at 4th dimension and put at 3rd dimension
to become
longer cuboid.</p>
<p>Same goes if you want to reshape 5D to become 3D, you do slicing inside slicing!</p>
<p>To code 5D vector add operation, is simple, you just do `z = z1 * z2 * zn ...`, which `z1` is the third
dimension, `z2` is fourth dimension, `zn` is 3 + n dimension.</p>
<pre>
```cuda
#include <stdio.h>
#include <cuda_runtime.h>
// Kernel definition
__global__ void VecAdd(int *a, int *b, int *c, int rows, int columns, int z) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = blockIdx.z * blockDim.z + threadIdx.z;
int idx = k * rows * columns + j * columns + i;
c[idx] = a[idx] + b[idx];
}
int main()
{
int rows = 100;
int columns = 100;
int z1 = 2;
int z2 = 4;
int z3 = 5;
int z = z1 * z2 * z3;
size_t size = rows * columns * z * sizeof(int);
int *h_a, *h_b, *h_c;
int *d_a, *d_b, *d_c;
h_a = (int*)malloc(size);
h_b = (int*)malloc(size);
h_c = (int*)malloc(size);
for (int i = 0; i < rows * columns * z; i++) {
h_a[i] = i;
h_b[i] = i * 2;
}
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
// 8 * 8 * 8 = 512
dim3 threadsPerBlock(8, 8, 8);
dim3 numBlocks((columns + threadsPerBlock.x - 1) / threadsPerBlock.x,
(rows + threadsPerBlock.y - 1) / threadsPerBlock.y,
(z + threadsPerBlock.z - 1) / threadsPerBlock.z);
VecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, rows, columns, z);
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
for (int i = 0; i < rows * columns; i++) {
if (h_c[i] != h_a[i] + h_b[i]) {
printf("Error: %d + %d != %d\n", h_a[i], h_b[i], h_c[i]);
break;
}
}
free(h_a); free(h_b); free(h_c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```</pre>
<p>Save the complete code above above (the long one) as `test.cu`, and then you can continue to compile,</p>
<pre>
```bash
nvcc test.cu -o test
./test
```</pre>
<p>Let us try to use debugger,</p>
<pre>
```bash
nvcc -G -g -o test test.cu
cuda-gdb test
break VecAdd
run
```
```text
[New Thread 0x7ffbe6b9f000 (LWP 1306063)]
[New Thread 0x7ffbe588f000 (LWP 1306064)]
[Detaching after fork from child process 1306065]
[New Thread 0x7ffbe4d73000 (LWP 1306078)]
[New Thread 0x7ffbddfbd000 (LWP 1306079)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, VecAdd<<<(13,13,5),(8,8,8)>>> (a=0x7ffbb5a00000, b=0x7ffbb5c00000, c=0x7ffbb5e00000, rows=100, columns=100, z=40) at test.cu:6
6 int i = blockIdx.x * blockDim.x + threadIdx.x;
```</pre>
<p>No issue.</p>
</p>
</div>
</body>
</html>