
[alpaka] add element_stride class and test #190

Open · antoniopetre wants to merge 5 commits into master from elements_stride
Conversation

antoniopetre (Contributor) commented:

The new classes implement a range-based for loop over element indices. In addition, I added a test for the new classes.
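
For illustration, a minimal sketch of the intended usage in a one-dimensional kernel; only elements_with_stride comes from this PR, while the functor boilerplate and buffer names are assumptions for the example:

// Hypothetical kernel: each thread walks over its share of the elements
// with a grid-wide stride, hiding the usual index arithmetic.
struct ScaleKernel {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const& acc, float const* in, float* out, uint32_t size) const {
    for (uint32_t index : elements_with_stride(acc, size)) {
      out[index] = 2.f * in[index];
    }
  }
};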

@makortel (Collaborator) left a comment:

Some notes from first read(s).

The code should also be formatted with clang-format (in principle even I can do it just before merge).

Comment on src/alpaka/AlpakaCore/alpakaWorkDivHelper.h:
* Class which simplifies "for" loops over elements index
*/
template <typename T, typename T_Acc>
class elements_with_stride {
Collaborator:
The relationship between elements_with_stride and elements_with_stride_<N>d is not clear to me.

Collaborator:

I missed the addition of the dimIndex argument; never mind.

Contributor:
The idea is that elements_with_stride should loop over a single index using a scalar variable; usually this is the 0th index (assuming a one-dimensional kernel), but it can be chosen via dimIndex. elements_with_stride_<N>d, on the other hand, should loop over an N-dimensional space using a Vec3D variable.

In fact, having clarified that the platform and device are independent of the dimensionality, it makes sense to change elements_with_stride_<N>d to use a Vec<N>D instead of always a Vec3D.
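
As a hedged sketch of that distinction (the exact helper spellings and the loop bodies are assumptions based on this discussion):

// 1-D flavour: yields a scalar index along a single dimension,
// the 0th by default or the one selected via dimIndex.
for (uint32_t i : elements_with_stride(acc, size)) {
  out[i] = 0.f;  // out is a placeholder buffer
}

// N-D flavour: yields a vector index spanning the whole space,
// e.g. a Vec3D for the 3-D variant.
for (auto const& ndindex : elements_with_stride_3d(acc, extent)) {
  out3d(ndindex[0u], ndindex[1u], ndindex[2u]) = 0.f;  // out3d is a placeholder view
}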

@antoniopetre marked this pull request as draft on June 28, 2021
@antoniopetre force-pushed the elements_stride branch 3 times, most recently from d216d22 to 5c4c43d on June 29, 2021
antoniopetre (Contributor, Author) commented on Jul 13, 2021:

Below is a comparison between Alpaka-CUDA and native CUDA for atomics and barriers. Each value is the average over 10 runs, together with its standard deviation. The tests used here were added in this PR as well.

For atomics, I used 256 threads/block:

NVidia V100:

| Alpaka Atomic, ~32k elems | Global Block | Shared Block | Global Grid | Shared Grid |
| --- | --- | --- | --- | --- |
| Native CUDA SP (s) | 7.3425 ± 0.0859 | 3.2331 ± 0.0079 | 29.9642 ± 0.0078 | 3.2767 ± 0.018 |
| Native CUDA DP (s) | 6.1387 ± 0.0447 | 3.9586 ± 0.0112 | 29.9637 ± 0.0047 | 3.9407 ± 0.0038 |
| Alpaka-GPU SP (s) | 7.3074 ± 0.057 | 3.332 ± 0.0104 | 29.9591 ± 0.0006 | 3.3435 ± 0.0168 |
| Alpaka-GPU DP (s) | 6.1735 ± 0.0266 | 3.9393 ± 0.0031 | 29.9601 ± 0.0042 | 3.9381 ± 0.0001 |
| Alpaka-CPU Serial, ~1k elements (s) | 7.6338 ± 0.1422 | 0.5545 ± 0.0048 | 7.5430 ± 0.0548 | 0.5547 ± 0.0087 |

NVidia T4:

| Alpaka Atomic, ~32k elems | Global Block | Shared Block | Global Grid | Shared Grid |
| --- | --- | --- | --- | --- |
| Native CUDA SP (s) | 7.2224 ± 0.0329 | 6.1258 ± 0.0261 | 25.9548 ± 1.7E-5 | 5.8128 ± 0.0151 |
| Native CUDA DP (s) | 5.776 ± 0.01384 | 32.522 ± 0.1215 | 25.9548 ± 1.6E-5 | 32.3397 ± 0.1153 |
| Alpaka-GPU SP (s) | 7.2365 ± 0.0278 | 5.983 ± 0.0436 | 25.9548 ± 4E-5 | 5.8331 ± 0.0443 |
| Alpaka-GPU DP (s) | 5.7868 ± 0.0120 | 32.4997 ± 0.1435 | 25.9548 ± 4.6E-5 | 32.4087 ± 0.1177 |
| Alpaka-CPU Serial, ~32k elements (s) (first results) | 254.906 | 18.2051 | 259.499 | 18.8554 |
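
For context, the column labels combine where the counter lives (global vs. shared memory) with the scope hint given to the atomic (block vs. grid). A minimal sketch of the "Global Grid" case, assuming alpaka's atomicAdd with hierarchy tags (alpaka 0.6+); the kernel is illustrative, not the test code added in this PR:

// Hypothetical atomic benchmark body: every thread adds into counters
// residing in global memory, with a grid-wide coherence hint.
// alpaka::hierarchy::Blocks{} instead would promise block-local
// contention only (the "Block" columns).
struct AtomicKernel {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const& acc, float* counters, uint32_t nbins, uint32_t n) const {
    for (uint32_t i : elements_with_stride(acc, n)) {
      alpaka::atomicAdd(acc, &counters[i % nbins], 1.f, alpaka::hierarchy::Grids{});
    }
  }
};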

For the syncThreads test, I used 1024 threads/block. For the threadFence tests, I used 256 threads/block:

NVidia V100:

| | syncThreads | w/o syncThreads | Global threadFence | w/o Global threadFence | Shared threadFence | w/o Shared threadFence |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaka-CUDA (s) | 0.072 ± 1.37E-5 | 5.92E-5 ± 4.78E-6 | 0.0326 ± 3.6E-5 | 0.0047 ± 2.2E-5 | 0.1056 ± 1.4E-4 | 0.0172 ± 6.6E-6 |
| Native CUDA (s) | 0.073 ± 7.85E-6 | 7.13E-5 ± 6.67E-6 | 0.0327 ± 5.4E-5 | 0.0046 ± 2.2E-5 | 0.1057 ± 1.17E-5 | 0.0171 ± 3.6E-6 |

NVidia T4:

| | syncThreads | w/o syncThreads | Global threadFence | w/o Global threadFence | Shared threadFence | w/o Shared threadFence |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaka-CUDA (s) | 0.1553 ± 6.2E-6 | 7.68E-5 ± 3.1E-6 | 0.0436 ± 0.0019 | 0.0276 ± 0.0039 | 0.3617 ± 0.0165 | 0.1423 ± 0.0027 |
| Native CUDA (s) | 0.1534 ± 0.0039 | 7.81E-5 ± 7.1E-6 | 0.0426 ± 0.0021 | 0.0217 ± 0.0039 | 0.3545 ± 0.013 | 0.142 ± 0.0029 |
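
For reference, on the native-CUDA side these columns correspond to the standard primitives; a hedged sketch of the kind of kernel being timed (not the PR's actual test code):

// Hypothetical barrier benchmark body, 256 threads/block assumed.
__global__ void barrierKernel(float* global_buf, int n) {
  __shared__ float shared_buf[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    shared_buf[threadIdx.x] = global_buf[i];
  __syncthreads();          // "syncThreads": block-wide barrier
  if (i < n) {
    global_buf[i] += shared_buf[threadIdx.x];
    __threadfence();        // "Global threadFence": device-wide ordering of writes
    shared_buf[threadIdx.x] += 1.f;
    __threadfence_block();  // "Shared threadFence": ordering within the block
  }
}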

@antoniopetre force-pushed the elements_stride branch 6 times, most recently from a8eb537 to e2a58ab on August 3, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 8e0162e to cae4cfd on August 16, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 1667920 to 2c7c455 on September 2, 2021
@antoniopetre marked this pull request as ready for review on September 2, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 244006d to bed0543 on September 10, 2021
fwyzard (Contributor) commented on Oct 14, 2021:

Rebased and fixed conflicts.

Comment on lines 470 to 487
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u]) {
return *this;
}
Contributor:

This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th index, while here it is over all three indices.

Comment on lines 592 to 609
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u]) {
return *this;
}
Contributor:

This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th and 1st indices, while here it is over all three indices.

Contributor:

The order of the increments also seems different.

Comment on lines 727 to 744
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u] || index_[2u] == extent_[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;
if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u] &&
index_[2u] < extent_[2u]) {
return *this;
}
Contributor:

The order of the increments is inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above. Is this intended?
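
For comparison, a generic odometer-style advance over a 3-D window, written once so that every backend shares the same increment order; this is a sketch of the pattern with assumed names (Vec3D, first_, and so on), not a drop-in replacement for the code above:

// Advance index_ through a window of shape blockDim anchored at first_,
// clipped to extent_, incrementing the innermost (3rd) coordinate first
// and carrying into the outer ones; returns false when the window is done.
ALPAKA_FN_ACC bool advance(Vec3D& index_, Vec3D const& first_, Vec3D const& blockDim, Vec3D const& extent_) {
  for (int d = 2; d >= 0; --d) {
    index_[d] += 1;
    if (index_[d] < first_[d] + blockDim[d] && index_[d] < extent_[d])
      return true;          // this coordinate advanced without overflowing
    index_[d] = first_[d];  // overflow: reset it and carry into coordinate d-1
  }
  return false;             // every coordinate overflowed: iteration finished
}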

fwyzard (Contributor) commented on Oct 19, 2021:

Rebased, etc.
No impact on performance or correctness (tested with the serial and CUDA backends; the TBB backend is currently broken).

Before

$ for X in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 4000; done
Processing 4000 events, of which 16 concurrently, with 16 threads.
Processed 4000 events in 1.656752e+01 seconds, throughput 241.436 events/s.
Processed 4000 events in 1.652297e+01 seconds, throughput 242.087 events/s.
Processed 4000 events in 1.637139e+01 seconds, throughput 244.329 events/s.
Processed 4000 events in 1.643147e+01 seconds, throughput 243.435 events/s.

After

$ for X in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 4000; done
Processing 4000 events, of which 16 concurrently, with 16 threads.
Processed 4000 events in 1.635800e+01 seconds, throughput 244.529 events/s.
Processed 4000 events in 1.641191e+01 seconds, throughput 243.726 events/s.
Processed 4000 events in 1.640130e+01 seconds, throughput 243.883 events/s.
Processed 4000 events in 1.639645e+01 seconds, throughput 243.955 events/s.
