
[alpaka] add element_stride class and test #190

Open · antoniopetre wants to merge 5 commits into master from elements_stride
Conversation

antoniopetre (Contributor) commented:

The new classes implement a range-based for loop over element indices. In addition, I added a test for the new classes.
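
For illustration, a minimal sketch of the intended usage in a one-dimensional kernel; only elements_with_stride comes from this PR, while the functor boilerplate and buffer names are assumptions for the example:

// Hypothetical kernel: each thread walks over its share of the elements
// with a grid-wide stride, hiding the usual index arithmetic.
struct ScaleKernel {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const& acc, float const* in, float* out, uint32_t size) const {
    for (uint32_t index : elements_with_stride(acc, size)) {
      out[index] = 2.f * in[index];
    }
  }
};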

@makortel (Collaborator) left a comment:

Some notes from first read(s).

The code should also be formatted with clang-format (in principle even I can do it just before merge).

Comment on src/alpaka/AlpakaCore/alpakaWorkDivHelper.h:
* Class which simplifies "for" loops over elements index
*/
template <typename T, typename T_Acc>
class elements_with_stride {
Collaborator:
The relationship between elements_with_stride and elements_with_stride_<N>d is not clear to me.

Collaborator:

I missed the addition of the dimIndex argument; never mind.

Contributor:
The idea is that elements_with_stride should loop over a single index using a scalar variable; usually this is the 0th index (assuming a one-dimensional kernel), but it can be chosen via dimIndex. elements_with_stride_<N>d, on the other hand, should loop over an N-dimensional space using a Vec3D variable.

In fact, having clarified that the platform and device are independent of the dimensionality, it makes sense to change elements_with_stride_<N>d to use a Vec<N>D instead of always a Vec3D.
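
As a hedged sketch of that distinction (the exact helper spellings and the loop bodies are assumptions based on this discussion):

// 1-D flavour: yields a scalar index along a single dimension,
// the 0th by default or the one selected via dimIndex.
for (uint32_t i : elements_with_stride(acc, size)) {
  out[i] = 0.f;  // out is a placeholder buffer
}

// N-D flavour: yields a vector index spanning the whole space,
// e.g. a Vec3D for the 3-D variant.
for (auto const& ndindex : elements_with_stride_3d(acc, extent)) {
  out3d(ndindex[0u], ndindex[1u], ndindex[2u]) = 0.f;  // out3d is a placeholder view
}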

@antoniopetre marked this pull request as draft on June 28, 2021
@antoniopetre force-pushed the elements_stride branch 3 times, most recently from d216d22 to 5c4c43d on June 29, 2021
antoniopetre (Contributor, Author) commented on Jul 13, 2021:

Below is a comparison between Alpaka-CUDA and native CUDA for atomics and barriers. Each value is the average over 10 runs, together with its standard deviation. The tests used here were added in this PR as well.

For atomics, I used 256 threads/block:

NVidia V100:

| Alpaka Atomic, ~32k elems | Global Block | Shared Block | Global Grid | Shared Grid |
| --- | --- | --- | --- | --- |
| Native CUDA SP (s) | 7.3425 ± 0.0859 | 3.2331 ± 0.0079 | 29.9642 ± 0.0078 | 3.2767 ± 0.018 |
| Native CUDA DP (s) | 6.1387 ± 0.0447 | 3.9586 ± 0.0112 | 29.9637 ± 0.0047 | 3.9407 ± 0.0038 |
| Alpaka-GPU SP (s) | 7.3074 ± 0.057 | 3.332 ± 0.0104 | 29.9591 ± 0.0006 | 3.3435 ± 0.0168 |
| Alpaka-GPU DP (s) | 6.1735 ± 0.0266 | 3.9393 ± 0.0031 | 29.9601 ± 0.0042 | 3.9381 ± 0.0001 |
| Alpaka-CPU Serial, ~1k elements (s) | 7.6338 ± 0.1422 | 0.5545 ± 0.0048 | 7.5430 ± 0.0548 | 0.5547 ± 0.0087 |

NVidia T4:

| Alpaka Atomic, ~32k elems | Global Block | Shared Block | Global Grid | Shared Grid |
| --- | --- | --- | --- | --- |
| Native CUDA SP (s) | 7.2224 ± 0.0329 | 6.1258 ± 0.0261 | 25.9548 ± 1.7E-5 | 5.8128 ± 0.0151 |
| Native CUDA DP (s) | 5.776 ± 0.01384 | 32.522 ± 0.1215 | 25.9548 ± 1.6E-5 | 32.3397 ± 0.1153 |
| Alpaka-GPU SP (s) | 7.2365 ± 0.0278 | 5.983 ± 0.0436 | 25.9548 ± 4E-5 | 5.8331 ± 0.0443 |
| Alpaka-GPU DP (s) | 5.7868 ± 0.0120 | 32.4997 ± 0.1435 | 25.9548 ± 4.6E-5 | 32.4087 ± 0.1177 |
| Alpaka-CPU Serial, ~32k elements (s) (first results) | 254.906 | 18.2051 | 259.499 | 18.8554 |
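
For context, the column labels combine where the counter lives (global vs. shared memory) with the scope hint given to the atomic (block vs. grid). A minimal sketch of the "Global Grid" case, assuming alpaka's atomicAdd with hierarchy tags (alpaka 0.6+); the kernel is illustrative, not the test code added in this PR:

// Hypothetical atomic benchmark body: every thread adds into counters
// residing in global memory, with a grid-wide coherence hint.
// alpaka::hierarchy::Blocks{} instead would promise block-local
// contention only (the "Block" columns).
struct AtomicKernel {
  template <typename T_Acc>
  ALPAKA_FN_ACC void operator()(T_Acc const& acc, float* counters, uint32_t nbins, uint32_t n) const {
    for (uint32_t i : elements_with_stride(acc, n)) {
      alpaka::atomicAdd(acc, &counters[i % nbins], 1.f, alpaka::hierarchy::Grids{});
    }
  }
};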

For the syncThreads test, I used 1024 threads/block. For the threadFence tests, I used 256 threads/block:

NVidia V100:

| | syncThreads | w/o syncThreads | Global threadFence | w/o Global threadFence | Shared threadFence | w/o Shared threadFence |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaka-CUDA (s) | 0.072 ± 1.37E-5 | 5.92E-5 ± 4.78E-6 | 0.0326 ± 3.6E-5 | 0.0047 ± 2.2E-5 | 0.1056 ± 1.4E-4 | 0.0172 ± 6.6E-6 |
| Native CUDA (s) | 0.073 ± 7.85E-6 | 7.13E-5 ± 6.67E-6 | 0.0327 ± 5.4E-5 | 0.0046 ± 2.2E-5 | 0.1057 ± 1.17E-5 | 0.0171 ± 3.6E-6 |

NVidia T4:

| | syncThreads | w/o syncThreads | Global threadFence | w/o Global threadFence | Shared threadFence | w/o Shared threadFence |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaka-CUDA (s) | 0.1553 ± 6.2E-6 | 7.68E-5 ± 3.1E-6 | 0.0436 ± 0.0019 | 0.0276 ± 0.0039 | 0.3617 ± 0.0165 | 0.1423 ± 0.0027 |
| Native CUDA (s) | 0.1534 ± 0.0039 | 7.81E-5 ± 7.1E-6 | 0.0426 ± 0.0021 | 0.0217 ± 0.0039 | 0.3545 ± 0.013 | 0.142 ± 0.0029 |
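
For reference, on the native-CUDA side these columns correspond to the standard primitives; a hedged sketch of the kind of kernel being timed (not the PR's actual test code):

// Hypothetical barrier benchmark body, 256 threads/block assumed.
__global__ void barrierKernel(float* global_buf, int n) {
  __shared__ float shared_buf[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    shared_buf[threadIdx.x] = global_buf[i];
  __syncthreads();          // "syncThreads": block-wide barrier
  if (i < n) {
    global_buf[i] += shared_buf[threadIdx.x];
    __threadfence();        // "Global threadFence": device-wide ordering of writes
    shared_buf[threadIdx.x] += 1.f;
    __threadfence_block();  // "Shared threadFence": ordering within the block
  }
}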

@antoniopetre force-pushed the elements_stride branch 6 times, most recently from a8eb537 to e2a58ab on August 3, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 8e0162e to cae4cfd on August 16, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 1667920 to 2c7c455 on September 2, 2021
@antoniopetre marked this pull request as ready for review on September 2, 2021
@antoniopetre force-pushed the elements_stride branch 2 times, most recently from 244006d to bed0543 on September 10, 2021
fwyzard (Contributor) commented on Oct 14, 2021:

Rebased and fixed conflicts.

Comment on lines 470 to 487
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u]) {
return *this;
}
Contributor:

This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th index, while here it is over all three indices.

Comment on lines 592 to 609
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u]) {
return *this;
}
Contributor:

This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th and 1st indices, while here it is over all three indices.

Contributor:

The order of the increments also seems different.

Comment on lines 727 to 744
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u] || index_[2u] == extent_[2u])
index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
index_[0u] += 1;
if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u] &&
index_[2u] < extent_[2u]) {
return *this;
}
Contributor:

The order of the increments is inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above. Is this intended?
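
For comparison, a generic odometer-style advance over a 3-D window, written once so that every backend shares the same increment order; this is a sketch of the pattern with assumed names (Vec3D, first_, and so on), not a drop-in replacement for the code above:

// Advance index_ through a window of shape blockDim anchored at first_,
// clipped to extent_, incrementing the innermost (3rd) coordinate first
// and carrying into the outer ones; returns false when the window is done.
ALPAKA_FN_ACC bool advance(Vec3D& index_, Vec3D const& first_, Vec3D const& blockDim, Vec3D const& extent_) {
  for (int d = 2; d >= 0; --d) {
    index_[d] += 1;
    if (index_[d] < first_[d] + blockDim[d] && index_[d] < extent_[d])
      return true;          // this coordinate advanced without overflowing
    index_[d] = first_[d];  // overflow: reset it and carry into coordinate d-1
  }
  return false;             // every coordinate overflowed: iteration finished
}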

fwyzard (Contributor) commented on Oct 19, 2021:

Rebased, etc.
No impact on performance or correctness (tested with the serial and CUDA backends; the TBB backend is currently broken).

Before

$ for X in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 4000; done
Processing 4000 events, of which 16 concurrently, with 16 threads.
Processed 4000 events in 1.656752e+01 seconds, throughput 241.436 events/s.
Processed 4000 events in 1.652297e+01 seconds, throughput 242.087 events/s.
Processed 4000 events in 1.637139e+01 seconds, throughput 244.329 events/s.
Processed 4000 events in 1.643147e+01 seconds, throughput 243.435 events/s.

After

$ for X in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 4000; done
Processing 4000 events, of which 16 concurrently, with 16 threads.
Processed 4000 events in 1.635800e+01 seconds, throughput 244.529 events/s.
Processed 4000 events in 1.641191e+01 seconds, throughput 243.726 events/s.
Processed 4000 events in 1.640130e+01 seconds, throughput 243.883 events/s.
Processed 4000 events in 1.639645e+01 seconds, throughput 243.955 events/s.
