[alpaka] add element_stride class and test #190
79bb2fe to 4d7ee8a
Some notes from first read(s).
The code should also be formatted with clang-format (in principle, I can even do it myself just before the merge).
 * Class which simplifies "for" loops over elements index
 */
template <typename T, typename T_Acc>
class elements_with_stride {
The relationship between elements_with_stride and elements_with_stride_<N>d is not clear to me.
I missed the addition of the dimIndex argument, never mind.
The idea is that elements_with_stride should loop over a single index using a scalar variable; usually this is the 0th index (assuming a one-dimensional kernel), but it can be chosen with dimIndex.
By contrast, elements_with_stride_<N>d should loop over an N-dimensional space using a Vec3D variable.
In fact, now that it is clear that the platform and device are independent of the dimensionality, it makes sense to change elements_with_stride_<N>d to use a Vec<N>D instead of always a Vec3D.
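As a rough illustration of the single-index case, here is a minimal sketch of a range-based "for" helper over one index, in plain C++ without alpaka. The names strided_range, first, stride and extent are hypothetical and are not the interface of the PR's elements_with_stride; in the real class the starting index and the stride would presumably come from the accelerator's thread index and grid size along dimIndex.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1D strided element loop (not the PR's class).
// Assumes stride > 0; a stride of 0 would never reach end().
class strided_range {
public:
  class iterator {
  public:
    iterator(std::size_t index, std::size_t stride, std::size_t extent)
        : index_(index < extent ? index : extent), stride_(stride), extent_(extent) {}

    std::size_t operator*() const { return index_; }

    iterator& operator++() {
      // Advance by one stride, clamping to extent so that the
      // iterator compares equal to end() once the range is exhausted.
      index_ = (extent_ - index_ > stride_) ? index_ + stride_ : extent_;
      return *this;
    }

    bool operator!=(iterator const& other) const { return index_ != other.index_; }

  private:
    std::size_t index_;
    std::size_t stride_;
    std::size_t extent_;
  };

  strided_range(std::size_t first, std::size_t stride, std::size_t extent)
      : first_(first), stride_(stride), extent_(extent) {}

  iterator begin() const { return iterator(first_, stride_, extent_); }
  iterator end() const { return iterator(extent_, stride_, extent_); }

private:
  std::size_t first_;
  std::size_t stride_;
  std::size_t extent_;
};

// Collect the indices that one "thread" starting at `first` would visit.
inline std::vector<std::size_t> visited(std::size_t first, std::size_t stride, std::size_t extent) {
  std::vector<std::size_t> out;
  for (std::size_t i : strided_range(first, stride, extent)) {
    out.push_back(i);
  }
  return out;
}
```

With this sketch, a thread starting at index 1 with a stride of 4 and an extent of 10 visits the indices 1, 5 and 9, and a thread whose first index is already past the extent visits nothing.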
e34151e to d52117d
d216d22 to 5c4c43d
1b9a70e to d6588cf
Here is the comparison between Alpaka-CUDA and native CUDA for atomics and barriers. Each timing is the average and standard deviation over 10 runs. The tests used were also added in this PR.
For atomics, I used 256 threads/block:
NVIDIA V100: (results not preserved)
NVIDIA T4: (results not preserved)
For the syncThreads test, I used 1024 threads/block; for the threadfence test, I used 256 threads/block:
NVIDIA V100: (results not preserved)
NVIDIA T4: (results not preserved)
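For reference, the average and standard deviation over a set of run times can be computed along these lines. This is only a sketch: the function name mean_and_stddev is hypothetical, and it assumes the sample standard deviation (dividing by n - 1), which the comment above does not specify.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {mean, sample standard deviation} of the given run times.
// Assumption: the n - 1 normalisation is used; the PR does not say which one was.
inline std::pair<double, double> mean_and_stddev(std::vector<double> const& times) {
  if (times.empty()) {
    return {0., 0.};
  }
  double sum = 0.;
  for (double t : times) {
    sum += t;
  }
  double const mean = sum / static_cast<double>(times.size());
  if (times.size() < 2) {
    return {mean, 0.};
  }
  double squares = 0.;
  for (double t : times) {
    squares += (t - mean) * (t - mean);
  }
  return {mean, std::sqrt(squares / static_cast<double>(times.size() - 1))};
}
```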
a8eb537 to e2a58ab
8e0162e to cae4cfd
1667920 to 2c7c455
244006d to bed0543
b5e83b0 to 7184d76
Rebased and fixed conflicts.
7184d76 to 7b4051a
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
  index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
  index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u])
  index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
  index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u]) {
  return *this;
}
This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th index, while here it is over all three indices.
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u])
  index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
  index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
  index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
  index_[0u] += 1;

if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u]) {
  return *this;
}
This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th and 1st indices, while here it is over all three indices.
The order of the increments also seems different.
// increment the 3rd index and check its value
index_[2u] += 1;
if (index_[2u] == old_index_[2u] + blockDim[2u] || index_[2u] == extent_[2u])
  index_[2u] = old_index_[2u];

// if the 3rd index was reset, increment the 2nd index
if (index_[2u] == old_index_[2u])
  index_[1u] += 1;
if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
  index_[1u] = old_index_[1u];

// if the 3rd and 2nd indices were set, increment the first coordinate
if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
  index_[0u] += 1;
if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u] &&
    index_[2u] < extent_[2u]) {
  return *this;
}
The order of the increments is inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above.
Is this intended?
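One way to make the increment order explicit, independently of the back-end, is to write the carry as a single loop from the fastest-running index to the slowest. The sketch below is plain C++ with hypothetical names (Idx3, advance, enumerate); it assumes the last index runs fastest and walks a box [first, last), where last would correspond to the smaller of old_index_ + blockDim and extent_ in each dimension. It only illustrates the carry logic and is not the code in this PR; which index should run fastest for each back-end remains the open question here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Idx3 = std::array<std::size_t, 3>;

// Advance idx by one element inside the box [first, last).
// Assumption: the last index is the fastest-running one.
// Returns false once every element of the box has been visited.
inline bool advance(Idx3& idx, Idx3 const& first, Idx3 const& last) {
  for (std::size_t d = 3; d-- > 0;) {
    if (++idx[d] < last[d]) {
      return true;  // no carry needed in this dimension
    }
    idx[d] = first[d];  // reset and carry into the next slower dimension
  }
  return false;
}

// List all indices of the box in the order in which advance() visits them.
inline std::vector<Idx3> enumerate(Idx3 const& first, Idx3 const& last) {
  std::vector<Idx3> out;
  for (std::size_t d = 0; d < 3; ++d) {
    if (first[d] >= last[d]) {
      return out;  // empty box
    }
  }
  Idx3 idx = first;
  do {
    out.push_back(idx);
  } while (advance(idx, first, last));
  return out;
}
```

Reversing the direction of the loop over d in advance() would make the 0th index the fastest one instead, so the two back-end branches could share the same function and differ only in that choice.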
194c43d to 37df2db
Rebased etc.
Before: (content not preserved)
After: (content not preserved)
The new classes implement range-based for loops over element indices. In addition, I added a test for the new classes.