-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ShflUp, ShflDown and ShflXor #1924
Conversation
I assume older CUDA supported shuffle only for these types and the current version is supporting any type |
IMO As you suggested emulating the shufl_down, ... for all sub groups different than |
Ok, is it ok if I replace the two existing methods with the template? |
I'll try to implement it then |
I would keep it like it is, it looks like AMD is still only shipping float and int signatures for this function. |
Actually, I see the other types on a local ROCm installation:
|
363f5b7
to
0f996fc
Compare
I've rebased it and I think it is ready for review |
include/alpaka/warp/Traits.hpp
Outdated
{ | ||
using ImplementationBase = concepts::ImplementationBase<ConceptWarp, TWarp>; | ||
return trait::Shfl<ImplementationBase>::shfl(warp, value, srcLane, width ? width : getSize(warp)); | ||
} | ||
|
||
//! shfl for float vals | ||
//! Exchange data between threads within a warp. | ||
//! It copy from a lane with lower ID relative to caller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either
//! It copy from a lane with lower ID relative to caller. | |
//! It copies from a lane with lower ID relative to caller. |
or
//! It copy from a lane with lower ID relative to caller. | |
//! Copy from a lane with lower ID relative to caller. |
include/alpaka/warp/Traits.hpp
Outdated
//! __shared__ int32_t values[warpsize]; | ||
//! values[threadIdx.x] = value; | ||
//! __syncthreads(); | ||
//! return values[(-delta + width*floor(threadIdx.x/width))%width]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This formula seems wrong:
threadIdx.x/width
will be an integer between0
andblockDim.x / width
- passing it through
floor(...)
is unnecessary, and will be the same value width * floor(threadIdx.x/width)
will be a multiple ofwidth
(-delta + width*floor(threadIdx.x/width))
or(width*floor(threadIdx.x/width) - delta)
will be a multiple ofwidth
, minusdelta
- then,
(...) % width
is equivalent to(width - delta) % width
which is the same value independently of threadIdx.x
.
10feef9
to
3d2370a
Compare
include/alpaka/warp/Traits.hpp
Outdated
//! __shared__ int32_t values[warpsize]; | ||
//! values[threadIdx.x] = value; | ||
//! __syncthreads(); | ||
//! return values[width*(threadIdx.x/width) + threadIdx.x%width - delta]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
width*(threadIdx.x/width) + threadIdx.x%width - delta
is equal to threadIdx.x - delta
, so this is simply
//! return values[width*(threadIdx.x/width) + threadIdx.x%width - delta]; | |
//! return values[threadIdx.x - delta]; |
What was it supposed to be ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be something like
//! return (threadIdx.x % width >= delta) ? values[threadIdx.x - delta] : values[threadIdx.x];
int srcLane, | ||
std::int32_t width) -> float | ||
std::int32_t width) -> T | ||
{ | ||
# if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) | ||
return __shfl_sync(activemask(warp), val, srcLane, width); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this was already there, but shouldn't this be
return __shfl_sync(0xffffffff, val, srcLane, width);
instead ?
Possible preceded by a __syncwarp()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from #1273.
@psychocoderHPC what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, HIP does
#if CUDA_VERSION >= CUDA_9000
#define __shfl(...) __shfl_sync(0xffffffff, __VA_ARGS__)
#define __shfl_up(...) __shfl_up_sync(0xffffffff, __VA_ARGS__)
#define __shfl_down(...) __shfl_down_sync(0xffffffff, __VA_ARGS__)
#define __shfl_xor(...) __shfl_xor_sync(0xffffffff, __VA_ARGS__)
#endif // CUDA_VERSION >= CUDA_9000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with other backends may be return __shfl_sync(0xffffffff, val, srcLane, width);
but I am not fully sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a PR for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SimeonEhrig the OSX debug builds are failing with
Is it expected ? |
std::int32_t offset_int = static_cast<std::int32_t>(offset); | ||
auto const actual_group = warp.m_item_warp.get_sub_group(); | ||
auto actual_item_id = static_cast<std::int32_t>(actual_group.get_local_linear_id()); | ||
auto const actual_group_id = actual_item_id / width; | ||
auto const actual_src_id = actual_item_id - offset_int; | ||
auto const src = actual_src_id >= actual_group_id * width | ||
? sycl::id<1>{static_cast<std::size_t>(actual_src_id)} | ||
: sycl::id<1>{static_cast<std::size_t>(actual_item_id)}; | ||
return sycl::select_from_group(actual_group, value, src); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AuroraPerego do you think it would be worth checking if width
is the same as get_sub_group().get_max_local_range()
, and in that case call shift_group_right(actual_group, value, offset)
directly, instead of select_from_group
?
{ | ||
std::int32_t offset_int = static_cast<std::int32_t>(offset); | ||
auto const actual_group = warp.m_item_warp.get_sub_group(); | ||
auto actual_item_id = static_cast<std::int32_t>(actual_group.get_local_linear_id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one can also be const
, since all the other ones are.
warp::WarpGenericSycl<TDim> const& warp, | ||
T value, | ||
std::int32_t mask, | ||
std::int32_t /*width*/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CUDA version does make use of width:
__shfl_xor_sync()
calculates a source line ID by performing a bitwise XOR of the caller’s lane ID withlaneMask
: the value ofvar
held by the resulting lane ID is returned. Ifwidth
is less thanwarpSize
then each group ofwidth
consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value ofvar
will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
Can you implement the same behaviour ?
// /* If width < srcLane the sub-group needs to be split into assumed subdivisions. The first item of | ||
// each | ||
// subdivision has the assumed index 0. The srcLane index is relative to the subdivisions. | ||
|
||
Example: If we assume a sub-group size of 32 and a width of 16 we will receive two subdivisions: | ||
The first starts at sub-group index 0 and the second at sub-group index 16. For srcLane = 4 the | ||
first subdivision will access the value at sub-group index 4 and the second at sub-group index 20. */ | ||
// Example: If we assume a sub-group size of 32 and a width of 16 we will receive two subdivisions: | ||
// The first starts at sub-group index 0 and the second at sub-group index 16. For srcLane = 4 the | ||
// first subdivision will access the value at sub-group index 4 and the second at sub-group | ||
// index 20. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the double comment ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
auto const actual_src_id = static_cast<std::size_t>(srcLane + actual_group_id * width); | ||
auto const src = sycl::id<1>{actual_src_id}; | ||
std::uint32_t const w = static_cast<std::uint32_t>(width); | ||
unsigned int const start_index = actual_group.get_local_linear_id() / w * w; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned int const start_index = actual_group.get_local_linear_id() / w * w; | |
std::uint32_t const start_index = actual_group.get_local_linear_id() / w * w; |
for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed also the others
2babf41
to
7b92bcd
Compare
std::int32_t width) | ||
{ | ||
auto const actual_group = warp.m_item_warp.get_sub_group(); | ||
std::uint32_t const w = static_cast<std::uint32_t>(width); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you simply use width
directly ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it gives a warning when doing operations with the thread id that is unsigned:
conversion to 'uint32_t' {aka 'unsigned int'} from 'int32_t' {aka 'int'} may change the sign of the result [-Wsign-conversion]
I think you should update the license at the top of /* Copyright 2023 Jan Stephan, Luca Ferragina, Andrea Bocci, Aurora Perego
* SPDX-License-Identifier: MPL-2.0
+ *
+ * The implementations of Shfl::shfl(), ShflUp::shfl_up(), ShflDown::shfl_down() and ShflXor::shfl_xor() are derived from Intel DPCT.
+ * Copyright (C) Intel Corporation
+ * SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+ * See https://llvm.org/LICENSE.txt for license information.
*/ |
63be4dd
to
063dfe1
Compare
@psychocoderHPC can you merge this PR ? |
I removed the XCode test from the require test list and will have a short look to this PR. |
Following the logic of
alpaka::shfl
, I've implemented also the other three methods:alpaka::shfl_up
,alpaka::shfl_down
,alpaka::shfl_xor
.I've also added the tests for those methods.
I have two comments on that:
alpaka::shfl
method and having defined two different methods, one forstd::int_32t
and the other one forfloat
? (btw I think that alsounsigned
,long
anddouble
should be added if we don't want to template the methods)scrLane
based on the parameters given in input toalpaka::shfl_up
,alpaka::shfl_down
,alpaka::shfl_xor
and then usealpaka::shfl
whenwidth != sub_group_size
.Last thing, when running the tests I noticed that, when selecting the CUDA backend, sometimes the kernel is not executed (I put an
assert(false)
in a kernel and all the tests passed without crashing). I don't know why, maybe that could be discussed in a separated issue.