Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify par_dispatch, par_for_outer & par_for_inner overloads #1142

Open
wants to merge 113 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
113 commits
Select commit Hold shift + click to select a range
6995d11
wrap 3D flat loop abstractions
acreyes Jun 22, 2024
2e61847
add 4D loop and test
acreyes Jun 22, 2024
054ca0e
add specialization for const int &
Jun 22, 2024
4c83c4c
added mdrange loops to par_reduce tests
Jun 22, 2024
0730429
refactor flatloop specialization
acreyes Jun 22, 2024
adb15dd
clean up
acreyes Jun 23, 2024
5db0d34
formatting
acreyes Jun 23, 2024
ba335d7
linting
acreyes Jun 23, 2024
2fa8ad1
templating functor index types
acreyes Jun 23, 2024
b9a95a1
moved to a single functor
acreyes Jun 24, 2024
9033aa3
Update CHANGELOG.md
acreyes Jun 24, 2024
2d450c8
Merge branch 'develop' into acreyes/par_reduce-flatloops
acreyes Jun 25, 2024
503ae0c
first pass, doesn't like 4D loops
acreyes Jun 25, 2024
e01dabf
formatting
acreyes Jun 25, 2024
259115d
added overload for index ranges
acreyes Jun 25, 2024
ad53f40
generic par_dispatch for all flatrange loops
acreyes Jun 26, 2024
442c04f
wrapped 2D MDRange loop
acreyes Jun 26, 2024
a865f00
wrapped rest of MDRange loops
acreyes Jun 27, 2024
dd46b7e
enabled all simd loops
acreyes Jun 27, 2024
7c7ecc0
cleaning up some warnings
Jun 27, 2024
53f0f85
Merge remote-tracking branch 'upstream/develop' into acreyes/par-disp…
acreyes Jul 27, 2024
e9b440d
cleaning up templates & traits
acreyes Jun 27, 2024
2452b48
adding loop collapse patterns
acreyes Jun 29, 2024
188d413
wrapped team policy loops
acreyes Jul 1, 2024
5bb7764
separate inner loop collapses
acreyes Jul 1, 2024
1102470
Wrapping inner par_for loops
acreyes Jul 1, 2024
049bf52
simdfor inner loops
acreyes Jul 2, 2024
9a39c02
formatting
acreyes Jul 27, 2024
08e788f
cleaning up
acreyes Jul 27, 2024
8d1a5ca
infer loop rank from launch bounds rather than functor signature
acreyes Jul 27, 2024
3a32e84
helper type DispatchType to hold useuful parameters
acreyes Jul 27, 2024
c2ac94f
done with par_for_outer
acreyes Jul 27, 2024
d7477c7
cleanup
acreyes Jul 27, 2024
69cda38
adding tests
acreyes Jul 28, 2024
12297d2
completeing tests for par_for & par_reduce. Testing up to 7D loops
acreyes Jul 28, 2024
0905832
fixing sequence_of_ones
acreyes Jul 28, 2024
afc86c1
static_assert for par_scan
acreyes Jul 28, 2024
ceaac8f
modifying test to work on cuda machine
acreyes Jul 29, 2024
de6df61
workaround non-type template parameter usage
acreyes Jul 30, 2024
5954079
Merge branch 'parthenon-hpc-lab:develop' into acreyes/par-dispatch-te…
acreyes Aug 8, 2024
1d7719c
moved TypeList to those in type_list.hpp
acreyes Aug 8, 2024
7b25d89
use correct ThreadVectorRange policy
acreyes Aug 15, 2024
8ab0985
Merge branch 'parthenon-hpc-lab:develop' into acreyes/par-dispatch-te…
acreyes Aug 15, 2024
c16fc6e
cleanup
acreyes Aug 15, 2024
a25ffef
template execution space
acreyes Aug 15, 2024
7d49d4e
linting
acreyes Aug 15, 2024
72aa437
put simd pragma in correct place
acreyes Aug 15, 2024
ed0be07
ContinuousSubListImpl made public
acreyes Aug 15, 2024
fb5ccb2
Borrowing features from upstream/lroberts36/generalize-par-dispatch
acreyes Aug 21, 2024
aae696c
rebuilding dispatch signature
acreyes Aug 21, 2024
8fb8f5c
simdfor using indexer
acreyes Aug 21, 2024
066b3d7
generalized tag disptach
acreyes Aug 21, 2024
54a3c01
dispatch_collapse added for team patterns
acreyes Aug 21, 2024
b2a6e49
par_for_inner with new dispatch
acreyes Aug 21, 2024
5e3e8b2
mdrange dispatch
acreyes Aug 21, 2024
d626fed
array indexer indices
acreyes Aug 22, 2024
565717f
fix team policy scratch size
acreyes Aug 22, 2024
c53ec4f
base_type in functor check
acreyes Aug 22, 2024
14c098a
simdfor inner loop bounds fix
acreyes Aug 22, 2024
f6d9c21
get things working on cuda
acreyes Aug 23, 2024
be6ed04
extra args for hierarchial loops
acreyes Aug 24, 2024
b091860
extra args in team/collapse loops
acreyes Aug 24, 2024
b532cc7
kokkos array in indexer
Aug 24, 2024
8a74cba
cleanup
Aug 24, 2024
64cf179
fix unit test
acreyes Aug 24, 2024
001108f
static_asserts for simdfor loop patterns + seq_for
acreyes Aug 26, 2024
4662b1b
Merge branch 'acreyes/par-dispatch-template-temp' into acreyes/par-di…
acreyes Aug 26, 2024
520811a
Merge branch 'parthenon-hpc-lab:develop' into acreyes/par-dispatch-te…
acreyes Aug 26, 2024
bd9df9c
linting
acreyes Aug 27, 2024
ebae3cd
cleaning up & moving loop bounds to their own header
acreyes Aug 27, 2024
48664ac
fix unit tests for cuda 11.4
acreyes Aug 28, 2024
4887cfc
fixing missing return warnings from if constexpr blocks
acreyes Aug 28, 2024
afc381c
Merge branch 'develop' into acreyes/par-dispatch-template
Yurlungur Sep 5, 2024
842b94d
fix simdfor pattern check
acreyes Sep 5, 2024
beb6847
have Indexer only return Kokkos::Array
acreyes Sep 5, 2024
f68ec09
move kokkos typedefs to their own header
acreyes Sep 5, 2024
9aa1560
collapse pattern infers Nteam from Rank
acreyes Sep 5, 2024
f2cfd91
clean up team/collapse patterns
acreyes Sep 5, 2024
f255f71
making simdfor pattern uniform with others
acreyes Sep 6, 2024
c6f9de1
adding comments and cleaning up LoopPattern* type name use
acreyes Sep 6, 2024
f1403b1
SequentialFor added
acreyes Sep 6, 2024
b31b1dc
use Indexer in unit tests
acreyes Sep 6, 2024
4fbef8f
adding test coverage for outer loop patterns
acreyes Sep 6, 2024
0ee6005
fix par_for_outer tests on gpu
acreyes Sep 6, 2024
6e37753
cleanup scratch memory for par_for_outer test
acreyes Sep 6, 2024
1976164
adding example par_for* test cases
acreyes Sep 6, 2024
7c20e6d
LoopPatternCollapse -> LoopPatternTeamThreadVec
acreyes Sep 7, 2024
8758aef
cleaning up LoopBounds Translator
acreyes Sep 7, 2024
4fa3932
workaround for rtd theme table box line breaks
acreyes Sep 7, 2024
ce39ccc
par_for docs
acreyes Sep 7, 2024
a450c7c
count loop bounds more sensibly
acreyes Sep 7, 2024
85568d6
Revert "workaround for rtd theme table box line breaks"
acreyes Sep 8, 2024
052e39f
ParArray<ND, T> moved to kokkos_types.hpp
acreyes Sep 9, 2024
ae7a3b9
move is_functor -> concepts_lite.hpp
acreyes Sep 9, 2024
562b396
fix a typo
acreyes Sep 9, 2024
034efed
update changelog
acreyes Sep 9, 2024
ccf3e2d
fix scratch pad initialization for gcc 9.4
acreyes Sep 10, 2024
f4b2141
Merge remote-tracking branch 'upstream/develop' into acreyes/par-disp…
acreyes Sep 13, 2024
4cb73d1
Merge branch 'develop' into acreyes/par-dispatch-template
acreyes Sep 24, 2024
fbd2674
Merge remote-tracking branch 'upstream/develop' into acreyes/par-disp…
acreyes Oct 31, 2024
b175f41
Merge branch 'develop' into acreyes/par-dispatch-template
acreyes Nov 1, 2024
11616d8
fix merged test for device
acreyes Nov 13, 2024
1661218
Merge remote-tracking branch 'upstream/develop' into acreyes/par-disp…
acreyes Nov 13, 2024
e6c7b93
Merge branch 'develop' into acreyes/par-dispatch-template
pgrete Dec 2, 2024
28c3bee
fix for HtoD copies in par_for_outer
acreyes Dec 6, 2024
3dfc5cf
add mdrange tiling
acreyes Dec 6, 2024
2de2eab
docs for ParArray<ND, T>
acreyes Dec 6, 2024
206cd31
Update src/kokkos_abstraction.hpp
acreyes Dec 6, 2024
7bc731e
Update doc/sphinx/src/par_for.rst
acreyes Dec 6, 2024
c448372
Update doc/sphinx/src/par_for.rst
acreyes Dec 6, 2024
6152149
Merge branch 'develop' into acreyes/par-dispatch-template
Yurlungur Dec 7, 2024
10f3bc2
check par_reduce for hierarchical patterns
acreyes Dec 7, 2024
279b126
fill tiling array
acreyes Dec 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions doc/sphinx/src/par_for.rst
Original file line number Diff line number Diff line change
Expand Up @@ -66,11 +66,11 @@ When ommitted the ``DEFAULT_LOOP_PATTERN`` is used.
- Flattens all of the loops into a single ``Kokkos::RangePolicy``
* - ``loop_pattern_simdfor_tag``
- Maps to two C-style loops. The innermost gets decorated with a ``#pragma omp simd`` and the remaining
loops are flattened into a single C-style for looop. Only supported on CPU.
loops are flattened into a single C-style for loop. Only supported on CPU.
* - ``loop_pattern_mdrange_tag``
- Maps all the loop bounds onto a ``Kokkos::MDRangePolicy``
* - ``LoopPatternTeamThreadVec<Nt, Nv>()``
- Maps onto a hierarchial parrallel loop. The ``Nv`` inner loops are flattened onto a ``VectorRange`` policy,
- Maps onto a hierarchical parallel loop. The ``Nv`` inner loops are flattened onto a ``VectorRange`` policy,
the next ``Nt`` onto a ``ThreadRange`` policy, and the remaining loops are
flattened into an outer ``TeamThreadRange``. The specializations ``loop_pattern_[tpttr|tptvr|tpttrtvr]_tag`` correspond
to ``<1,0>``, ``<0,1>``, ``<1,1>`` respectively.
Expand Down
6 changes: 6 additions & 0 deletions doc/sphinx/src/parthenon_arrays.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,12 @@ where ``LayoutWrapper`` is currently hardcoded to
with the default execution space. If UVM is enabled, it is
``Kokkos::CudaUVMSpace``.

The above ``ParArray#D``s are also aliased with an integral template parameter

.. code:: c++
ParArray3D<Real> myArray("a 3d array", 6, 5, 4);
ParArray<3, Real> myArray2("same type as myArray", 6, 5, 4);

Parthenon Arbitrary-Dimensional Arrays
======================================

Expand Down
51 changes: 30 additions & 21 deletions src/kokkos_abstraction.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -105,26 +105,26 @@ constexpr InnerLoopPatternSimdFor inner_loop_pattern_simdfor_tag;

// trait to track if pattern requests any type of hierarchial parallelism
template <typename Pattern, typename T = void>
struct UsesHierarchialPar : std::false_type {
struct UsesHierarchicalPar : std::false_type {
static constexpr std::size_t Nvector = 0;
static constexpr std::size_t Nthread = 0;
};

template <std::size_t num_thread, std::size_t num_vector>
struct UsesHierarchialPar<LoopPatternTeamThreadVec<num_thread, num_vector>>
struct UsesHierarchicalPar<LoopPatternTeamThreadVec<num_thread, num_vector>>
: std::true_type {
static constexpr std::size_t Nthread = num_thread;
static constexpr std::size_t Nvector = num_vector;
};

template <>
struct UsesHierarchialPar<OuterLoopPatternTeams> : std::true_type {
struct UsesHierarchicalPar<OuterLoopPatternTeams> : std::true_type {
static constexpr std::size_t Nvector = 0;
static constexpr std::size_t Nthread = 0;
};

template <std::size_t num_vector>
struct UsesHierarchialPar<InnerLoopThreadVec<num_vector>> : std::true_type {
struct UsesHierarchicalPar<InnerLoopThreadVec<num_vector>> : std::true_type {
static constexpr std::size_t Nvector = num_vector;
};

Expand Down Expand Up @@ -191,10 +191,12 @@ struct DispatchType {
using Translator = LoopBoundTranslator<Bounds...>;
static constexpr std::size_t Rank = Translator::Rank;

using HierarchialPar = UsesHierarchialPar<Pattern>;
using HierarchicalPar = UsesHierarchicalPar<Pattern>;

static constexpr bool is_ParFor =
std::is_same<Tag, dispatch_impl::ParallelForDispatch>::value;
static constexpr bool is_ParRed =
std::is_same<Tag, dispatch_impl::ParallelReduceDispatch>::value;
static constexpr bool is_ParScan =
std::is_same<Tag, dispatch_impl::ParallelScanDispatch>::value;

Expand All @@ -219,11 +221,12 @@ struct DispatchType {
// for now this is guaranteed to be par_for_inner, when par_reduce_inner is
// supported need to check
return PT::simd;
} else if constexpr (IsMDRange) {
} else if constexpr (IsMDRange || is_ParRed) {
// par_reduce does not currently work with either team-based patterns
return PT::md;
} else if constexpr (std::is_same_v<Pattern, OuterLoopPatternTeams>) {
return PT::outer;
} else if constexpr (HierarchialPar::value) {
} else if constexpr (HierarchicalPar::value) {
return PT::collapse;
}

Expand Down Expand Up @@ -332,7 +335,7 @@ struct dispatch_collapse {
};

// builds a functor that uses inner hierarchial parrallelism used by both par_disp_inner &
// par_dipsatch for LoopPatternCollapse
// par_dispatch for LoopPatternCollapse
template <std::size_t Rank, std::size_t Nteam, std::size_t Nthread, std::size_t Nvector,
typename IdxTeam, typename Function, typename... ExtraFuncArgs>
KOKKOS_FORCEINLINE_FUNCTION auto
Expand Down Expand Up @@ -360,7 +363,7 @@ struct par_disp_inner_impl<Pattern, Function, TypeList<Bounds...>, TypeList<Args
Function function, Args &&...args) {
auto bound_arr = bound_translator().GetIndexRanges(std::forward<Bounds>(bounds)...);
constexpr bool isSimdFor = std::is_same_v<InnerLoopPatternSimdFor, Pattern>;
constexpr std::size_t Nvector = dispatch_type::HierarchialPar::Nvector;
constexpr std::size_t Nvector = dispatch_type::HierarchicalPar::Nvector;
constexpr std::size_t Nthread = Rank - Nvector;
constexpr auto pattern_tag = LoopPatternTag<dispatch_type::GetPatternTag()>();

Expand Down Expand Up @@ -409,7 +412,7 @@ struct par_dispatch_impl<Tag, Pattern, Function, TypeList<Bounds...>, TypeList<A
Function function, Args &&...args, const int scratch_level = 0,
const std::size_t scratch_size_in_bytes = 0) {
constexpr std::size_t Ninner =
dispatch_type::HierarchialPar::Nvector + dispatch_type::HierarchialPar::Nthread;
dispatch_type::HierarchicalPar::Nvector + dispatch_type::HierarchicalPar::Nthread;

constexpr auto pattern_tag = LoopPatternTag<dispatch_type::GetPatternTag()>();
static_assert(
Expand Down Expand Up @@ -486,11 +489,16 @@ struct par_dispatch_impl<Tag, Pattern, Function, TypeList<Bounds...>, TypeList<A
Args &&...args, const int scratch_level,
const std::size_t scratch_size_in_bytes) {
static_assert(sizeof...(InnerIs) == 0);
kokkos_dispatch(
Tag(), name,
Kokkos::MDRangePolicy<Kokkos::Rank<Rank>>(exec_space, {bound_arr[OuterIs].s...},
{(1 + bound_arr[OuterIs].e)...}),
function, std::forward<Args>(args)...);
constexpr std::size_t Nouter = sizeof...(OuterIs);
Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this working as expected?
If I infer the intent correctly, this should create an array initialized to 1 everywhere.
My compiler complains with a warning

/p/project/coldcluster/pgrete/athenapk/external/parthenon/src/kokkos_abstraction.hpp(493): warning #174-D: expression has no effect
      Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};

AFAIK default init doesn't work for arrays, so we might need sth like

    std::array<int, Nouter> tiling;
    tiling.fill(1);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was working for me, but better to avoid the warning.

The warning makes sense since (OuterIs, 1) will always just evaluate to 1.

tiling[Nouter - 1] = bound_arr[Nouter - 1].e + 1 - bound_arr[Nouter - 1].s;
kokkos_dispatch(Tag(), name,
Kokkos::Experimental::require(
Kokkos::MDRangePolicy<Kokkos::Rank<Rank>>(
exec_space, {bound_arr[OuterIs].s...},
{(1 + bound_arr[OuterIs].e)...}, tiling),
Kokkos::Experimental::WorkItemProperty::HintLightWeight),
function, std::forward<Args>(args)...);
}

// Flatten loop bounds into a single outer team_policy
Expand All @@ -500,13 +508,14 @@ struct par_dispatch_impl<Tag, Pattern, Function, TypeList<Bounds...>, TypeList<A
Kokkos::Array<IndexRange, Rank> bound_arr, Function function,
Args &&...args, const int scratch_level,
const std::size_t scratch_size_in_bytes) {
const auto idxer =
MakeIndexer(Kokkos::Array<IndexRange, sizeof...(OuterIs)>{bound_arr[OuterIs]...});
const std::size_t size = ((bound_arr[OuterIs].e - bound_arr[OuterIs].s + 1) * ...);
kokkos_dispatch(
Tag(), name,
team_policy(exec_space, idxer.size(), Kokkos::AUTO)
team_policy(exec_space, size, Kokkos::AUTO)
.set_scratch_size(scratch_level, Kokkos::PerTeam(scratch_size_in_bytes)),
KOKKOS_LAMBDA(team_mbr_t team_member, ExtraFuncArgs... fargs) {
const auto idxer = MakeIndexer(
Kokkos::Array<IndexRange, sizeof...(OuterIs)>{bound_arr[OuterIs]...});
const auto idx_arr = idxer.GetIdxArray(team_member.league_rank());
function(team_member, idx_arr[OuterIs]...,
std::forward<ExtraFuncArgs>(fargs)...);
Expand All @@ -524,9 +533,9 @@ struct par_dispatch_impl<Tag, Pattern, Function, TypeList<Bounds...>, TypeList<A
const std::size_t scratch_size_in_bytes) {
const auto idxer =
MakeIndexer(Kokkos::Array<IndexRange, sizeof...(OuterIs)>{bound_arr[OuterIs]...});
using HierarchialPar = typename dispatch_type::HierarchialPar;
constexpr std::size_t Nvector = HierarchialPar::Nvector;
constexpr std::size_t Nthread = HierarchialPar::Nthread;
using HierarchicalPar = typename dispatch_type::HierarchicalPar;
constexpr std::size_t Nvector = HierarchicalPar::Nvector;
constexpr std::size_t Nthread = HierarchicalPar::Nthread;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is Nouter here (given the Rank - Nvector - Nthread formula)?
I also tried to follow the MakeCollapse<Rank, Nouter trail below but I Nouter seems to become Nteam which is then not used anymore.
I'm probably missing sth here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one covers all the various TPT[RTV]R patterns. These are always 1 or 0 loops for either the vector or thread range loops in the inner pattern. Nouter is all the remaining loops that become flattened into an outer team policy loop.

kokkos_dispatch(
Tag(), name,
Expand Down
Loading