
[Breaking Change] Tasking rewrite #987

Merged
merged 34 commits, Jan 24, 2024

Commits:
bee5684
trying to use new tasks
jdolence Dec 11, 2023
e881ad9
Merge branch 'lroberts36/bugfix-sparse-cache' into jdolence/new_tasking
jdolence Dec 14, 2023
90f3e59
remove debugging
jdolence Dec 14, 2023
92564e1
formatting
jdolence Dec 14, 2023
6fde57d
remove raw mpi.hpp include
jdolence Dec 14, 2023
2320c0e
style
jdolence Dec 14, 2023
95818ba
more style
jdolence Dec 14, 2023
d602a35
and more style
jdolence Dec 14, 2023
10a67f1
ok thats enough
jdolence Dec 14, 2023
23803d0
actually remove the old task stuff
jdolence Dec 14, 2023
a4db040
formatting
jdolence Dec 14, 2023
8b7d42a
maybe last style commit...
jdolence Dec 14, 2023
52f0d5a
oops, includes inside parthenon namespace
jdolence Dec 14, 2023
e6eb2e3
update TaskID unit test
jdolence Dec 14, 2023
ce7a6bb
missing header
jdolence Dec 14, 2023
1ddc2e0
port the poisson examples
jdolence Dec 15, 2023
0bd54cf
try to fix serial builds
jdolence Dec 15, 2023
6082812
clean up branching in `|` operator of TaskID
jdolence Dec 15, 2023
07ae71a
rename Queue ThreadQueue
jdolence Dec 15, 2023
c1dbcb3
formatting
jdolence Dec 15, 2023
fbbe02a
try to fix builds with threads
jdolence Dec 15, 2023
d39a31a
update tasking docs
jdolence Dec 18, 2023
b074ee6
formatting and update changelog
jdolence Dec 18, 2023
829e047
address review comments
jdolence Jan 9, 2024
fc16f0f
merge develop
jdolence Jan 9, 2024
b400c11
style
jdolence Jan 9, 2024
9957538
add a comment about the dependent variable in Task
jdolence Jan 9, 2024
6a33dd6
address review comments
jdolence Jan 19, 2024
bf290fc
Merge branch 'develop' into jdolence/new_tasking
jdolence Jan 19, 2024
6029f7d
add TaskQualifier to driver prelude
jdolence Jan 19, 2024
ae047de
move using statement
jdolence Jan 19, 2024
cf59020
fix bug in ThreadQueue
jdolence Jan 23, 2024
dc16a32
set final_residual in gmg and bicgstab even if they exit by reaching …
jdolence Jan 23, 2024
18628be
fix serial case for tasks marked completion and global_sync
jdolence Jan 24, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,7 @@
## Current develop

### Added (new features/APIs/variables/...)
- [[PR 987]](https://github.com/parthenon-hpc-lab/parthenon/pull/987) New tasking infrastructure and capabilities
- [[PR 969]](https://github.com/parthenon-hpc-lab/parthenon/pull/969) New macro-based auto-naming of profiling regions and kernels
- [[PR 981]](https://github.com/parthenon-hpc-lab/parthenon/pull/981) Add IndexSplit
- [[PR 983]](https://github.com/parthenon-hpc-lab/parthenon/pull/983) Add Contains to SparsePack
@@ -23,6 +24,7 @@
### Removed (removing behavior/API/variables/...)

### Incompatibilities (i.e. breaking changes)
- [[PR 987]](https://github.com/parthenon-hpc-lab/parthenon/pull/987) Change the API for what was IterativeTasks
- [[PR 974]](https://github.com/parthenon-hpc-lab/parthenon/pull/974) Change GetParentPointer to always return T*


3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -116,6 +116,9 @@
endif()
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
find_package(Filesystem REQUIRED COMPONENTS Experimental Final)

# Require threading for tasks
find_package(Threads)

set(ENABLE_MPI OFF)
set(NUM_MPI_PROC_TESTING "4" CACHE STRING "Number of mpi processors to use when running tests with MPI")
if (NOT PARTHENON_DISABLE_MPI)
210 changes: 120 additions & 90 deletions doc/sphinx/src/tasks.rst
@@ -3,85 +3,84 @@
Tasks
=====

Parthenon's tasking infrastructure is how downstream applications describe
and execute their work. Tasks are organized into a hierarchy of objects.
``TaskCollection``s have one or more ``TaskRegion``s, ``TaskRegion``s have
one or more ``TaskList``s, and ``TaskList``s can have one or more sublists
(that are themselves ``TaskList``s).

Task
----

Though downstream codes never have to interact with the ``Task`` object directly,
it's useful to describe nonetheless. A ``Task`` object is essentially a functor
that stores the necessary data to invoke a downstream code's functions with
the desired arguments. Importantly, however, it also stores information that
relates itself to other tasks, namely the tasks that must be complete before
it should execute and the tasks that may be available to run after it completes.
In other words, ``Task``s are nodes in a directed (possibly cyclic) graph, and
each stores the edges that connect to it and emerge from it.

[Review discussion]
Collaborator: Can a TL be cyclic and ever run to completion? I thought being cyclic implied that it would never complete.
Author: Maybe I'm not using quite the right words. The graph can have cycles, but at least one node in the cycle acts as a conditional, either continuing the cycle or exiting it depending on some condition. Is there a better way to describe that?
Collaborator: I think it may be that we just have different mental models for the graph. In my mind, the graph is just defined by task dependencies and its structure is static. Each node in the graph is then associated with a status, and I look for tasks that are available to do and try to switch their status. In this model, iterative task lists are just groupings of tasks in the graph where a task can return iterate, which sets the status of all tasks in the iterative list to incomplete.

TaskList
--------

The ``TaskList`` class implements methods to build and execute a set of
tasks with associated dependencies. The class implements a few
public-facing member functions that provide useful functionality for
downstream apps:

AddTask
~~~~~~~

``AddTask`` is a templated variadic function that takes the task
function to be executed, the task dependencies (see ``TaskID`` below),
and the arguments to the task function as its arguments. All arguments
are captured by value in a lambda for later execution.

When adding functions that are non-static class member functions, a
slightly different interface is required. The first argument should be
the class-name-scoped name of the function. For example, for a function
named ``DoSomething`` in class ``SomeClass``, the first argument would
be ``&SomeClass::DoSomething``. The second argument should be a pointer
to the object that should invoke this member function. Finally, the
dependencies and function arguments should be provided as described
above.
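A minimal sketch of the member-function form described above (``SomeClass``,
``DoSomething``, and the remaining names are placeholders, not from the
Parthenon source):

.. code:: cpp

   SomeClass object;
   // First the class-scoped function name, then a pointer to the invoking
   // object, then the dependencies and the function arguments.
   auto id = task_list.AddTask(&SomeClass::DoSomething, &object,
                               dependencies, arg1, arg2);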

Examples of both ``AddTask`` calls can be found in the advection example
`here <https://github.com/parthenon-hpc-lab/parthenon/blob/develop/example/advection/advection_driver.cpp>`__.

AddIteration
~~~~~~~~~~~~

``AddIteration`` provides a means of grouping a set of tasks together
that will be executed repeatedly until stopping criteria are satisfied.
``AddIteration`` returns an ``IterativeTasks`` object which provides
overloaded ``AddTask`` functions as described above, but internally
handles the bookkeeping necessary to maintain the association of all the
tasks associated with the iterative process. A special function
``SetCompletionTask``, which behaves identically to ``AddTask``, allows
a task to be defined that evaluates the stopping criteria. The maximum
number of iterations can be controlled through the ``SetMaxIterations``
member function and the number of iterations between evaluating the
stopping criteria can be set with the ``SetCheckInterval`` function.
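As a hedged sketch of this (pre-#987) iterative interface, assuming the task
and data names, and treating the exact ``AddIteration`` argument as an
assumption:

.. code:: cpp

   TaskID none;
   auto &solver = task_list.AddIteration("my_solver");
   solver.SetMaxIterations(100);   // hard cap on cycles
   solver.SetCheckInterval(5);     // evaluate stopping criteria every 5 cycles
   auto relax = solver.AddTask(none, DoRelaxation, data);
   auto check = solver.SetCompletionTask(relax, CheckConvergence, data);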

DoAvailable
~~~~~~~~~~~

``DoAvailable`` loops over the task list once, executing all tasks whose
dependencies are satisfied. Completed tasks are removed from the task
list.

TaskID
------

The ``TaskID`` class implements methods that allow Parthenon to keep
track of tasks, their dependencies, and what remains to be completed.
The main way application code will interact with this object is as a
returned object from ``TaskList::AddTask`` and as an argument to
subsequent calls to ``TaskList::AddTask`` as a dependency for other
tasks. When used as a dependency, ``TaskID`` objects can be combined
with the bitwise or operator (``|``) to specify multiple dependencies.
The ``TaskList`` class stores a vector of all the tasks and sublists (nested
``TaskList``s) added to it. Additionally, it stores various bookkeeping
information that facilitates the more advanced features described below. Adding
tasks and sublists is the only way to interact with ``TaskList`` objects.

The basic call to ``AddTask`` takes the task's dependencies, the function to be
executed, and the arguments to the function as its arguments. ``AddTask`` returns
a ``TaskID`` object that can be used in subsequent calls to ``AddTask`` as a
dependency either on its own or combined with other ``TaskID``s via the ``|``
jdolence marked this conversation as resolved.
Show resolved Hide resolved
operator. Use of the ``|`` operator is historical and perhaps a bit misleading as
it really acts as a logical and -- that is, all tasks combined with ``|`` must be
complete before the dependencies are satisfied. An overload of ``AddTask`` takes
a ``TaskQualifier`` object as the first argument which specifies certain special,
non-default behaviors. These will be described below. Note that the default
constructor of ``TaskID`` produces a special object that when passed into
``AddTask`` signifies that the task has no dependencies.
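To make the calling pattern concrete, a minimal sketch (the function and
variable names are illustrative assumptions, not from Parthenon):

.. code:: cpp

   TaskID none;
   auto finish_a = list.AddTask(none, FunctionA, arg1);
   auto finish_b = list.AddTask(none, FunctionB, arg1);
   // FunctionC runs only after both A and B complete; "|" acts as a logical and.
   auto finish_c = list.AddTask(finish_a | finish_b, FunctionC, arg2);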

The ``AddSublist`` function adds a nested ``TaskList`` to the ``TaskList`` on
which it is called. The principal use case for this is to add iterative cycles
to the graph, allowing one to execute a series of tasks repeatedly until some
criteria are satisfied. The call takes as arguments the dependencies (via
``TaskID``s combined with ``|``) that must be complete before the sublist
executes and a ``std::pair<int, int>`` specifying the minimum
and maximum number of times the sublist should execute. Passing something like
``{min_iters, max_iters}`` as the second argument should suffice, with ``{1, 1}``
leading to a sublist that never cycles. ``AddSublist``
returns a ``std::pair<TaskList&, TaskID>`` which is conveniently accessed via
a structured binding, e.g.

.. code:: cpp

   TaskID none;
   auto [child_list, child_list_id] = parent_list.AddSublist(dependencies, {1, 3});
   auto task_id = child_list.AddTask(none, SomeFunction, arg1, arg2);

In the above example, passing ``none`` as the dependency for the task added to
``child_list`` does not imply that this task can execute at any time since
``child_list`` itself has dependencies that must be satisfied before any of its
tasks can be invoked.

TaskRegion
----------

``TaskRegion`` is a lightweight class that wraps
``std::vector<TaskList>``, providing a little extra functionality.
During task execution (described below), all task lists in a
``TaskRegion`` can be operated on concurrently. For example, a
``TaskRegion`` can be used to construct independent task lists for each
``MeshBlock``. Occasionally, it is useful to have a task not be
considered complete until that task completes in all lists of a region.
For example, a global iterative solver cannot be considered complete
until the stopping criteria are satisfied everywhere, which may require
evaluating those criteria in tasks that live in different lists within a
region. An example of this use case is
shown `here <https://github.com/parthenon-hpc-lab/parthenon/blob/develop/example/poisson/poisson_driver.cpp>`__. The mechanism
to mark a task so that dependent tasks will wait until all lists have
completed it is to call ``AddRegionalDependencies``, as shown in the
Poisson example.
Under the hood, a ``TaskRegion`` is a directed, possibly cyclic graph. The graph
is built up incrementally as tasks are added to the ``TaskList``s within the
``TaskRegion``, and its construction is completed the first time it is
executed. ``TaskRegion``s can have one or more ``TaskList``s. The primary reason
for this is to allow flexibility in how work is broken up into tasks (and
eventually kernels). A region with many lists will produce many small
tasks/kernels, but may expose more asynchrony (e.g. MPI communication). A region
with fewer lists will produce more work per kernel (which may be good for GPUs,
for example), but may limit asynchrony. Typically, each list is tied to a unique
partition of the mesh blocks owned by a rank. ``TaskRegion`` only provides a few
public-facing functions:

- ``TaskListStatus Execute(ThreadPool &pool)``: ``TaskRegion``s can be executed, requiring a
``ThreadPool`` be provided by the caller. In practice, ``Execute`` is usually
called from the ``Execute`` member function of ``TaskCollection``.
- ``TaskList& operator[](const int i)``: return a reference to the ``i``th
``TaskList`` in the region.
- ``size_t size()``: return the number of ``TaskList``s in the region.
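
A short usage sketch of these accessors (the pool size and loop body are
assumptions for illustration):

.. code:: cpp

   TaskRegion &region = collection.AddRegion(2); // two task lists
   for (std::size_t i = 0; i < region.size(); ++i) {
     TaskList &tl = region[i];
     // ... add tasks to tl ...
   }
   ThreadPool pool(1); // currently one thread, pending full thread safety
   auto status = region.Execute(pool);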

TaskCollection
--------------
@@ -120,21 +119,52 @@

is shown below.
.. figure:: figs/TaskDiagram.png
:alt: Task Diagram

``TaskCollection`` provides two member functions, ``AddRegion`` and
``Execute``.

AddRegion
~~~~~~~~~

``AddRegion`` simply adds a new ``TaskRegion`` to the back of the
collection and returns it as a reference. The integer argument
determines how many task lists make up the region.

Execute
~~~~~~~

Calling the ``Execute`` method on the ``TaskCollection`` executes all
the tasks that have been added to the collection, processing each
``TaskRegion`` in the order they were added, and allowing tasks in
different ``TaskList``\ s but the same ``TaskRegion`` to be executed
concurrently.
``TaskCollection`` provides a few
public-facing functions:

- ``TaskRegion& AddRegion(const int num_lists)``: Add and return a reference to
a new ``TaskRegion`` with the specified number of ``TaskList``s.
- ``TaskListStatus Execute(ThreadPool &pool)``: Execute all regions in the
collection. Regions are executed completely, in the order they were added,
before moving on to the next region. Task execution will take advantage of
the provided ``ThreadPool`` to (possibly) execute tasks across ``TaskList``s
in each region concurrently.
- ``TaskListStatus Execute()``: Same as above, but execution will use an
internally generated ``ThreadPool`` with a single thread.

NOTE: Work remains to make the rest of
Parthenon thread-safe, so it is currently required to use a ``ThreadPool``
with one thread.
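
Putting the pieces together, a hedged end-to-end sketch (``DoWork``,
``partitions``, and ``num_partitions`` are illustrative assumptions):

.. code:: cpp

   TaskCollection tc;
   TaskRegion &region = tc.AddRegion(num_partitions);
   for (int i = 0; i < num_partitions; ++i) {
     TaskList &tl = region[i];
     TaskID none;
     auto work = tl.AddTask(none, DoWork, partitions[i]);
   }
   // Executes region by region, in the order regions were added,
   // using a single-threaded internally generated ThreadPool.
   TaskListStatus status = tc.Execute();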

TaskQualifier
-------------

``TaskQualifier``s provide a mechanism for downstream codes to alter the default
behavior of specific tasks in certain ways. The qualifiers are described below:

- ``TaskQualifier::local_sync``: Tasks marked with ``local_sync`` synchronize across
lists in a region on a given MPI rank. Tasks that depend on a ``local_sync``
marked task gain dependencies from the corresponding task on all lists within
a region. A typical use for this qualifier is to do a rank-local reduction, for
example before initiating a global MPI reduction (which should be done only once
per rank, not once per ``TaskList``). Note that Parthenon links tasks across
lists in the order they are added to each list, i.e. the ``n``th ``local_sync`` task
in a list is assumed to be associated with the ``n``th ``local_sync`` task in all
lists in the region.
- ``TaskQualifier::global_sync``: Tasks marked with ``global_sync`` implicitly have
the same semantics as ``local_sync``, but additionally do a global reduction on the
``TaskStatus`` to determine if/when execution can proceed on to dependent tasks.

[Review discussion on ``global_sync``]
Collaborator: Is there ever a use case for global_sync? It seems like there will always be some other MPI communication (like a reduction) associated with any point in the task list that requires all ranks to be at the same point.
Author: I think so. The important distinction of global_sync is that it does a reduction on the statuses across all MPI ranks to determine if a task is complete or not. Technically this may not be required if all MPI ranks are guaranteed to return the same status, as would be the case if they all checked some condition on some globally reduced quantity. Given that, we can probably avoid using global_sync in some places where one might naively think that a reduction on statuses is required. But I can imagine use cases where different ranks evaluate different things and the global reduction of statuses would be required.
Collaborator: My thought is that if we want to sync all ranks, there is always going to be some information that gets communicated between them (more than just that they are done). Maybe I am not being imaginative enough, though.
Author: Let me try to make something up quickly. Say we're pushing particles around that can move across many cells in a given time step. One algorithm you could code up would have every rank push all the particles that live on its blocks until the end of the time step, or until they leave the rank's domain (at which point it sends them to the right neighbor). You could write this as an iterative thing with the completion check being a check of whether all particles on the rank have reached the end of the time step. Then the completion check is totally local to the rank, but you need to continue iterating globally until everybody agrees things are done.
Collaborator: Sure, but that could also just be written as a reduction to get the total number of active particles in the simulation.
- ``TaskQualifier::completion``: Tasks marked with ``completion`` can lead to exiting
execution of the owning ``TaskList``. If these tasks return ``TaskStatus::complete``
and the minimum number of iterations of the list have been completed, the remainder
of the task list will be skipped (or the iteration stopped). Returning
``TaskStatus::iterate`` leads to continued execution/iteration, unless the maximum
number of iterations has been reached.
- ``TaskQualifier::once_per_region``: Tasks with the ``once_per_region`` qualifier
will only execute once (per iteration, if relevant) regardless of the number of
``TaskList``s in the region. This can be useful when, for example, doing MPI
reductions, printing out some rank-wide state, or calling a ``completion`` task
that depends on some global condition where all lists would evaluate identical code.

``TaskQualifier``s can be combined via the ``|`` operator and all combinations are
supported. For example, you might mark a task ``global_sync | completion | once_per_region``
if it were a task to determine whether an iteration should continue that depended
on some previously reduced quantity.
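
That combination can be sketched as follows (the task and variable names are
assumptions, not from the Parthenon source):

.. code:: cpp

   using TQ = TaskQualifier;
   // Sum a rank-local residual across lists, then decide (once per rank,
   // with a global status reduction) whether the iteration should continue.
   auto local = tl.AddTask(TQ::local_sync, prev, SumLocalResidual, md);
   auto check = tl.AddTask(TQ::global_sync | TQ::completion | TQ::once_per_region,
                           local, CheckConvergence, md);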