-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Breaking Change] Tasking rewrite #987
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much cleaner than the old tasking machinery. Some things I like:
- The graph walking is clear and concise, and dependencies are much easier to reasona bout.
- The fact that sub lists are recursive means iterations and sublists and regional dependencies are all easier to reason about.
- Threads!!!!
My one concern would be that walking the graph might be expensive in practice, but I suspect that it's not, compared to everything else we're doing. Especially since the task list can now be built once and not rebuilt (which is something that should probably be default behavior.
enum class TaskStatus { fail, complete, incomplete, iterate, skip, waiting }; | ||
enum class TaskStatus { complete, incomplete, iterate }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
std::lock_guard<std::mutex> lock(mutex); | ||
std::queue<T>().swap(queue); | ||
complete = true; | ||
exit = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of shocked exit
is not a protected name.
TaskID GetID() { return this; } | ||
bool ready() { | ||
// check that no dependency is incomplete | ||
bool go = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool go = true; | |
// set... | |
bool go = true; |
ready... set... go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
Co-authored-by: Jonah Miller <[email protected]>
@pgrete, I'd like to get this merged ASAP. A couple different codes want to use the new mechanism to express iterations. Think you can review soon? Should we just merge? How should we proceed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the interface cleanup (and infrastructure below) and sorry for the delay (I lost track of this PR after I reviewed the other after the sync).
I only left some clarifying questions.
Otherwise, a more general question would whether the new execute machinery benefits from the threaded implementation as is, or, in other words if we don't support that feature anyway, could this functionality be stashed until it'll actually be used so that there's less "to-be-used-in-the-future(-but-maintained-already-now)" code?
using namespace utils; | ||
auto &md = pmesh->mesh_data.GetOrAdd("base", i); | ||
TaskID none; | ||
auto &md = pmesh->mesh_data.GetOrAdd("base", partition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flagging #959 again.
I'm actually starting to wonder if we should remove support (i.e., make it technically impossible) to vary the partition size as quite a few place are now making some implicit assumption through using GetOrAdd
.
PS: This is independent of this PR.
if (num_calls < exec_limits.first && status == TaskStatus::complete) | ||
status = TaskStatus::iterate; | ||
// enforce maximum number of iterations | ||
if (num_calls == exec_limits.second) status = TaskStatus::complete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to have another status like ::ran_out_of_iterations_before_X
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably. I suppose that would allow adding some kind of option that would warn or abort if an iteration failed to properly complete. That seems like a good idea. I'll either add an issue so I can deal with this later or add the feature if it looks simple enough.
PARTHENON_REQUIRE_THROWS(pool.size() == 1, | ||
"ThreadPool size != 1 is not currently supported.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for being explicit
@pgrete I think there's a chance the current threading may already benefit my use case, because I spend a lot of CPU time building variable lists & packs, so there's the potential that any reduced synchronicity speeds things up for me in at least some cases. |
The threads currently provide no value, but I've already started working on making other parts of parthenon thread-safe to enable nthreads != 1, so I'm hoping this code won't sit around for long in it's current state. Given that, I'd rather not do the extra work of removing here just to add back very soon. It's a fair point/question though -- I'll try to be quick about getting threads working more fully. |
That's actually an interesting/important use-case/questions: Is our packing machinery thread safe?
No worries. I think as long we have the handbrakes on (enforcing |
Not sure why the status for CI are not reported. I re-triggering. |
CI failed and looks like a legit fail. |
Looks that way, I think. Haven't been able to reproduce though... |
PR Summary
[Breaking Change]
The API for representing iterations through our tasks has changed.
AddIteration
is replaced withAddSublist
and the arguments and return values are a little different. Additionally,AddCompletionTask
is replaced withAddTask
withTaskQualifier
s to specify that it is a completion task (and whatever other qualifiers are appropriate).AddRegionalDependencies
is similarly replaced by providing the appropriateTaskQualifier
toAddTask
.It's high time we cleaned up the tasking code. I take a crack at that in this PR. For the most part, this should be compatible with downstream codes, but it will be breaking for anyone using
IterativeTasks
. Some basics:TaskID
isIterativeTasks
in favor ofTaskList
s owningTaskList
sFun new features:
TaskCollection
Execute
function and tasks are launched in parallel across the threads in the pool. If you pass no pool toExecute
, it just constructs a pool with one thread, so should execute more or less how it did before.TaskCollection/TaskRegion/TaskList
is no longer self-destructive. This allows for a model where a task collection is built once and executed many times.IterativeTasks
is now more flexibile and the interface is cleaner.TaskList::AddSublist
function.{1,1}
means it's just a sublist that won't iterate.TaskQualifier
specifier. You can combine the qualifiers below with the|
operator, just like for dependencies. The interesting ones arelocal_sync
: this replaces the old regional dependencies stuff. It says that dependent tasks can't execute until all the lists in your rank's region have executed this taskglobal_sync
: this says that all tasks across all ranks and lists must have executed before dependent tasks can executecompletion
: allows for tasks that conditionally continue execution of the list/cycle back to the beginning (TaskStatus::iterate
) or abort execution of the rest of the list/stop iterating (TaskStatus::complete
). The min/max number of iterations for a list are always honored regardless of the status returned by completion tasks.once_per_region
: for a region withN
partitions, a tasked marked with this qualifier will only execute a single time instead ofN
times as usual.I don't add any specific tests because all of the functionality is well tested in our current regression tests. The only exception is multithreaded task execution. For now, this feature shouldn't be used anyway because Parthenon has not been made thread-safe everywhere it needs to be. There is a note to this effect in the docs. Also, just calling
Execute()
on aTaskCollection
(as everyone is presumably doing downstream) leads to single-threaded execution.PR Checklist