Skip to content

Commit

Permalink
Fix remaining swarm D->H->D copies (#1145)
Browse files Browse the repository at this point in the history
* working

* Clean up implicit captures

* OK something isnt working with updating empty_indices

* Creating indices works but defragmentation is broken

* Defrag works

* Switch to persistent scratch memory

* Clean up

* Fix GPU issues

* Fix compile error by cleaning up code

* Remove unnecessary check against non-null user swarm BCs

* Remove unused function

* Formatting

* Fiddle with send logic

* Perform swarm boundary logic on device (#1154)

* First kernel down

* Further cleanup

* Notes

* Send seems to work

* Need to promote particles example to latest pattern

* May have fixed particles example

* cycles...

* iterative tasking is only iterating on one meshblock in particles example...

* Still not working...

* New loop seems to work

* Cleaned up some printfs/marked unused code for deletion

* New algorithm is cycling, time to clean up

* Fixed indexing bug...

* Clean up

* finished_transport shouldnt be provided by swarm

* Reverting to old manual iterative tasking

* Still working...

* Starting to make progress...

* Cleaned up

* format

* A few leftover print statements

* Oops ParArray1D isn't a host array when compiled for device

* implicit this->

* bug in nrecvd particles with 1 particle received...

* Found the bug

* Fixed bug, cleaned up

* kokkos parallel_scan -> par_scan

* Fix types, clean up code

* Add warning if using swarms but Real != double

* Cleanup from debugging

---------

Co-authored-by: Luke Roberts <[email protected]>
  • Loading branch information
brryan and lroberts36 authored Aug 28, 2024
1 parent e0bd43b commit 517bf92
Show file tree
Hide file tree
Showing 14 changed files with 338 additions and 468 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@
- [[PR 1004]](https://github.com/parthenon-hpc-lab/parthenon/pull/1004) Allow parameter modification from an input file for restarts

### Fixed (not changing behavior/API/variables/...)
- [[PR 1145]](https://github.com/parthenon-hpc-lab/parthenon/pull/1145) Fix remaining swarm D->H->D copies
- [[PR 1150]](https://github.com/parthenon-hpc-lab/parthenon/pull/1150) Reduce memory consumption for buffer pool
- [[PR 1146]](https://github.com/parthenon-hpc-lab/parthenon/pull/1146) Fix an issue outputting >4GB single variables per rank
- [[PR 1152]](https://github.com/parthenon-hpc-lab/parthenon/pull/1152) Fix memory leak in task graph outputs related to `abi::__cxa_demangle`
Expand Down
6 changes: 2 additions & 4 deletions example/particles/parthinput.particles
Original file line number Diff line number Diff line change
Expand Up @@ -24,10 +24,8 @@ refinement = none
nx1 = 16
x1min = -0.5
x1max = 0.5
ix1_bc = user
ox1_bc = user
# ix1_bc = periodic # Optionally use periodic boundary conditions everywhere
# ox1_bc = periodic
ix1_bc = periodic
ox1_bc = periodic

nx2 = 16
x2min = -0.5
Expand Down
180 changes: 82 additions & 98 deletions example/particles/particles.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -340,8 +340,7 @@ TaskStatus CreateSomeParticles(MeshBlock *pmb, const double t0) {
return TaskStatus::complete;
}

TaskStatus TransportParticles(MeshBlock *pmb, const StagedIntegrator *integrator,
const double t0) {
TaskStatus TransportParticles(MeshBlock *pmb, const double t0, const double dt) {
PARTHENON_INSTRUMENT

auto swarm = pmb->meshblock_data.Get()->GetSwarmData()->Get("my_particles");
Expand All @@ -350,8 +349,6 @@ TaskStatus TransportParticles(MeshBlock *pmb, const StagedIntegrator *integrator

int max_active_index = swarm->GetMaxActiveIndex();

Real dt = integrator->dt;

auto &t = swarm->Get<Real>("t").Get();
auto &x = swarm->Get<Real>(swarm_position::x::name()).Get();
auto &y = swarm->Get<Real>(swarm_position::y::name()).Get();
Expand Down Expand Up @@ -469,97 +466,31 @@ TaskStatus TransportParticles(MeshBlock *pmb, const StagedIntegrator *integrator
// Custom step function to allow for looping over MPI-related tasks until complete
TaskListStatus ParticleDriver::Step() {
TaskListStatus status;
integrator.dt = tm.dt;

PARTHENON_REQUIRE(integrator.nstages == 1,
"Only first order time integration supported!");

BlockList_t &blocks = pmesh->block_list;
auto num_task_lists_executed_independently = blocks.size();

// Create all the particles that will be created during the step
status = MakeParticlesCreationTaskCollection().Execute();
PARTHENON_REQUIRE(status == TaskListStatus::complete,
"ParticlesCreation task list failed!");

// Loop over repeated MPI calls until every particle is finished. This logic is
// required because long-distance particle pushes can lead to a large, unpredictable
// number of MPI sends and receives.
bool particles_update_done = false;
while (!particles_update_done) {
status = MakeParticlesUpdateTaskCollection().Execute();

particles_update_done = true;
for (auto &block : blocks) {
// TODO(BRR) Despite this "my_particles"-specific call, this function feels like it
// should be generalized
auto swarm = block->meshblock_data.Get()->GetSwarmData()->Get("my_particles");
if (!swarm->finished_transport) {
particles_update_done = false;
}
}
}
// Transport particles iteratively until all particles reach final time
status = IterativeTransport();
// status = MakeParticlesTransportTaskCollection().Execute();
PARTHENON_REQUIRE(status == TaskListStatus::complete,
"IterativeTransport task list failed!");

// Use a more traditional task list for predictable post-MPI evaluations.
status = MakeFinalizationTaskCollection().Execute();
PARTHENON_REQUIRE(status == TaskListStatus::complete, "Finalization task list failed!");

return status;
}

// TODO(BRR) This should really be in parthenon/src... but it can't just live in Swarm
// because of the loop over blocks
TaskStatus StopCommunicationMesh(const BlockList_t &blocks) {
PARTHENON_INSTRUMENT

int num_sent_local = 0;
for (auto &block : blocks) {
auto sc = block->meshblock_data.Get()->GetSwarmData();
auto swarm = sc->Get("my_particles");
swarm->finished_transport = false;
num_sent_local += swarm->num_particles_sent_;
}

int num_sent_global = num_sent_local; // potentially overwritten by following Allreduce
#ifdef MPI_PARALLEL
for (auto &block : blocks) {
auto swarm = block->meshblock_data.Get()->GetSwarmData()->Get("my_particles");
for (int n = 0; n < block->neighbors.size(); n++) {
NeighborBlock &nb = block->neighbors[n];
// TODO(BRR) May want logic like this if we have non-blocking TaskRegions
// if (nb.snb.rank != Globals::my_rank) {
// if (swarm->vbswarm->bd_var_.flag[nb.bufid] != BoundaryStatus::completed) {
// return TaskStatus::incomplete;
// }
//}

// TODO(BRR) May want to move this logic into a per-cycle initialization call
if (swarm->vbswarm->bd_var_.flag[nb.bufid] == BoundaryStatus::completed) {
swarm->vbswarm->bd_var_.req_send[nb.bufid] = MPI_REQUEST_NULL;
}
}
}

MPI_Allreduce(&num_sent_local, &num_sent_global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
#endif // MPI_PARALLEL

if (num_sent_global == 0) {
for (auto &block : blocks) {
auto &pmb = block;
auto sc = pmb->meshblock_data.Get()->GetSwarmData();
auto swarm = sc->Get("my_particles");
swarm->finished_transport = true;
}
}

// Reset boundary statuses
for (auto &block : blocks) {
auto &pmb = block;
auto sc = pmb->meshblock_data.Get()->GetSwarmData();
auto swarm = sc->Get("my_particles");
for (int n = 0; n < pmb->neighbors.size(); n++) {
auto &nb = block->neighbors[n];
swarm->vbswarm->bd_var_.flag[nb.bufid] = BoundaryStatus::waiting;
}
}

return TaskStatus::complete;
}

TaskCollection ParticleDriver::MakeParticlesCreationTaskCollection() const {
TaskCollection tc;
TaskID none(0);
Expand All @@ -577,40 +508,93 @@ TaskCollection ParticleDriver::MakeParticlesCreationTaskCollection() const {
return tc;
}

TaskCollection ParticleDriver::MakeParticlesUpdateTaskCollection() const {
TaskStatus CountNumSent(const BlockList_t &blocks, const double tf_, bool *done) {
int num_unfinished = 0;
for (auto &block : blocks) {
auto sc = block->meshblock_data.Get()->GetSwarmData();
auto swarm = sc->Get("my_particles");
int max_active_index = swarm->GetMaxActiveIndex();

auto &t = swarm->Get<Real>("t").Get();

auto swarm_d = swarm->GetDeviceContext();

const auto &tf = tf_;

parthenon::par_reduce(
PARTHENON_AUTO_LABEL, 0, max_active_index,
KOKKOS_LAMBDA(const int n, int &num_unfinished) {
if (swarm_d.IsActive(n)) {
if (t(n) < tf) {
num_unfinished++;
}
}
},
Kokkos::Sum<int>(num_unfinished));
}

#ifdef MPI_PARALLEL
MPI_Allreduce(MPI_IN_PLACE, &num_unfinished, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
#endif // MPI_PARALLEL

if (num_unfinished > 0) {
*done = false;
} else {
*done = true;
}

return TaskStatus::complete;
}

TaskCollection ParticleDriver::IterativeTransportTaskCollection(bool *done) const {
TaskCollection tc;
TaskID none(0);
const double t0 = tm.time;
const BlockList_t &blocks = pmesh->block_list;
const int nblocks = blocks.size();
const double t0 = tm.time;
const double dt = tm.dt;

auto num_task_lists_executed_independently = blocks.size();

TaskRegion &async_region0 = tc.AddRegion(num_task_lists_executed_independently);
for (int i = 0; i < blocks.size(); i++) {
TaskRegion &async_region = tc.AddRegion(nblocks);
for (int i = 0; i < nblocks; i++) {
auto &pmb = blocks[i];

auto &sc = pmb->meshblock_data.Get()->GetSwarmData();
auto &tl = async_region[i];

auto &tl = async_region0[i];

auto transport_particles =
tl.AddTask(none, TransportParticles, pmb.get(), &integrator, t0);

auto send = tl.AddTask(transport_particles, &SwarmContainer::Send, sc.get(),
BoundaryCommSubset::all);
auto transport = tl.AddTask(none, TransportParticles, pmb.get(), t0, dt);
auto reset_comms =
tl.AddTask(transport, &SwarmContainer::ResetCommunication, sc.get());
auto send =
tl.AddTask(reset_comms, &SwarmContainer::Send, sc.get(), BoundaryCommSubset::all);
auto receive =
tl.AddTask(send, &SwarmContainer::Receive, sc.get(), BoundaryCommSubset::all);
}

TaskRegion &sync_region0 = tc.AddRegion(1);
TaskRegion &sync_region = tc.AddRegion(1);
{
auto &tl = sync_region0[0];
auto stop_comm = tl.AddTask(none, StopCommunicationMesh, blocks);
auto &tl = sync_region[0];
auto check_completion = tl.AddTask(none, CountNumSent, blocks, t0 + dt, done);
}

return tc;
}

// TODO(BRR) to be replaced by iterative tasklist machinery
TaskListStatus ParticleDriver::IterativeTransport() const {
TaskListStatus status;
bool transport_done = false;
int n_transport_iter = 0;
int n_transport_iter_max = 1000;
while (!transport_done) {
status = IterativeTransportTaskCollection(&transport_done).Execute();

n_transport_iter++;
PARTHENON_REQUIRE(n_transport_iter < n_transport_iter_max,
"Too many transport iterations!");
}

return status;
}

TaskCollection ParticleDriver::MakeFinalizationTaskCollection() const {
TaskCollection tc;
TaskID none(0);
Expand Down
4 changes: 3 additions & 1 deletion example/particles/particles.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,9 @@ class ParticleDriver : public EvolutionDriver {
ParticleDriver(ParameterInput *pin, ApplicationInput *app_in, Mesh *pm)
: EvolutionDriver(pin, app_in, pm), integrator(pin) {}
TaskCollection MakeParticlesCreationTaskCollection() const;
TaskCollection MakeParticlesUpdateTaskCollection() const;
TaskCollection MakeParticlesTransportTaskCollection() const;
TaskListStatus IterativeTransport() const;
TaskCollection IterativeTransportTaskCollection(bool *done) const;
TaskCollection MakeFinalizationTaskCollection() const;
TaskListStatus Step();

Expand Down
7 changes: 7 additions & 0 deletions src/interface/state_descriptor.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -452,6 +452,13 @@ StateDescriptor::CreateResolvedStateDescriptor(Packages_t &packages) {
field_tracker.CategorizeCollection(name, field_dict, &field_provider);
swarm_tracker.CategorizeCollection(name, package->AllSwarms(), &swarm_provider);

if (!package->AllSwarms().empty() && !std::is_same<Real, double>::value) {
PARTHENON_WARN(
"Swarms always use Real precision, even for ParticleVariables containing "
"time data, while Parthenon time variables are fixed to double precision. This "
"may cause inaccurate comparisons with cycle beginning and end times.")
}

// Add package registered boundary conditions
for (int i = 0; i < 6; ++i)
state->UserBoundaryFunctions[i].insert(state->UserBoundaryFunctions[i].end(),
Expand Down
Loading

0 comments on commit 517bf92

Please sign in to comment.