Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory consumption for buffer pool #1150

Merged
merged 7 commits into from
Aug 22, 2024
Merged

Conversation

pgrete
Copy link
Collaborator

@pgrete pgrete commented Aug 8, 2024

PR Summary

Tries to reduce the memory footprint of the object pool by precalculating the number of buffers required for each type.
Also fixes an original overflow for very large buffer sizes (and the default 200 number of buffers).

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@pgrete
Copy link
Collaborator Author

pgrete commented Aug 8, 2024

@lroberts36 I came up with a different approach (not using a user input variable and keeping the allocations in large chunks).
Is there anything wrong with this approach (i.e., is sth going wrong in extending an existing pool this way)?
I haven't tested this yet and just want to make sure this is conceptually possible given the object pool approach.

@lroberts36
Copy link
Collaborator

@pgrete: I think this will work but it is maybe not ideal. First, I think this counts all buffers even for variables that are not allocated, so when running with pack size = -1 this will allocate buffers for every field wether or not it is allocated. Second, the number of buffers allocated will depend on the partition of the block list since this routine operates per MeshData.

One other thing I remembered is that the buffer sizes are not calculated as accurately as they could be. Rather than calculate the actual index range, it just uses a heuristic for the size that is guaranteed to be larger but is possibly too large by a factor of a couple.

@pgrete
Copy link
Collaborator Author

pgrete commented Aug 9, 2024

@pgrete: I think this will work but it is maybe not ideal. First, I think this counts all buffers even for variables that are not allocated, so when running with pack size = -1 this will allocate buffers for every field wether or not it is allocated. Second, the number of buffers allocated will depend on the partition of the block list since this routine operates per MeshData.

How about the updated version? I now only count buffers for allocated variables. I reverted again to count buffer for all vars as IIRC we always allocated the comms (as they're also used to communicate the allocations for sparse vars, or did this change?)

Regarding the second point, I don't think this is going to be a huge issue in practice because the extreme case with pack_size=few and many (small) blocks per rank is only occurring on host runs for which memory allocation are much less of an issue than on device runs.
Similarly, in cases with many buffers I can imagine that even the 200 could have easily been exhausted so that all subsequent buffer requests are serial (which now remain serial in the worst case -- with slight overhead of additional initial allocations but fewer than 200 given that I think it's safe to assume that a few buffers always have the same size in practice -- and in the best case even reduce the overhead as more buffers are created in one go).

Finally, I think that this is generally a tricky issue to optimize, which is why I'm trying to go for a default that covers most use cases and not introducing a parameter that the (inexperienced) end user might easily miss (or misconfigure).

What do you think?

One other thing I remembered is that the buffer sizes are not calculated as accurately as they could be. Rather than calculate the actual index range, it just uses a heuristic for the size that is guaranteed to be larger but is possibly too large by a factor of a couple.

I don't think it's an issue. The differences should be small-ish (especially for large block sizes and for very small block sizes the overhead in terms of active to passive cells is huge anyway).

@pgrete
Copy link
Collaborator Author

pgrete commented Aug 9, 2024

Okay, I need more eyes/input here as I don't understand what's going on.
The current code crashes.
The stack trace points back to line 75 (the lambda function that is in the pool_map, which eventually goes back through some object_pool Get() to the buf.Allocate() call in line 155 of bvals_utils.hpp.
How are those related? I'm starting to wonder if my picture of how the object pools work is completely off.
Also, if I count the number of buffers required twice (i.e. doing nbufs[buf_size] += 2) everything works as usual again.

@pgrete
Copy link
Collaborator Author

pgrete commented Aug 12, 2024

Alright, I think I figured out (and understand) more now.
The original BuildBoundaryBufferSubset never actually added objects to the pool, it only created a pool with a callback lambda that would eventually add objects to the pool dynamically, right?

So I updated the logic, that the pool creation remains (with a default number of buffer to be dynamically being added scaled by the number of packs per rank) and added routine to immediately add object to the pools (matching the required number of buffers for the given meshdata container).

For the ./example/advection/parthinput.advection this now means at

  • pack_size=-1:
Setting up a new pool for buffers of size 18
Reserving 64 new buffers of size 18 to pool with 0 buffers because 64 are required in total.
Setting up a new pool for buffers of size 126
Reserving 64 new buffers of size 126 to pool with 0 buffers because 64 are required in total.
Reserving 32 new buffers of size 18 to pool with 64 buffers because 96 are required in total.
Reserving 56 new buffers of size 126 to pool with 64 buffers because 120 are required in total.
Setting up a new pool for buffers of size 378
Reserving 16 new buffers of size 378 to pool with 0 buffers because 16 are required in total.
Reserving 208 new buffers of size 18 to pool with 96 buffers because 304 are required in total.
Reserving 256 new buffers of size 126 to pool with 120 buffers because 376 are required in total.
Reserving 32 new buffers of size 378 to pool with 16 buffers because 48 are required in total.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
...
This run is a restart: 0

cycle=0 time=0.0000000000000000e+00 dt=1.7578125000000000e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.17e-02 wsec_step=2.72e+00 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Number of Trees = 16
Total number of MeshBlocks = 88
Number of physical refinement levels = 2
Number of logical  refinement levels = 2
  Physical level = 0 (logical level = 0): 4 MeshBlocks, cost = 4
  Physical level = 1 (logical level = 1): 36 MeshBlocks, cost = 36
  Physical level = 2 (logical level = 2): 48 MeshBlocks, cost = 48
--------------------------------------------------------------------
cycle=1 time=1.7578125000000000e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.66e+06 wsec_total=2.53e-02 wsec_step=1.36e-02 zone-cycles/wsec=1.66e+06 wsec_AMR=3.88e-06
cycle=2 time=3.5156250000000001e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=3.72e+06 wsec_total=3.14e-02 wsec_step=6.06e-03 zone-cycles/wsec=3.72e+06 wsec_AMR=2.81e-06
Reserving 24 new buffers of size 18 to pool with 304 buffers because 328 are required in total.
Reserving 24 new buffers of size 126 to pool with 376 buffers because 400 are required in total.
cycle=3 time=5.2734375000000003e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=3.30e+06 wsec_total=4.42e-02 wsec_step=6.83e-03 zone-cycles/wsec=1.76e+06 wsec_AMR=5.99e-03
...
cycle=565 time=9.9316406249998979e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=2.51e+06 wsec_total=6.96e+00 wsec_step=1.27e-02 zone-cycles/wsec=2.48e+06 wsec_AMR=1.38e-04
Reserving 20 new buffers of size 18 to pool with 460 buffers because 480 are required in total.
Reserving 8 new buffers of size 126 to pool with 532 buffers because 540 are required in total.
cycle=566 time=9.9492187499998974e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=2.55e+06 wsec_total=6.98e+00 wsec_step=1.25e-02 zone-cycles/wsec=1.37e+06 wsec_AMR=1.07e-02
  • pack_size=1
Setting up a new pool for buffers of size 18
Reserving 4 new buffers of size 18 to pool with 0 buffers because 4 are required in total.
Setting up a new pool for buffers of size 126
Reserving 4 new buffers of size 126 to pool with 0 buffers because 4 are required in total.
Dynamically adding 16 buffers of size 18 to a pool with current size 4 and future size 20
Dynamically adding 16 buffers of size 126 to a pool with current size 4 and future size 20
Dynamically adding 16 buffers of size 18 to a pool with current size 20 and future size 36
Dynamically adding 16 buffers of size 126 to a pool with current size 20 and future size 36
Dynamically adding 16 buffers of size 18 to a pool with current size 36 and future size 52
Dynamically adding 16 buffers of size 126 to a pool with current size 36 and future size 52
Dynamically adding 16 buffers of size 18 to a pool with current size 52 and future size 68
Dynamically adding 16 buffers of size 126 to a pool with current size 52 and future size 68
Setting up a new pool for buffers of size 378
Reserving 2 new buffers of size 378 to pool with 0 buffers because 2 are required in total.
Dynamically adding 16 buffers of size 126 to a pool with current size 68 and future size 84
Dynamically adding 16 buffers of size 126 to a pool with current size 84 and future size 100
Dynamically adding 16 buffers of size 18 to a pool with current size 68 and future size 84
...
Dynamically adding 16 buffers of size 126 to a pool with current size 372 and future size 388
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
...
This run is a restart: 0

cycle=0 time=0.0000000000000000e+00 dt=1.7578125000000000e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.47e-02 wsec_step=2.73e+00 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Number of Trees = 16
Total number of MeshBlocks = 88
Number of physical refinement levels = 2
Number of logical  refinement levels = 2
  Physical level = 0 (logical level = 0): 4 MeshBlocks, cost = 4
  Physical level = 1 (logical level = 1): 36 MeshBlocks, cost = 36
  Physical level = 2 (logical level = 2): 48 MeshBlocks, cost = 48
--------------------------------------------------------------------
Dynamically adding 28 buffers of size 378 to a pool with current size 4 and future size 32
Dynamically adding 28 buffers of size 378 to a pool with current size 32 and future size 60
cycle=1 time=1.7578125000000000e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=9.20e+05 wsec_total=3.92e-02 wsec_step=2.45e-02 zone-cycles/wsec=9.20e+05 wsec_AMR=4.52e-06
cycle=2 time=3.5156250000000001e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.48e+06 wsec_total=5.44e-02 wsec_step=1.52e-02 zone-cycles/wsec=1.48e+06 wsec_AMR=5.91e-06
Dynamically adding 16 buffers of size 18 to a pool with current size 308 and future size 324
Dynamically adding 16 buffers of size 126 to a pool with current size 388 and future size 404
Dynamically adding 16 buffers of size 18 to a pool with current size 324 and future size 340
cycle=3 time=5.2734375000000003e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.47e+06 wsec_total=7.82e-02 wsec_step=1.53e-02 zone-cycles/wsec=9.48e+05 wsec_AMR=8.48e-03
...
cycle=565 time=9.9316406249998979e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.37e+06 wsec_total=2.00e+01 wsec_step=2.31e-02 zone-cycles/wsec=1.36e+06 wsec_AMR=2.34e-04
Dynamically adding 16 buffers of size 18 to a pool with current size 468 and future size 484
Dynamically adding 16 buffers of size 126 to a pool with current size 532 and future size 548
cycle=566 time=9.9492187499998974e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.43e+06 wsec_total=2.00e+01 wsec_step=2.22e-02 zone-cycles/wsec=9.09e+05 wsec_AMR=1.28e-02

In total, we have 25 separate allocation events (all reserving, non dynamic) over the runtime for pack_szize=-1 and 70 (4 reserving and 66 dynamic) for pack_size=1.

How does that solution look?
If people are happy, I'll cleaup the PR (i.e., add a changelog and remove the debug std:cerr).

Finally, small related question: What's going on with this line (I understand the first part of the conditional, but not the second -- or more specifically, what are the circumstances where the second one is false):
if (pmb->gid == nb.gid && nb.offsets.IsCell()) buf_size = 0;

Copy link
Collaborator

@lroberts36 lroberts36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach makes sense to me.

Comment on lines 95 to 97
std::cerr << "Reserving " << new_buffers_req << " new buffers of size " << buf_size
<< " to pool with " << pool.NumBuffersInPool() << " buffers because "
<< nbufs.at(buf_size) << " are required in total.\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a flag to possibly suppress the std::cerr outputs?

@lroberts36
Copy link
Collaborator

Alright, I think I figured out (and understand) more now. The original BuildBoundaryBufferSubset never actually added objects to the pool, it only created a pool with a callback lambda that would eventually add objects to the pool dynamically, right?

Yes, this is correct.

How does that solution look? If people are happy, I'll cleaup the PR (i.e., add a changelog and remove the debug std:cerr).

This solution looks reasonable to me. Sorry, saw the comment about removing std::cerr after reviewing.

Finally, small related question: What's going on with this line (I understand the first part of the conditional, but not the second -- or more specifically, what are the circumstances where the second one is false): if (pmb->gid == nb.gid && nb.offsets.IsCell()) buf_size = 0;

Multigrid logic requires blocks sending messages to themselves (since the same block can show up on two multigrid levels). This doesn't require any data transfer, so the message size can be zero. It is essentially just a flag to show that the block is done being used on one level and can be used on the next level.

@pgrete
Copy link
Collaborator Author

pgrete commented Aug 22, 2024

I cleaned up the PR. This is ready for a second review now.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me. I appreciate the new detailed comments.

@Yurlungur Yurlungur enabled auto-merge August 22, 2024 16:45
@Yurlungur Yurlungur merged commit 43380cd into develop Aug 22, 2024
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants