Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nbody wall damping #33

Open
wants to merge 11 commits into
base: nanobind
Choose a base branch
from
Open

Nbody wall damping #33

wants to merge 11 commits into from

Conversation

rykerfish
Copy link
Contributor

@rykerfish rykerfish commented Jan 24, 2025

DEPENDS ON #32, MERGE AFTER. I thought doing this on top of the nanobind branch would be a good test- I ran into no issues. Closes #31. Also note: this is supposed to go into main, but I set it to the nanobind branch for the moment so all the commits from that branch don't show. Sorry for doing it a bit strange!

This branch contains the damping code for the NBody wall kernels that ensures they maintain positive definiteness. I tested it by making a check for all positive eigenvalues that failed before the damping and passed afterwards. It maintains symmetry as well.

Note that this doesn't change the reference files for the wall mobility tests- those were all generated with the minimum height exactly one hydrodynamic radius from the wall, which is right before the damping code kicks in.

I also tried to make it so that $\mathcal{M}_{UF}$ is the only block computed when torques are not given, and it makes the force computation about 2x as fast as the computation with torques. I'm assuming there's a substantial overhead that lowers it from being 4x faster than torques as you'd expect since it is only computing 1/4 of the mobility blocks. @RaulPPelaez could you take a look and see if you have recommendations for better ways to implement the speed up? My current method is probably a bit clunky.

Timings- all units are seconds

Algorithm: naive

NBody NBody_wall
force 1.7602e-03 2.9738e-03
torque 3.2945e-03 5.7371e-03

Algorithm: fast

NBody NBody_wall
force 2.5244e-03 3.3512e-03
torque 4.1480e-03 7.0016e-03

Algorithm: block

NBody NBody_wall
force 1.2875e-03 1.2934e-03
torque 2.0957e-03 2.1091e-03

@rykerfish rykerfish changed the base branch from main to nanobind January 24, 2025 23:15
rykerfish and others added 3 commits January 25, 2025 00:06
* Switch to nanobind

* Update env

* Use FetchContent for nanobind

* Formatting

* Allow for any CPU tensor to be consumed by the interface

* Add test

* Add tensor provider tests

* Update CI

* Update CI

* Update CI

* Add multidevice cached allocator. Avoids temporary memory allocations
Modify C++ interface to consume device_span instead of raw pointers
Add device_adapter, allowing solvers to consume data in their desired
device
Only ported SelfMobility

* Add the rest of the solvers

* Add spdlog and gtest
tweak CMake debug flags

* Finish allocator implementation
Update and add some tests
Fix lingering device_adaptor issues

* Small changes

* Small change

* Fix bug in Nbody

* Fix some tests.
Make sure NBody wall tests use particles far away from the wall

* Formatting

* Remove spdlog

* Update interface version

* Documentation

* Documentation

* Update python bindings

* Remove spdlog in CMake

* Update test

* Add some overloads to device_span

* Update CPP example

* Update CPP example

* Add some constructors to device_span

* Update cpp example

* Add some allocator tests

* Move some files around

* fix for compiling in double

* fix for CI to compile examples

* Tweak example CMake

* Revert "Tweak example CMake"

This reverts commit 365bea7.

---------

Co-authored-by: Ryker Fish <[email protected]>
@RaulPPelaez
Copy link
Contributor

@rykerfish, could you upload the benchmark script? Perhaps under examples/python?

@RaulPPelaez
Copy link
Contributor

At first glance, I would say that you are seeing torques being only x2 slower because these kernels are bandwidth-limited (as its usual for GPUs).
With torques you move twice the memory (loading torques and writing ang. disp), so yeah, you have x4 less computation, but computation is free compared with memory store/load.
BTW very funny that the algorithm that we call "fast" I actually the slowest. I reckon this is a low number of particles?
Also, surprising that block seems to be immune to the wall correction, you sure that is right?

@RaulPPelaez
Copy link
Contributor

The Nbody wall fluctuation test is still failing with the dampening. Perhaps it allows for some leeway, but the matrix is still non-SPD too close to the wall?

@RaulPPelaez
Copy link
Contributor

I made some changes, please check that I did not screwed up the performance.

We could squeeze some cycles here and there, but AFAICT your implementation is solid, we should leave it as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NBody wall kernels lose SPD property close to a wall
2 participants