This repository contains an implementation of a compact-hashing-based neighborhood search for 1D, 2D and 3D data for PyTorch using a C++/CUDA backend.
Requirements:
- PyTorch >= 2.0
- numpy (not used in the computations)
- subprocess (for compilation)
The module is built just-in-time on first import in a given Python environment; this build may take a few (<5) minutes. Note that on macOS an external clang compiler installed via Homebrew is required for OpenMP support.
This package provides two primary functions, `radius` and `radiusSearch`. `radius` is designed as a drop-in replacement for torch_cluster's `radius` function, whereas `radiusSearch` is the preferred interface. Important: `radius` and `radiusSearch` return index pairs in flipped order!
The `radiusSearch` method is defined as follows (`radius` adds additional `batch_x` and `batch_y` arguments after `support` for compatibility):
def radiusSearch(
    queryPositions : torch.Tensor,
    referencePositions : Optional[torch.Tensor],
    support : Union[float, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
    mode : str = 'gather',
    domainMin : Optional[torch.Tensor] = None,
    domainMax : Optional[torch.Tensor] = None,
    periodicity : Optional[Union[bool, List[bool]]] = None,
    hashMapLength = 4096,
    algorithm : str = 'naive',
    verbose : bool = False,
    returnStructure : bool = False
)
- `queryPositions` is an $n_x \times d$ Tensor that contains the set of points that are related to the other set
- `referencePositions` is an $n_y \times d$ Tensor that contains the reference set of points, i.e., the points for which relations are queried
- `support` determines the cut-off radius for the radius search. This value is either a scalar float, i.e., every point has an identical cut-off radius, a single Tensor of size $n_x$ that contains a different cut-off radius for every point in `queryPositions`, or a tuple of Tensors, one for each point set.
- `mode` determines the method used to compute the cut-off radius of point-to-point interactions. Options are (a) `gather`, which uses only the cut-off radius of the `queryPositions`, (b) `scatter`, which uses only the cut-off radius of the `referencePositions`, and (c) `symmetric`, which uses the mean cut-off radius.
- `domainMin` and `domainMax` are required for periodic neighborhood searches and define the coordinates at which the positions wrap around
- `periodicity` indicates if a periodic neighborhood search is to be performed, as either a bool (applied to all dimensions) or a list of bools (one per dimension)
- `hashMapLength` sets the internal length of the hash map used in the compact data structure; it should be close to $n_x$
- `verbose` prints additional logging information to the console
- `returnStructure` decides if the `compact` algorithm should return its data structure for reuse in later searches
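To make the interplay of `support`, `mode` and `periodicity` concrete, the following is a minimal plain-Python reference sketch. It is not the library's implementation: the helper names (`pair_cutoff`, `minimum_image`, `brute_force_search`), the inclusive `<=` distance test and the minimum-image wrapping are assumptions made for illustration only.

```python
import math

def pair_cutoff(h_query, h_reference, mode):
    """Cut-off radius for one query/reference pair under the given mode."""
    if mode == 'gather':
        return h_query
    if mode == 'scatter':
        return h_reference
    if mode == 'symmetric':
        return 0.5 * (h_query + h_reference)
    raise ValueError(f'unknown mode: {mode}')

def minimum_image(delta, lo, hi):
    """Wrap a coordinate difference into [-L/2, L/2) with L = hi - lo."""
    length = hi - lo
    return delta - length * math.floor(delta / length + 0.5)

def brute_force_search(x, y, hx, hy, mode='gather',
                       domain_min=None, domain_max=None, periodicity=None):
    """All-pairs reference search; returns (query indices, reference indices)."""
    dim = len(x[0])
    if periodicity is None:
        periodicity = [False] * dim
    elif isinstance(periodicity, bool):
        periodicity = [periodicity] * dim
    i_idx, j_idx = [], []
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dist2 = 0.0
            for d in range(dim):
                delta = xi[d] - yj[d]
                if periodicity[d]:
                    delta = minimum_image(delta, domain_min[d], domain_max[d])
                dist2 += delta * delta
            # assumption: a pair counts as neighbors if distance <= cut-off
            if dist2 <= pair_cutoff(hx[i], hy[j], mode) ** 2:
                i_idx.append(i)
                j_idx.append(j)
    return i_idx, j_idx

# 1D toy data on the periodic domain [-1, 1]
x = [[-0.9], [0.0]]
y = [[0.9], [0.4]]
hx, hy = [0.25, 0.25], [0.5, 0.5]
print(brute_force_search(x, y, hx, hy, 'gather', [-1.0], [1.0], True))   # ([0], [0])
print(brute_force_search(x, y, hx, hy, 'scatter', [-1.0], [1.0], True))  # ([0, 1], [0, 1])
```

In the toy data the points at -0.9 and 0.9 are only neighbors because the periodic wrap shortens their distance to 0.2, and the pair (0.0, 0.4) is only found in `scatter` mode because only the reference support of 0.5 covers that distance.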
For the `algorithm` parameter the following 4 options exist:
- `naive`: This algorithm computes a dense distance matrix of size $n_x \times n_y \times d$ and performs the adjacency computations on this dense representation. This requires significant amounts of memory but is very straightforward and potentially differentiable. Complexity: $\mathcal{O}\left(n^2\right)$
- `cluster`: This is a wrapper around torch_cluster's `radius` search and is only available if that package is installed. Note that this algorithm supports neither periodic neighbor searches nor non-uniform cut-off radii. It is also limited to a fixed maximum number of neighbors ($256$). Complexity: $\mathcal{O}\left(n^2\right)$
- `small`: This algorithm is similar to `cluster` in its implementation and computes an everything-against-everything distance on-the-fly, i.e., it does not require large intermediate storage. It first computes the number of neighbors per particle and then allocates the required memory. Accordingly, this approach is slower than `cluster` but more versatile. Complexity: $\mathcal{O}\left(n^2\right)$
- `compact`: The primary algorithm of this library. This approach uses compact hashing and a cell-based data structure to compute neighborhoods in $\mathcal{O}\left(n\log n\right)$. The idea is based on "A parallel SPH implementation on multi-core CPUs" and the GPU approach is based on "Multi-Level Memory Structures for Simulating and Rendering SPH". Note that this implementation is not optimized for adaptive simulations.
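The cell-based idea behind `compact` can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the library's C++/CUDA code: it assumes a single uniform support `h`, no periodicity, a cell size equal to `h`, and a commonly used XOR-of-primes spatial hash. All function names and the prime constants are hypothetical choices for this sketch.

```python
import itertools
import math

# Assumption: primes commonly used for spatial hashing, one per dimension
PRIMES = (73856093, 19349663, 83492791)

def cell_of(p, domain_min, h):
    """Integer cell coordinates of point p on a grid with cell size h."""
    return tuple(int(math.floor((p[d] - domain_min[d]) / h)) for d in range(len(p)))

def hash_cell(cell, hash_map_length):
    """Map integer cell coordinates to a bucket in [0, hash_map_length)."""
    key = 0
    for c, prime in zip(cell, PRIMES):
        key ^= c * prime
    return key % hash_map_length

def build_structure(y, domain_min, h, hash_map_length=4096):
    """Sort reference points by cell and hash only the occupied cells."""
    cells = [cell_of(p, domain_min, h) for p in y]
    order = sorted(range(len(y)), key=lambda j: cells[j])
    occupied = []  # compact list of [cell, begin, end) ranges into `order`
    for pos, j in enumerate(order):
        if occupied and occupied[-1][0] == cells[j]:
            occupied[-1][2] = pos + 1
        else:
            occupied.append([cells[j], pos, pos + 1])
    table = {}     # bucket -> indices into `occupied`
    for k, entry in enumerate(occupied):
        table.setdefault(hash_cell(entry[0], hash_map_length), []).append(k)
    return order, occupied, table

def compact_search(x, y, h, domain_min, hash_map_length=4096):
    """Find all (i, j) with |x_i - y_j| <= h by visiting the 3^d adjacent cells."""
    order, occupied, table = build_structure(y, domain_min, h, hash_map_length)
    offsets = list(itertools.product((-1, 0, 1), repeat=len(domain_min)))
    i_idx, j_idx = [], []
    for i, xi in enumerate(x):
        base = cell_of(xi, domain_min, h)
        for off in offsets:
            cell = tuple(b + o for b, o in zip(base, off))
            for k in table.get(hash_cell(cell, hash_map_length), []):
                stored, begin, end = occupied[k]
                if stored != cell:  # hash collision: bucket holds another cell
                    continue
                for j in order[begin:end]:
                    if sum((a - b) ** 2 for a, b in zip(xi, y[j])) <= h * h:
                        i_idx.append(i)
                        j_idx.append(j)
    return i_idx, j_idx
```

Because the cell size equals the cut-off radius, every neighbor of a query point lies in one of the $3^d$ cells around it, so each query only inspects a bounded number of cells instead of all $n_y$ reference points. Only occupied cells are stored, which is what keeps the structure compact; hash collisions are resolved by comparing the stored cell coordinates.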
Example:
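For this example we generate two separate point clouds, `x` on $[-1,1]^3$ and `y` on $[-0.5,0.5]^3$, choose the support radius for a target of 50 neighbors (using `volumeToSupport`) and then search for the neighbors between the two sets: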
from torchCompactRadius import radiusSearch, volumeToSupport
from torchCompactRadius.util import countUniqueEntries
import torch
import platform
# Parameters for data generation
dim = 3
periodic = True
nx = 32
targetNumNeighbors = 50
# Choose accelerator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if platform.system() == 'Darwin':
    device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# bounds for data
minDomain = torch.tensor([-1] * dim, dtype = torch.float32, device = device)
maxDomain = torch.tensor([ 1] * dim, dtype = torch.float32, device = device)
periodicity = [periodic] * dim
extent = maxDomain - minDomain
shortExtent = torch.min(extent, dim = 0)[0].item()
dx = (shortExtent / nx)
h = volumeToSupport(dx**dim, targetNumNeighbors, dim)
dy = dx
# generate particle set x
positions = [torch.linspace(minDomain[d] + dx / 2, maxDomain[d] - dx / 2, int((extent[d] - dx) / dx) + 1, device = device) for d in range(dim)]
x = torch.stack(torch.meshgrid(*positions, indexing = 'xy'), dim = -1).reshape(-1,dim).to(device)
xSupport = torch.ones(x.shape[0], device = device) * h
# generate particle set y
ypositions = [torch.linspace(-0.5 + dx / 2, 0.5 - dx / 2, int(1 // dx), device = device) for d in range(dim)]
y = torch.stack(torch.meshgrid(*ypositions, indexing = 'xy'), dim = -1).reshape(-1,dim).to(device)
ySupport = torch.ones(y.shape[0], device = device) * h * 2
i, j = radiusSearch(x, y, (xSupport, ySupport), algorithm = 'compact', periodicity = periodic, domainMin = minDomain, domainMax = maxDomain, mode = 'symmetric')
ii, ni = countUniqueEntries(i, x)
jj, nj = countUniqueEntries(j, y)
print('i:', i.shape, i.device, i.dtype)
print('ni:', ni.shape, ni.device, ni.dtype, ni)
print('j:', j.shape, j.device, j.dtype)
print('nj:', nj.shape, nj.device, nj.dtype, nj)
This should output:
i: torch.Size([700416]) cuda:0 torch.int64
ni: torch.Size([32768]) cuda:0 torch.int64 tensor([0, 0, 0, ..., 0, 0, 0], device='cuda:0')
j: torch.Size([700416]) cuda:0 torch.int64
nj: torch.Size([4096]) cuda:0 torch.int64 tensor([171, 171, 171, ..., 171, 171, 171], device='cuda:0')
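The numbers in this output can be cross-checked with a short back-of-the-envelope computation. The sketch below assumes that the support radius is chosen so that a ball of radius `h` has the volume of `targetNumNeighbors` particles. `support_for_neighbors` is a hypothetical stand-in written for this check, not the library's `volumeToSupport`.

```python
import math

def support_for_neighbors(volume, target_neighbors, dim):
    """Radius of the dim-ball whose volume equals target_neighbors * volume (assumed formula)."""
    if dim == 1:
        return target_neighbors * volume / 2
    if dim == 2:
        return math.sqrt(target_neighbors * volume / math.pi)
    if dim == 3:
        return (3 * target_neighbors * volume / (4 * math.pi)) ** (1 / 3)
    raise ValueError('dim must be 1, 2 or 3')

def ball_volume(radius, dim):
    """Volume of a ball of the given radius in 1, 2 or 3 dimensions."""
    return {1: 2 * radius,
            2: math.pi * radius ** 2,
            3: 4 / 3 * math.pi * radius ** 3}[dim]

dim, nx, target = 3, 32, 50
dx = 2 / nx                                   # spacing on [-1, 1]
h = support_for_neighbors(dx ** dim, target, dim)

n_x = nx ** dim                               # 32768 points in x
n_y = int(1 // dx) ** dim                     # 4096 points in y

# mode = 'symmetric' averages the supports h (x) and 2h (y)
h_sym = 0.5 * (h + 2 * h)
expected_neighbors = ball_volume(h_sym, dim) / dx ** dim  # approximately 168.75

pairs = n_y * 171                             # 700416, the size of i and j
print(n_x, n_y, round(expected_neighbors, 2), pairs)
```

Under this assumption the continuum estimate of about 169 neighbors per point in `y` is close to the 171 reported in `nj`, with the difference coming from counting lattice points rather than volume. Since $1.5h$ is well below 0.5, every point in `y` sees a full ball of points from `x`, which is consistent with all entries of `nj` being equal and with $4096 \times 171 = 700416$ pairs in total.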
If you want to evaluate the performance on your system, simply run `scripts/benchmark.py`, which will generate a `Benchmark.png` for various point counts, algorithms and dimensions.
[Figure: compute performance on GPUs for small scale problems; panels: 3090, A5000]
[Figure: CPU performance]
[Figure: overall GPU based performance for larger scale problems]
If you want to check that your version of this library works correctly, simply run `python scripts/test.py`. This simple test function runs a variety of configurations and the output will look like this:
periodic = True, reducedSet = True, algorithm = naive device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = True, reducedSet = True, algorithm = small device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = True, reducedSet = True, algorithm = cluster device = cpu ❌❌❌❌❌❌ device = cuda ❌❌❌❌❌❌
periodic = True, reducedSet = True, algorithm = compact device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = True, reducedSet = False, algorithm = naive device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = True, reducedSet = False, algorithm = small device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = True, reducedSet = False, algorithm = cluster device = cpu ❌❌❌❌❌❌ device = cuda ❌❌❌❌❌❌
periodic = True, reducedSet = False, algorithm = compact device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = True, algorithm = naive device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = True, algorithm = small device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = True, algorithm = cluster device = cpu ✅❌❌❌❌❌ device = cuda ✅❌❌❌❌❌
periodic = False, reducedSet = True, algorithm = compact device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = False, algorithm = naive device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = False, algorithm = small device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
periodic = False, reducedSet = False, algorithm = cluster device = cpu ✅❌❌❌❌❌ device = cuda ✅❌❌❌❌❌
periodic = False, reducedSet = False, algorithm = compact device = cpu ✅✅✅✅✅✅ device = cuda ✅✅✅✅✅✅
The `cluster` algorithm fails because torch_cluster's implementation does not support periodic neighborhood searches or searches with non-uniform cut-off radii.
- Add AMD support
- Wrap periodic neighborhood search and non-symmetric neighborhoods around torch_cluster
- Add automatic choice of algorithm based on performance
- Add binary distributions