-
Notifications
You must be signed in to change notification settings - Fork 673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454
base: main
Are you sure you want to change the base?
feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454
Conversation
…sonal repository Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Thank you for contributing to the Autoware project! 🚧 If your pull request is in progress, switch it to draft mode. Please ensure:
|
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
…pointcloud changes after the first iteration Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
…ntcloud_preprocessing
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the amazing PR, these performance improvements are desperately needed.
I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.
The main points I'd like to address are
- memory safety and idiomatic C++ (there is currently a lot of raw-pointer code which should be avoided whenever possible)
- modulatiry: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and hard to maintain individual modules in the pipeline.
Thank you for your time!
self_crop.min_x: 1.0 | ||
self_crop.min_y: 1.0 | ||
self_crop.min_z: 1.0 | ||
self_crop.max_x: -1.0 | ||
self_crop.max_y: -1.0 | ||
self_crop.max_z: -1.0 | ||
mirror_crop.min_x: 1.0 | ||
mirror_crop.min_y: 1.0 | ||
mirror_crop.min_z: 1.0 | ||
mirror_crop.max_x: -1.0 | ||
mirror_crop.max_y: -1.0 | ||
mirror_crop.max_z: -1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having two hard-coded crop-box filters here, a list would be more easily extensible and it should be quite straight-forward to change the implementation.
sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md
Outdated
Show resolved
Hide resolved
sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md
Outdated
Show resolved
Hide resolved
|
||
std::size_t max_ring = 0; | ||
|
||
for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.
num_rings_ = std::max(num_rings_, static_cast<std::size_t>(16)); | ||
std::vector<std::size_t> ring_points(num_rings_, 0); | ||
|
||
for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.
max_ring = std::max(max_ring, ring); | ||
} | ||
|
||
// Set max rings to the next power of two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly kind of a niche problem, but not all sensors (Pandar40P) have 2^n
rings.
Although auto-detecting the number of rings is nice, it has no hard guarantee to be accurate (e.g. the sensor is under a cover when turned on and there are thus no points in the cloud).
Does cuda_pointcloud_preprocessor support changing dimenions of input pointclouds across iterations (e.g. starts with 0 rings in cloud 1, then 64 rings with 2000 points, then 64 rings with 5000 points each)?
If not, I'd suggest to make n_rings
and max_points_per_ring
parameters so that we can guarantee correct behavior at runtime.
bool CudaOrganizedPointcloudAdapterNode::orderPointcloud( | ||
const sensor_msgs::msg::PointCloud2::ConstSharedPtr input_pointcloud_msg_ptr) | ||
{ | ||
const autoware::point_types::PointXYZIRCAEDT * input_buffer = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about bounds/type checking as above 🙇
if (idx < num_points && masks[idx] == 1) { | ||
output_points[indices[idx] - 1] = input_points[idx]; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two functions are identical except for their argument types. Consider making one templated function instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knzo25
Thank you very much for proposing a fantastic PR, and I'm sorry for taking a long time for the review. From a viewpoint of CUDA usage, I left some comments. I'd appreciate it if you could consider them.
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
TransformStruct transform) | ||
{ | ||
int idx = blockIdx.x * blockDim.x + threadIdx.x; | ||
if (idx < num_points) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (idx < num_points) { | |
if (num_points <= idx) { | |
return; | |
} | |
{ |
I would suggest applying "early-return" to threads that have out-of-range idx
to prevent warp divergence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would warps not diverge anyways?
I had understood that under warp divergence, the cost would simply be the cost of all paths
https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf
Eons ago I remember that for old CPU compilers the most common path should go first. I know that GPUs do not have branch prediction, which renders this concern null, but this became a habit.
(one of the reasons I wanted to write it like this was to put later strided loops, which may affect the coordinate transform, avoiding having to load the transformation at every instance)
} | ||
|
||
__global__ void cropBoxKernel( | ||
InputPointType * d_points, uint32_t * output_mask, int num_points, float min_x, float min_y, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputPointType * d_points, uint32_t * output_mask, int num_points, float min_x, float min_y, | |
InputPointType * __restrict__ d_points, uint32_t * output_mask, int num_points, float min_x, float min_y, |
Same as above comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in another comment, I planned to leave __restrict__
and strided loops for the next PR, but for the __reestrict__
there is no much meaning in waiting.
That being said, I had heard that the compiler will only consider the keyword if all the pointers have it. Is that true?
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
uint32_t * mask1, uint32_t * mask2, uint32_t * mask3, int num_points, uint32_t * output_mask) | ||
{ | ||
int idx = blockIdx.x * blockDim.x + threadIdx.x; | ||
if (idx < num_points) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (idx < num_points) { | |
if (num_points <= idx) { | |
return; | |
} | |
{ |
early-return same as the above comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replied in an earlier comment (just to signal that it is being discussed elsewhere)
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
float theta = twist.cum_theta; | ||
|
||
double dt_nsec = | ||
point.time_stamp > twist.last_stamp_nsec ? point.time_stamp - twist.last_stamp_nsec : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
point.time_stamp > twist.last_stamp_nsec ? point.time_stamp - twist.last_stamp_nsec : 0; | |
// point.time_stamp > twist.last_stamp_nsec ? point.time_stamp - twist.last_stamp_nsec : 0; | |
static_cast<uint32_t>(point.time_stamp > twist.last_stamp_nsec) * (point.time_stamp - twist.last_stamp_nsec); |
This may decrease readability but be preferred in terms of all threads executes the same instruction without branch. I believe other ternary operators in other CUDA kernels can be replaced in a similar way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK your suggestion would not be faster.
Branching is masked and not slower (if I understood correctly).
From what I saw on forums, the compiler will decide to use a branch or fmax(point.time_stamp - twist.last_stamp_nsec, 0.0)
depending on the architecture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old evidence:
https://gist.github.com/allanmac/5125376
|
||
// Find biggest walk that passes through i | ||
float azimuth_diff = right_point.azimuth - left_point.azimuth; | ||
azimuth_diff = azimuth_diff < 0.f ? azimuth_diff + 2 * M_PI : azimuth_diff; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
azimuth_diff = azimuth_diff < 0.f ? azimuth_diff + 2 * M_PI : azimuth_diff; | |
azimuth_diff = azimuth_diff + (2 * M_PI) * static_cast<float>(azimuth_diff < 0.f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
// Copy to device | ||
device_twist_3d_structs_ = host_twist_structs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this code may cause GPU memory fragmentation and/or repeated implicit GPU memory allocation, both of which harm performance. To prevent such kind of issues, I would suggest using CUDA's memory pool functionality. I'll add suggestion code to the constructor of CudaPointcloudPreprocessor
.
point_fields_.push_back(intensity_field); | ||
point_fields_.push_back(return_type_field); | ||
point_fields_.push_back(channel_field); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
cudaMemPoolProps pool_props; | |
memset(&pool_props, 0, sizeof(cudaMemPoolProps)); | |
pool_props.allocType = cudaMemAllocationTypePinned; | |
pool_props.handleTypes = cudaMemHandleTypePosixFileDescriptor; | |
pool_props.location.type = cudaMemLocationTypeDevice; | |
cudaGetDevice(&(pool_props.location.id)); | |
// cudaMemPool_t device_memory_pool_ needs to be declared as a member of this class | |
cudaMemPoolCreate(&device_memory_pool_, &pool_props); | |
MemoryPoolAllocator<TwistStruct2D> allocator_2d(device_memory_pool_); | |
MemoryPoolAllocator<TwistStruct3D> allocator_3d(device_memory_pool_); | |
device_twist_2d_structs_ | |
= thrust::device_vector<TwistStruct2D, MemoryPoolAllocator<TwistStruct2D>>(allocator_2d); | |
device_twist_3d_structs_ | |
= thrust::device_vector<TwistStruct3D, MemoryPoolAllocator<TwistStruct3D>>(allocator_3d); | |
} | |
template <typename T> | |
class MemoryPoolAllocator { | |
public: | |
using value_type = T; | |
MemoryPoolAllocator(cudaMemPool_t pool) : m_pool(pool) {} | |
T* allocate(std::size_t n) { | |
void* ptr = nullptr; | |
cudaMallocFromPoolAsync(&ptr, n * sizeof(T), m_pool, cudaStreamDefault); | |
return static_cast<T*>(ptr); | |
} | |
void deallocate(T* ptr, std::size_t) { | |
cudaFreeAsync(ptr, cudaStreamDefault); | |
} | |
protected: | |
cudaMemPool_t m_pool; | |
}; // MemoryPoolAllocator |
This is a suggestion to apply CUDA's memory pool functionality to the thrust::device_vector
exploiting custom allocator capability.
(I just confirmed this code can pass compilation but not performed test run. If you feel this suggestion is appropriate via another PR, please let me know)
…loud-preprocessor.md Co-authored-by: Max Schmeller <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <[email protected]>
…loud-preprocessor.md Co-authored-by: Max Schmeller <[email protected]>
Co-authored-by: Max Schmeller <[email protected]>
Co-authored-by: Max Schmeller <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Max Schmeller <[email protected]>
Description
This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.
List of PRs:
To use these branches, the following additions to the
autoware.repos
are necessary:Depending on your machine and how many nodes are in a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a but in ROS where if you send too many services at once some will be lost and
ros_launch
can not handle that.Related links
Parent Issue:
How was this PR tested?
The sensing/perception pipeline was tested until centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed in a laptop equipped with a RTX 4060 (laptop) GPU and a Intel(R) Core(TM) Ultra 7 165H (22 cores)
Notes for reviewers
The main branch that I used for development is
feat/cuda_acceleration_and_transport_layer
.However, the changes were too big so I split the PRs. That being said, development, if any will still be on that branch (and then cherrypicked to the respective PRs), and the review changes will be cherrypicked into the development branch.
Interface changes
An additional topic is added to perform type negotiation:
Example:
input/pointcloud
->input/pointcloud
andinput/pointcloud/cuda
Effects on system behavior
Enabling this preprocessing in the launchers should provide a much reduced latency and cpu usage (at the cost of a higher GPU usage)