Alternative CUDA implementation of seed finding #230

Draft: stephenswat wants to merge 2 commits into main from feat/cuda_seeding_2
Conversation

@stephenswat (Member) commented on Sep 11, 2022

This is a work-in-progress CUDA implementation of an alternative seed finding algorithm based on dynamic range searching.

@stephenswat added the feature (New feature or request) and cuda (Changes related to CUDA) labels on Sep 11, 2022
@stephenswat self-assigned this on Sep 11, 2022
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from a208c63 to 2799ec6 on September 21, 2022
@stephenswat (Member, Author) commented on Sep 21, 2022

This pull request is now ready for review. This implementation avoids the additional space and compute time incurred by the counting kernels, and it does not use binning. It is still a relatively simple implementation with lots of optimization left to be done, but the performance results so far are reasonable and competitive with the existing CUDA code:

[Performance comparison plot]

Here is the Nsight Systems profile for ⟨μ⟩ = 300:

[Nsight Systems profile screenshot, 2022-09-21]
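(For context: the counting kernels implement the usual two-pass GPU output pattern, in which one kernel first counts the outputs per element so that a second kernel can fill an exactly-sized buffer. Below is a generic sketch of the single-pass alternative using an atomic output cursor; all names and the compatibility predicate are hypothetical, not taken from this PR.)

```cuda
// Generic sketch (hypothetical names) of a single-pass output pattern that
// replaces a separate counting kernel: each thread reserves output slots
// with an atomic increment and writes its candidates directly, so no
// per-element count buffer or extra kernel launch is needed.
__global__ void find_pairs_single_pass(const float* __restrict__ radii,
                                       unsigned int n_sp,
                                       uint2* __restrict__ out_pairs,
                                       unsigned int* out_count,
                                       unsigned int out_capacity) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_sp) {
        return;
    }
    for (unsigned int j = i + 1; j < n_sp; ++j) {
        // Hypothetical stand-in for the real compatibility cuts.
        if (fabsf(radii[j] - radii[i]) < 10.f) {
            // Reserve one slot in the output; no prior counting pass.
            unsigned int idx = atomicAdd(out_count, 1u);
            if (idx < out_capacity) {
                out_pairs[idx] = make_uint2(i, j);
            }
        }
    }
}
```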

@stephenswat marked this pull request as ready for review on September 21, 2022
@beomki-yeo (Contributor) commented:
I will take a look at this tomorrow. BTW, do you also have a CPU benchmark?

@stephenswat (Member, Author) commented on Sep 23, 2022

Here is a plot including CPU seeding. Hope this is what you meant. 🙂

[Plot including CPU seeding]

I should note that the benchmark for this new seeding includes the time to convert data from the traccc EDM to a flat data format, which is not really part of the actual execution time. That accounts for roughly one third to one half of the total wall-clock time measured here.

@stephenswat (Member, Author) commented on Sep 23, 2022

Here are some additional plots including an optimistic estimate for the amortized run time over embarrassingly parallel CPU execution, both on atspot01 and on a DAS-6 node with an A100. Sadly, there is a lot of work to do before any of our GPU seeding algorithms become performance-competitive...

[Plot: atspot01]
[Plot: DAS-6 node with A100]

@beomki-yeo (Contributor) left a comment:

I don't have any particular comments, but I would like to check that I understood the implementation correctly: the kd_tree is used to find the leaf cells satisfying the [r, phi, z] conditions for a given spacepoint, and the spacepoints from those cells are combined to make seeds?
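In other words, something like this (my sketch only; all type and field names here are hypothetical, not from the PR):

```cuda
// Sketch of the query pattern described above: walk the k-d tree with an
// explicit stack and collect the spacepoints of every leaf whose
// (r, phi, z) bounding box overlaps the search range.
struct kd_node {
    float lo[3], hi[3];      // Node bounding box in (r, phi, z).
    int left, right;         // Child node indices, or -1 for a leaf.
    unsigned int begin, end; // Spacepoint index range held by a leaf.
};

__device__ bool overlaps(const kd_node& n, const float* q_lo,
                         const float* q_hi) {
    for (int d = 0; d < 3; ++d) {
        if (n.hi[d] < q_lo[d] || n.lo[d] > q_hi[d]) {
            return false;
        }
    }
    return true;
}

__device__ unsigned int range_query(const kd_node* nodes, const float* q_lo,
                                    const float* q_hi, unsigned int* hits) {
    int stack[64];  // Explicit traversal stack (depth bound assumed).
    int top = 0;
    unsigned int n_hits = 0;
    stack[top++] = 0;  // Start at the root node.
    while (top > 0) {
        const kd_node& n = nodes[stack[--top]];
        if (!overlaps(n, q_lo, q_hi)) {
            continue;
        }
        if (n.left < 0) {  // Leaf: collect its spacepoints as candidates.
            for (unsigned int i = n.begin; i < n.end; ++i) {
                hits[n_hits++] = i;
            }
        } else {
            stack[top++] = n.left;
            stack[top++] = n.right;
        }
    }
    return n_hits;
}
```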

*
* @return True iff the pair is valid.
*/
__device__ bool is_valid_pair(const seedfinder_config finder_conf,
(Contributor) commented on the code above:

This function looks very similar to doublet_finding_helper::isCompatible. Is it possible to utilize it?
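Something along these lines, perhaps (a sketch only; the isCompatible argument list is assumed here, not taken from the actual helper):

```cpp
// Hypothetical sketch: delegate to the existing helper instead of
// duplicating the cut logic. The argument list of isCompatible is
// assumed and would need to match the real helper's signature.
__device__ bool is_valid_pair(const seedfinder_config finder_conf,
                              const internal_spacepoint<spacepoint>& lo,
                              const internal_spacepoint<spacepoint>& hi) {
    return doublet_finding_helper::isCompatible(lo, hi, finder_conf);
}
```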

@beomki-yeo (Contributor) commented on Sep 23, 2022

> Here is a plot including CPU seeding. Hope this is what you meant.

Ah yes, that is what I meant. Thanks for the additional plots!
I also observed a long time ago that the original CUDA seeding gets slower on high-end chips, which is very unsatisfactory. I guess it is not fully utilizing the GPU resources on the device.

@krasznaa (Member) left a comment:

I'm really keen on getting this one in, in a format that would later allow it to work with other languages (Kokkos, SYCL) as well. Algorithm-wise I don't really have any comments...

@@ -0,0 +1,24 @@
/** TRACCC library, part of the ACTS project (R&D line)
(Member) commented:

Does this need to be a public header? Should it not just live in the src/ directory?

(Member, Author) replied:

In principle, yes. I was in two minds about this because generalized acceleration structure construction code may be useful elsewhere, but we can put it in a private header as well.

@@ -0,0 +1,35 @@
/** TRACCC library, part of the ACTS project (R&D line)
(Member) commented:

I guess it's really just this header that strictly needs to be public, no? 🤔

std::array<uint32_t, 3> spacepoints;
float weight;

__host__ __device__ static internal_seed Invalid() {
(Member) commented:

Even if it's made into a private header, I would still prefer to see TRACCC_HOST_AND_DEVICE here.

(Member, Author) replied:

Agreed.
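For illustration, the suggested change on the declaration above would look roughly like this (header path assumed):

```cpp
// Sketch of the suggested change: traccc's portability macro in place of
// the raw __host__ __device__ qualifiers on the declaration inside
// internal_seed (body elided, as in the excerpt above).
#include "traccc/definitions/qualifiers.hpp"

TRACCC_HOST_AND_DEVICE static internal_seed Invalid();
```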

Comment on lines 37 to 61
std::vector<internal_spacepoint<spacepoint>> spacepoints;

for (std::size_t i = 0; i < p.get_items().size(); ++i) {
for (std::size_t j = 0; j < p.get_items().at(i).size(); ++j) {
const traccc::spacepoint& sp = p.get_items().at(i).at(j);

spacepoints.push_back(
internal_spacepoint<spacepoint>(p, {i, j}, {0.f, 0.f}));
}
}
(Member) commented:

Would it not be more efficient to pre-compute the number of elements, allocate enough memory for them in one go, and then not re-allocate all the time during the filling?
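Something like this, mirroring the loop above (a sketch):

```cpp
// Sketch: count the elements once, reserve the full size up front, and
// then fill the vector without intermediate re-allocations.
std::size_t n_sp = 0;
for (std::size_t i = 0; i < p.get_items().size(); ++i) {
    n_sp += p.get_items().at(i).size();
}

std::vector<internal_spacepoint<spacepoint>> spacepoints;
spacepoints.reserve(n_sp);

for (std::size_t i = 0; i < p.get_items().size(); ++i) {
    for (std::size_t j = 0; j < p.get_items().at(i).size(); ++j) {
        spacepoints.push_back(
            internal_spacepoint<spacepoint>(p, {i, j}, {0.f, 0.f}));
    }
}
```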

(Member, Author) replied:

In principle yes, but on the other hand this is just data pre-processing that would not really exist in any sort of production code, so it's not performance-relevant per se.

(Member) replied:

I don't understand. 😕 The container structures that we use in this project did not come out of thin air. We have the container layout for spacepoints that we have because in ATLAS we use such a layout. So such conversions may very well need to stay with us in the long term.

(Member, Author) replied:

I agree that these data conversions make some sense when moving data from the host to the device, but obviously such movements would be minimized in a production setting, and it makes little sense for data produced by another GPU algorithm to be in such a format. So I would consider this more of an edge case than something that would be executed with non-negligible frequency. 😕

Comment on lines 51 to 62
vecmem::unique_alloc_ptr<internal_spacepoint<spacepoint>[]>
spacepoints_device =
vecmem::make_unique_alloc<internal_spacepoint<spacepoint>[]>(
mr, n_sp, spacepoints.data(),
[](internal_spacepoint<spacepoint>* dst,
const internal_spacepoint<spacepoint>* src,
std::size_t bytes) {
CUDA_ERROR_CHECK(
cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
});

return {std::move(spacepoints_device), n_sp};
(Member) commented:

Why not a vecmem::data::vector_buffer<traccc::internal_spacepoint<traccc::spacepoint> >? Then you wouldn't need to invent a custom data type here.

(Member, Author) replied:

Not sure I understand the comment here; this is using vecmem data types, nothing new?

(Member) replied:

You're passing an "owning pointer" and a size in an std::tuple here. Instead of just using vecmem::data::vector_buffer, which is exactly meant for this type of data storage...
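For illustration, roughly like this (a sketch; the vecmem copy-helper calls are written from memory, so the exact signatures should be double-checked):

```cpp
#include <vecmem/containers/data/vector_buffer.hpp>
#include <vecmem/utils/cuda/copy.hpp>

// Sketch: store the spacepoints in a vector_buffer instead of a raw
// unique_alloc_ptr plus a size in a tuple.
vecmem::data::vector_buffer<internal_spacepoint<spacepoint>> sp_buffer(
    static_cast<unsigned int>(n_sp), mr);

vecmem::cuda::copy copy;
copy.setup(sp_buffer);
copy(vecmem::get_data(spacepoints), sp_buffer,
     vecmem::copy::type::host_to_device);

return sp_buffer;
```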

@stephenswat removed their assignment on Jan 11, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from 9268c6b to fd38be0 on February 16, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from d22ab31 to 5bc9530 on March 8, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from b5a9e4c to 969c0dd on March 16, 2023
@stephenswat marked this pull request as a draft on March 17, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from b461bb9 to a132f0b on March 17, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from 5ad3ab2 to f7c2387 on March 26, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch from f7c2387 to c79b097 on June 25, 2024
@stephenswat force-pushed the feat/cuda_seeding_2 branch from c79b097 to f3c0e13 on June 25, 2024