Alternative CUDA implementation of seed finding #230

Draft: stephenswat wants to merge 2 commits into main from feat/cuda_seeding_2
Conversation

@stephenswat (Member) commented on Sep 11, 2022

This is a work-in-progress CUDA implementation of an alternative seed finding algorithm based on dynamic range searching.

@stephenswat added the feature (New feature or request) and cuda (Changes related to CUDA) labels on Sep 11, 2022
@stephenswat self-assigned this on Sep 11, 2022
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from a208c63 to 2799ec6 on September 21, 2022
@stephenswat (Member, Author) commented on Sep 21, 2022

This pull request is now ready for review. This implementation avoids the additional space and compute time incurred by the counting kernels, and it does not use binning. It is still a relatively simple implementation with lots of optimization left to be done, but the performance results so far are reasonable and competitive with the existing CUDA code:

[Performance comparison plot]

Here is the Nsight Systems profile for ⟨μ⟩ = 300:

[Nsight Systems profile screenshot, 2022-09-21]
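(For context: the counting kernels implement the usual two-pass GPU output pattern, in which one kernel first counts the outputs per element so that a second kernel can fill an exactly-sized buffer. Below is a generic sketch of the single-pass alternative using an atomic output cursor; all names and the compatibility predicate are hypothetical, not taken from this PR.)

```cuda
// Generic sketch (hypothetical names) of a single-pass output pattern that
// replaces a separate counting kernel: each thread reserves output slots
// with an atomic increment and writes its candidates directly, so no
// per-element count buffer or extra kernel launch is needed.
__global__ void find_pairs_single_pass(const float* __restrict__ radii,
                                       unsigned int n_sp,
                                       uint2* __restrict__ out_pairs,
                                       unsigned int* out_count,
                                       unsigned int out_capacity) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_sp) {
        return;
    }
    for (unsigned int j = i + 1; j < n_sp; ++j) {
        // Hypothetical stand-in for the real compatibility cuts.
        if (fabsf(radii[j] - radii[i]) < 10.f) {
            // Reserve one slot in the output; no prior counting pass.
            unsigned int idx = atomicAdd(out_count, 1u);
            if (idx < out_capacity) {
                out_pairs[idx] = make_uint2(i, j);
            }
        }
    }
}
```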

@stephenswat marked this pull request as ready for review on September 21, 2022
@beomki-yeo (Contributor) commented:
I will take a look at this tomorrow. BTW, do you also have a CPU benchmark?

@stephenswat (Member, Author) commented on Sep 23, 2022

Here is a plot including CPU seeding. Hope this is what you meant. 🙂

[Plot including CPU seeding]

I should note that the benchmark for this new seeding includes the time to convert data from the traccc EDM to a flat data format, which is not really part of the actual execution time. That accounts for roughly one third to one half of the total wall-clock time measured here.

@stephenswat (Member, Author) commented on Sep 23, 2022

Here are some additional plots including an optimistic estimate for the amortized run time over embarrassingly parallel CPU execution, both on atspot01 and on a DAS-6 node with an A100. Sadly, there is a lot of work to do before any of our GPU seeding algorithms become performance-competitive...

[Plot: atspot01]
[Plot: DAS-6 node with A100]

@beomki-yeo (Contributor) left a comment:

I don't have any particular comments, but I would like to check that I understood the implementation correctly: the kd_tree is used to find the leaf cells satisfying the [r, phi, z] conditions for a given spacepoint, and the spacepoints from those cells are combined to make seeds?
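In other words, something like this (my sketch only; all type and field names here are hypothetical, not from the PR):

```cuda
// Sketch of the query pattern described above: walk the k-d tree with an
// explicit stack and collect the spacepoints of every leaf whose
// (r, phi, z) bounding box overlaps the search range.
struct kd_node {
    float lo[3], hi[3];      // Node bounding box in (r, phi, z).
    int left, right;         // Child node indices, or -1 for a leaf.
    unsigned int begin, end; // Spacepoint index range held by a leaf.
};

__device__ bool overlaps(const kd_node& n, const float* q_lo,
                         const float* q_hi) {
    for (int d = 0; d < 3; ++d) {
        if (n.hi[d] < q_lo[d] || n.lo[d] > q_hi[d]) {
            return false;
        }
    }
    return true;
}

__device__ unsigned int range_query(const kd_node* nodes, const float* q_lo,
                                    const float* q_hi, unsigned int* hits) {
    int stack[64];  // Explicit traversal stack (depth bound assumed).
    int top = 0;
    unsigned int n_hits = 0;
    stack[top++] = 0;  // Start at the root node.
    while (top > 0) {
        const kd_node& n = nodes[stack[--top]];
        if (!overlaps(n, q_lo, q_hi)) {
            continue;
        }
        if (n.left < 0) {  // Leaf: collect its spacepoints as candidates.
            for (unsigned int i = n.begin; i < n.end; ++i) {
                hits[n_hits++] = i;
            }
        } else {
            stack[top++] = n.left;
            stack[top++] = n.right;
        }
    }
    return n_hits;
}
```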

*
* @return True iff the pair is valid.
*/
__device__ bool is_valid_pair(const seedfinder_config finder_conf,
(Contributor) commented on the code above:

This function looks very similar to doublet_finding_helper::isCompatible. Is it possible to utilize it?
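Something along these lines, perhaps (a sketch only; the isCompatible argument list is assumed here, not taken from the actual helper):

```cpp
// Hypothetical sketch: delegate to the existing helper instead of
// duplicating the cut logic. The argument list of isCompatible is
// assumed and would need to match the real helper's signature.
__device__ bool is_valid_pair(const seedfinder_config finder_conf,
                              const internal_spacepoint<spacepoint>& lo,
                              const internal_spacepoint<spacepoint>& hi) {
    return doublet_finding_helper::isCompatible(lo, hi, finder_conf);
}
```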

@beomki-yeo (Contributor) commented on Sep 23, 2022

> Here is a plot including CPU seeding. Hope this is what you meant.

Ah yes, that is what I meant. Thanks for the additional plots!
I also observed a long time ago that the original CUDA seeding gets slower on high-end chips, which is very unsatisfactory. I guess it is not fully utilizing the GPU resources on the device.

@krasznaa (Member) left a comment:

I'm really keen on getting this one in, in a format that would later allow it to work with other languages (Kokkos, SYCL) as well. Algorithm-wise I don't really have any comments...

@@ -0,0 +1,24 @@
/** TRACCC library, part of the ACTS project (R&D line)
(Member) commented:

Does this need to be a public header? Should it not just live in the src/ directory?

(Member, Author) replied:

In principle, yes. I was in two minds about this because generalized acceleration structure construction code may be useful elsewhere, but we can put it in a private header as well.

@@ -0,0 +1,35 @@
/** TRACCC library, part of the ACTS project (R&D line)
(Member) commented:

I guess it's really just this header that strictly needs to be public, no? 🤔

std::array<uint32_t, 3> spacepoints;
float weight;

__host__ __device__ static internal_seed Invalid() {
(Member) commented:

Even if it's made into a private header, I would still prefer to see TRACCC_HOST_AND_DEVICE here.

(Member, Author) replied:

Agreed.
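For illustration, the suggested change on the declaration above would look roughly like this (header path assumed):

```cpp
// Sketch of the suggested change: traccc's portability macro in place of
// the raw __host__ __device__ qualifiers on the declaration inside
// internal_seed (body elided, as in the excerpt above).
#include "traccc/definitions/qualifiers.hpp"

TRACCC_HOST_AND_DEVICE static internal_seed Invalid();
```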

Comment on lines 37 to 61
std::vector<internal_spacepoint<spacepoint>> spacepoints;

for (std::size_t i = 0; i < p.get_items().size(); ++i) {
for (std::size_t j = 0; j < p.get_items().at(i).size(); ++j) {
const traccc::spacepoint& sp = p.get_items().at(i).at(j);

spacepoints.push_back(
internal_spacepoint<spacepoint>(p, {i, j}, {0.f, 0.f}));
}
}
(Member) commented:

Would it not be more efficient to pre-compute the number of elements, allocate enough memory for them in one go, and then not re-allocate all the time during the filling?
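Something like this, mirroring the loop above (a sketch):

```cpp
// Sketch: count the elements once, reserve the full size up front, and
// then fill the vector without intermediate re-allocations.
std::size_t n_sp = 0;
for (std::size_t i = 0; i < p.get_items().size(); ++i) {
    n_sp += p.get_items().at(i).size();
}

std::vector<internal_spacepoint<spacepoint>> spacepoints;
spacepoints.reserve(n_sp);

for (std::size_t i = 0; i < p.get_items().size(); ++i) {
    for (std::size_t j = 0; j < p.get_items().at(i).size(); ++j) {
        spacepoints.push_back(
            internal_spacepoint<spacepoint>(p, {i, j}, {0.f, 0.f}));
    }
}
```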

(Member, Author) replied:

In principle yes, but on the other hand this is just data pre-processing that would not really exist in any sort of production code, so it's not performance-relevant per se.

(Member) replied:

I don't understand. 😕 The container structures that we use in this project did not come out of thin air. We have the container layout for spacepoints that we have because in ATLAS we use such a layout. So such conversions may very well need to stay with us in the long term.

(Member, Author) replied:

I agree that these data conversions make some sense when moving data from the host to the device, but obviously such movements would be minimized in a production setting, and it makes little sense for data produced by another GPU algorithm to be in such a format. So I would consider this more of an edge case than something that would be executed with non-negligible frequency. 😕

Comment on lines 51 to 62
vecmem::unique_alloc_ptr<internal_spacepoint<spacepoint>[]>
spacepoints_device =
vecmem::make_unique_alloc<internal_spacepoint<spacepoint>[]>(
mr, n_sp, spacepoints.data(),
[](internal_spacepoint<spacepoint>* dst,
const internal_spacepoint<spacepoint>* src,
std::size_t bytes) {
CUDA_ERROR_CHECK(
cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
});

return {std::move(spacepoints_device), n_sp};
(Member) commented:

Why not a vecmem::data::vector_buffer<traccc::internal_spacepoint<traccc::spacepoint> >? Then you wouldn't need to invent a custom data type here.

(Member, Author) replied:

Not sure I understand the comment here; this is using vecmem data types, nothing new?

(Member) replied:

You're passing an "owning pointer" and a size in an std::tuple here. Instead of just using vecmem::data::vector_buffer, which is exactly meant for this type of data storage...
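For illustration, roughly like this (a sketch; the vecmem copy-helper calls are written from memory, so the exact signatures should be double-checked):

```cpp
#include <vecmem/containers/data/vector_buffer.hpp>
#include <vecmem/utils/cuda/copy.hpp>

// Sketch: store the spacepoints in a vector_buffer instead of a raw
// unique_alloc_ptr plus a size in a tuple.
vecmem::data::vector_buffer<internal_spacepoint<spacepoint>> sp_buffer(
    static_cast<unsigned int>(n_sp), mr);

vecmem::cuda::copy copy;
copy.setup(sp_buffer);
copy(vecmem::get_data(spacepoints), sp_buffer,
     vecmem::copy::type::host_to_device);

return sp_buffer;
```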

@stephenswat removed their assignment on Jan 11, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from 9268c6b to fd38be0 on February 16, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from d22ab31 to 5bc9530 on March 8, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from b5a9e4c to 969c0dd on March 16, 2023
@stephenswat marked this pull request as a draft on March 17, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from b461bb9 to a132f0b on March 17, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch 2 times, most recently from 5ad3ab2 to f7c2387 on March 26, 2023
@stephenswat force-pushed the feat/cuda_seeding_2 branch from f7c2387 to c79b097 on June 25, 2024
@stephenswat force-pushed the feat/cuda_seeding_2 branch from c79b097 to f3c0e13 on June 25, 2024