Alternative CUDA implementation of seed finding #230
base: main
Conversation
Force-pushed from a208c63 to 2799ec6
This pull request is now ready for review. This implementation avoids the additional space and compute time incurred by the counting kernels, and does not use binning. It's still a relatively simple implementation with lots of optimization left to be done, but the performance results are reasonable so far, competitive with the existing CUDA code. Here is the Nsight Systems profile for ⟨μ⟩ = 300:
Force-pushed from 2799ec6 to b9218a2
I will take a look at this tomorrow. BTW, do you also have a CPU benchmark?
Here is a plot including CPU seeding. Hope this is what you meant. 🙂 I should note that the benchmark for this new seeding includes the time to convert data from the traccc EDM to a flat data format, which is not really part of the actual execution time. That accounts for roughly one third to one half of the total wall-clock time measured here.
I don't have particular comments, but would like to check whether I understood the implementation correctly: the kd_tree is used to find the leaf cells satisfying the [r, phi, z] conditions for a given spacepoint, and the spacepoints from those cells are combined to make seeds?
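To make the question concrete, here is a minimal host-side sketch of the general idea as I understand it (this is not the PR's actual code; all names such as `point3` and `kd_tree` are hypothetical, and the real implementation runs on the GPU): spacepoints are stored in a k-d tree over (phi, r, z), and a box range query gathers the candidate spacepoints without any explicit binning.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical simplified spacepoint coordinates.
struct point3 {
    float phi, r, z;
    float coord(std::size_t i) const {
        return i == 0 ? phi : (i == 1 ? r : z);
    }
};

struct kd_node {
    point3 p;
    int left = -1;   // index of left child, -1 if none
    int right = -1;  // index of right child, -1 if none
};

struct kd_tree {
    std::vector<kd_node> nodes;
    int root = -1;

    // Build recursively: split the range on axis (depth % 3) around a
    // median-ish element chosen with nth_element.
    int build(std::vector<point3>& pts, std::size_t lo, std::size_t hi,
              std::size_t depth) {
        if (lo >= hi) return -1;
        const std::size_t axis = depth % 3;
        const std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(pts.begin() + lo, pts.begin() + mid,
                         pts.begin() + hi,
                         [axis](const point3& a, const point3& b) {
                             return a.coord(axis) < b.coord(axis);
                         });
        const int idx = static_cast<int>(nodes.size());
        nodes.push_back({pts[mid], -1, -1});
        const int lc = build(pts, lo, mid, depth + 1);
        const int rc = build(pts, mid + 1, hi, depth + 1);
        nodes[idx].left = lc;
        nodes[idx].right = rc;
        return idx;
    }

    // Collect all points inside the axis-aligned box [lo, hi], pruning
    // subtrees that cannot intersect it.
    void query(int n, std::size_t depth, const std::array<float, 3>& lo,
               const std::array<float, 3>& hi,
               std::vector<point3>& out) const {
        if (n < 0) return;
        const point3& p = nodes[n].p;
        bool inside = true;
        for (std::size_t i = 0; i < 3; ++i)
            if (p.coord(i) < lo[i] || p.coord(i) > hi[i]) inside = false;
        if (inside) out.push_back(p);
        const std::size_t axis = depth % 3;
        if (lo[axis] <= p.coord(axis)) query(nodes[n].left, depth + 1, lo, hi, out);
        if (hi[axis] >= p.coord(axis)) query(nodes[n].right, depth + 1, lo, hi, out);
    }
};
```

The pruning step is where the approach saves work compared with scanning all spacepoints: only subtrees whose splitting value overlaps the search box are visited.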
 *
 * @return True iff the pair is valid.
 */
__device__ bool is_valid_pair(const seedfinder_config finder_conf,
This function looks very similar to doublet_finding_helper::isCompatible. Is it possible to utilize it?
Ah yes, that is what I meant. Thanks for the additional plots!
I'm really keen on getting this one in, in a format that would later allow it to work with other languages (Kokkos, SYCL) as well. Algorithm-wise I don't really have a comment...
@@ -0,0 +1,24 @@
/** TRACCC library, part of the ACTS project (R&D line)
Does this need to be a public header? Should it not just live in the src/ directory?
In principle yes; I was in two minds about this, because generalized accelerator structure construction code may be useful elsewhere, but we can also put it in a private header.
@@ -0,0 +1,35 @@
/** TRACCC library, part of the ACTS project (R&D line)
I guess it's really just this header that strictly needs to be public, no? 🤔
std::array<uint32_t, 3> spacepoints;
float weight;

__host__ __device__ static internal_seed Invalid() {
Even if it's made into a private header, I would still prefer to see TRACCC_HOST_AND_DEVICE here.
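For context, a portability macro like TRACCC_HOST_AND_DEVICE typically expands along the following lines (a hedged sketch; the actual definition in traccc may differ, and `internal_seed_example` is a hypothetical stand-in for the PR's struct). Under nvcc the function is compiled for both host and device; under a plain host compiler the qualifiers vanish, so the same header stays usable from non-CUDA translation units.

```cpp
// Sketch of the host/device portability macro pattern; the real traccc
// definition may differ in detail.
#if defined(__CUDACC__)
#define TRACCC_HOST_AND_DEVICE __host__ __device__
#else
#define TRACCC_HOST_AND_DEVICE
#endif

// Hypothetical stand-in for the struct under review.
struct internal_seed_example {
    float weight = 0.f;

    TRACCC_HOST_AND_DEVICE
    static internal_seed_example invalid() {
        internal_seed_example s;
        s.weight = -1.f;  // sentinel value marking an invalid seed
        return s;
    }
};
```

This is why the macro is preferable to raw `__host__ __device__`: the bare qualifiers are a compilation error outside of CUDA.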
Agreed.
std::vector<internal_spacepoint<spacepoint>> spacepoints;

for (std::size_t i = 0; i < p.get_items().size(); ++i) {
    for (std::size_t j = 0; j < p.get_items().at(i).size(); ++j) {
        const traccc::spacepoint& sp = p.get_items().at(i).at(j);

        spacepoints.push_back(
            internal_spacepoint<spacepoint>(p, {i, j}, {0.f, 0.f}));
    }
}
Would it not be more efficient to pre-compute the number of elements, allocate enough memory for them in one go, and then not re-allocate all the time during the filling?
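A minimal sketch of the suggested pattern, using plain ints in place of `internal_spacepoint` objects and a vector-of-vectors in place of the traccc container (`flatten` is a hypothetical helper, not code from the PR):

```cpp
#include <cstddef>
#include <vector>

// Count the elements first, then reserve() once, so that the subsequent
// push_back() calls never trigger a reallocation.
std::vector<int> flatten(const std::vector<std::vector<int>>& items) {
    std::size_t total = 0;
    for (const auto& inner : items) total += inner.size();

    std::vector<int> flat;
    flat.reserve(total);  // single allocation up front
    for (const auto& inner : items)
        for (int v : inner) flat.push_back(v);
    return flat;
}
```

Without the reserve() call, growing the vector element by element costs amortized O(1) per push but still performs O(log n) reallocations and copies of the whole buffer.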
In principle yes, but on the other hand this is just data pre-processing that would not really exist in any sort of production code, so it's not performance-relevant per se.
I don't understand. 😕 The container structures that we use in this project did not come out of thin air. We have the container layout for spacepoints that we have because in ATLAS we use such a layout. So such conversions may very well need to stay with us in the long term.
I agree that these data conversions make some sense when moving data from the host to the device, but obviously such movements would be minimized in a production setting, and it makes little sense for data produced by another GPU algorithm to be in such a format. So I would consider this more of an edge case than something that would be executed with non-negligible frequency. 😕
vecmem::unique_alloc_ptr<internal_spacepoint<spacepoint>[]>
    spacepoints_device =
        vecmem::make_unique_alloc<internal_spacepoint<spacepoint>[]>(
            mr, n_sp, spacepoints.data(),
            [](internal_spacepoint<spacepoint>* dst,
               const internal_spacepoint<spacepoint>* src,
               std::size_t bytes) {
                CUDA_ERROR_CHECK(
                    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
            });

return {std::move(spacepoints_device), n_sp};
Why not a vecmem::data::vector_buffer<traccc::internal_spacepoint<traccc::spacepoint> >? Then you wouldn't need to invent a custom data type here.
Not sure I understand the comment here; this is using vecmem data types, nothing new?
You're passing an "owning pointer" and a size in an std::tuple here, instead of just using vecmem::data::vector_buffer, which is exactly meant for this type of data storage...
Force-pushed from 9268c6b to fd38be0
Force-pushed from d22ab31 to 5bc9530
Force-pushed from b5a9e4c to 969c0dd
Force-pushed from b461bb9 to a132f0b
Force-pushed from 5ad3ab2 to f7c2387
Force-pushed from f7c2387 to c79b097
Force-pushed from c79b097 to f3c0e13
This is a work-in-progress CUDA implementation of an alternative seed finding algorithm based on dynamic range searching.