Speed up Patatrack CA #34250
Conversation
@cmsbuild, please test
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34250/23527
A new Pull Request was created by @VinInn (Vincenzo Innocente) for master.
It involves the following packages:
HeterogeneousCore/CUDAUtilities
@perrotta, @makortel, @fwyzard, @jpata, @slava77 can you please review it and eventually sign? Thanks.
cms-bot commands are listed here
@VinInn I am watching the whole cmssw, no need to tag me explicitly.
@cmsbuild, enable GPU
enable gpu
@smuzaffar
No @slava77, the bot does not react to these
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16241/summary.html
Comparison Summary:
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16245/summary.html
GPU Comparison Summary:
Comparison Summary:
template <typename T1, typename T2>
T1 atomicCAS_block(T1* address, T1 compare, T2 val) {
  return atomicCAS(address, compare, val);
For my education: what do this and the other _block methods in this PR do, especially in the context of speeding things up?
They restrict the scope of the atomic operations to a single CUDA block.
For "shared" memory this is always the case.
For "global" memory this may speed up the operation, for example because the caches of other blocks do not have to be considered.
See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions for more details.
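As a rough illustration (not code from this PR; the kernel and variable names below are made up), a block-scoped atomic is enough whenever every thread that updates a given address belongs to the same block. Note that the `*_block` variants require compute capability 6.0 or newer:

```cuda
#include <cstdint>

// Each block owns exactly one slot of "counters", so only threads of that
// block ever update it: a block-scoped atomic is sufficient, and the device
// does not have to make the update visible to other blocks.
__global__ void countPerBlock(int32_t const* data, uint32_t n, uint32_t* counters) {
  auto first = blockIdx.x * blockDim.x + threadIdx.x;
  for (auto i = first; i < n; i += gridDim.x * blockDim.x) {
    if (data[i] > 0) {
#if __CUDA_ARCH__ >= 600
      atomicAdd_block(&counters[blockIdx.x], 1u);  // scoped to this block only
#else
      atomicAdd(&counters[blockIdx.x], 1u);        // device-wide fallback on older GPUs
#endif
    }
  }
}
```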
GPU tracking is known NOT to be fully reproducible, for a variety of reasons that have been reported earlier in various meetings. (CPU tracking would also not be reproducible if the "order" of seeds/iterations were to change, say because of high-granularity threading.)
OK, thanks for clarifying. It was not particularly obvious that "no regression expected. no regression observed" was not a literal statement.
+reconstruction
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
@@ -218,12 +218,13 @@ namespace gpuClustering {
          auto l = nn[k][kk];
          auto m = l + firstPixel;
          assert(m != i);
-         auto old = atomicMin(&clusterId[m], clusterId[i]);
+         auto old = atomicMin_block(&clusterId[m], clusterId[i]);
clusterId is in global memory, but "m" (and "i") below refer to digis on the module that is reconstructed in this block (one module per block, one block per module).
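In other words (a hypothetical sketch with invented names, not the PR's code): even though the array lives in global memory, the one-module-per-block mapping guarantees that no other block can race on the entries this block touches, so the block-scoped atomic is safe:

```cuda
#include <cstdint>

// clusterIds lives in global memory, but with one module per block every
// index "m" and "i" handled here belongs to digis of the module owned by
// this block; no other block touches these entries, so a block-scoped
// atomicMin is enough to propagate the smallest cluster id.
__device__ void propagateMinClusterId(int32_t* clusterIds, uint32_t m, uint32_t i) {
  atomicMin_block(&clusterIds[m], clusterIds[i]);
}
```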
@@ -350,7 +350,9 @@ __global__ void kernel_find_ntuplets(GPUCACell::Hits const *__restrict__ hhp,
    auto const &thisCell = cells[idx];
    if (thisCell.isKilled())
      continue;  // cut by earlyFishbone

+   // we require at least three hits...
+   if (thisCell.outerNeighbors().empty())
This simply avoids the whole lot below (in particular the function call, which, being recursive, cannot be inlined).
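A hedged sketch of the pattern (made-up types and names, not the actual GPUCACell code): the emptiness check costs one load, while the recursive extension it guards cannot be inlined, so skipping it for dead-end doublets saves the call overhead and everything behind it:

```cuda
struct CellSketch {
  bool killed;
  int firstOuterNeighbor;  // -1 when the cell has no outer neighbour
  __device__ bool isKilled() const { return killed; }
  __device__ bool outerNeighborsEmpty() const { return firstOuterNeighbor < 0; }
};

// Stand-in for the real depth-first extension: being recursive it cannot be
// inlined, so even a trivial invocation pays the function-call overhead.
__device__ int extendChain(CellSketch const* cells, int idx) {
  if (cells[idx].outerNeighborsEmpty())
    return 1;
  return 1 + extendChain(cells, cells[idx].firstOuterNeighbor);
}

__global__ void kernelFindNtupletsSketch(CellSketch const* cells, int nCells, int* depth) {
  for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < nCells;
       idx += gridDim.x * blockDim.x) {
    auto const& thisCell = cells[idx];
    if (thisCell.isKilled())
      continue;  // cut by earlyFishbone
    // a doublet with no outer neighbour can never give >= 3 hits:
    // skip the (non-inlinable) recursive call and everything below it
    if (thisCell.outerNeighborsEmpty())
      continue;
    depth[idx] = extendChain(cells, idx);
  }
}
```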
@@ -63,8 +63,8 @@ namespace gpuVertexFinder {
      assert(iv[i] >= 0);
      assert(iv[i] < int(foundClusters));
      auto w = 1.f / ezt2[i];
-     atomicAdd(&zv[iv[i]], zt[i] * w);
-     atomicAdd(&wv[iv[i]], w);
+     atomicAdd_block(&zv[iv[i]], zt[i] * w);
vertex reconstruction is performed in a single block.
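For illustration, a minimal single-block sketch (a hypothetical kernel, not the cmssw one) of why block-scoped atomics suffice here: all tracks are accumulated by threads of the same block, which then normalises the sums. It assumes zv and wv are zero-initialised and the kernel is launched with a single block:

```cuda
// Accumulate the error-weighted z of the tracks assigned to each vertex and
// the corresponding sum of weights, then turn the sums into weighted means.
// Correct only when launched with <<<1, nThreads>>>: block-scoped atomics and
// __syncthreads() then cover all participating threads.
__global__ void weightedVertexZ(float const* zt, float const* ezt2, int const* iv,
                                int nTracks, float* zv, float* wv, int nVertices) {
  for (int i = threadIdx.x; i < nTracks; i += blockDim.x) {
    if (iv[i] < 0 || iv[i] >= nVertices)
      continue;                        // track not associated to any vertex
    float w = 1.f / ezt2[i];           // weight = 1 / sigma_z^2
    atomicAdd_block(&zv[iv[i]], zt[i] * w);
    atomicAdd_block(&wv[iv[i]], w);
  }
  __syncthreads();
  for (int k = threadIdx.x; k < nVertices; k += blockDim.x)
    if (wv[k] > 0.f)
      zv[k] /= wv[k];                  // weighted mean z of the vertex
}
```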
@@ -46,7 +46,7 @@ namespace gpuVertexFinder {
    for (auto i = threadIdx.x; i < nt; i += blockDim.x) {
      if (iv[i] > 9990)
        continue;
-     atomicAdd(&ptv2[iv[i]], ptt2[i]);
+     atomicAdd_block(&ptv2[iv[i]], ptt2[i]);
vertex sorting is performed in a single block.
+1
Throughput (triplets on T4):

| | Throughput |
| --- | --- |
| release | 702.0 ± 1.6 ev/s |
| this PR | 752.4 ± 1.4 ev/s |
| only the first commit | 750.5 ± 1.2 ev/s |
Purely technical: no regression expected, no regression observed.