Fix benchmarks #2628
Conversation
With the benchmark library, there is no need to specify an iteration count. Interestingly, 4x4 matrix multiplication is faster than affine * matrix multiplication.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #2628      +/-   ##
==========================================
+ Coverage   50.99%   51.10%    +0.12%
==========================================
  Files         387      393        +6
  Lines       32308    32753      +445
==========================================
+ Hits        16473    16736      +263
- Misses      15835    16017      +182

☔ View full report in Codecov by Sentry.
Actually allocate the given number of states instead of just one.
Force-pushed from a3feae5 to 708b3e5
Hey Robert, thanks for making the benchmark more realistic and cleaning it up, this is a great change.
I don't think it changes much about performance. We reported a slowdown of 4x in #2546, which is worse than what we see here. I understand it's a few more ms in the end, but we think it's a price worth paying for the additional safety. We also think we can recover these losses in many other ways, like your own moveit/moveit#3548, what we did with Jacobians in #2389, and many other improvements we can do. We should be able to have both performant and bullet-proof core types. But to reach that point we need good benchmarks and tests, so it's great that you made this benchmark better and are already using it to drive improvements. Thank you!
benchmark::DoNotOptimize(result = isometry.affine() * isometry.matrix());
benchmark::ClobberMemory();
}
benchmark::DoNotOptimize(result.affine().noalias() = isometry.affine() * isometry.matrix());
I kept the internal loop you had with MATRIX_OPS_N_ITERATIONS to get more significant differences. Otherwise this is in the order of ~10ns and the noise makes it harder to appreciate the difference. I'm fine either way, feel free to leave as is.
In my old code, multiple iterations were required for some statistical significance. However, the Google benchmark library automatically iterates a to-be-timed code snippet as often as needed to reach statistical significance. Hence, manual iteration is no longer necessary.
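For reference, this is the pattern the library expects (a minimal, self-contained sketch; the timed operation is just an example):

#include <benchmark/benchmark.h>
#include <Eigen/Geometry>

static void multiplyIsometryTimesIsometry(benchmark::State& st)
{
  Eigen::Isometry3d isometry = Eigen::Isometry3d::Identity();
  Eigen::Isometry3d result;
  // The library runs this loop as many times as needed for stable timings;
  // no manual iteration count is involved.
  for (auto _ : st)
  {
    benchmark::DoNotOptimize(result = isometry * isometry);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(multiplyIsometryTimesIsometry);
BENCHMARK_MAIN();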
BENCHMARK(multiplyMatrixTimesMatrixNoAlias);
BENCHMARK(multiplyIsometryTimesIsometryNoAlias);
BENCHMARK(multiplyMatrixTimesMatrix);
BENCHMARK(multiplyIsometryTimesIsometry);
I would include back the multiplyAffineTimesMatrix benchmark, for completeness. We would then have the alias and noalias versions of all three.
Also, what do you think about interleaving these, so that the alias and noalias versions appear next to each other and are easier to compare?
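Interleaved, the registrations could look like this (a sketch; the NoAlias affine variant is assumed to exist once multiplyAffineTimesMatrix is re-added):

// Alias and noalias variants side by side, so they end up adjacent in the report.
BENCHMARK(multiplyMatrixTimesMatrix);
BENCHMARK(multiplyMatrixTimesMatrixNoAlias);
BENCHMARK(multiplyIsometryTimesIsometry);
BENCHMARK(multiplyIsometryTimesIsometryNoAlias);
BENCHMARK(multiplyAffineTimesMatrix);
BENCHMARK(multiplyAffineTimesMatrixNoAlias);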
I just tested different variants. It's quite obvious that the NoAlias versions are faster.
Good to know about the .noalias(). Unfortunately the default behavior is much more frequent in code, so it would be great to keep both versions of the three multiply benchmarks.
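For context, what .noalias() changes in Eigen (a sketch):

Eigen::Matrix4d a = Eigen::Matrix4d::Random();
Eigen::Matrix4d b = Eigen::Matrix4d::Random();
Eigen::Matrix4d result;
// Default: Eigen must assume result may alias a or b, so it evaluates the
// product into a temporary and copies it into result afterwards.
result = a * b;
// noalias(): the caller guarantees result does not overlap the operands,
// so Eigen writes the product directly into result and skips the temporary.
result.noalias() = a * b;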
// Robot and planning group for benchmarks.
constexpr char PANDA_TEST_ROBOT[] = "panda";
constexpr char PANDA_TEST_GROUP[] = "panda_arm";
constexpr char PR2_TEST_ROBOT[] = "pr2";
I think it makes sense to use the pr2 for the update() benchmark, because it has way more joints than the panda, and different types of joints (prismatic, revolute, fixed), so it will exercise the update() call much more.
Feel free to change it. From my point of view it doesn't matter whether we call update() more often or have a longer-running update() call.
It also includes different joint types, and a more complex tree structure in general, so I think it makes sense to keep the PR2 test. But I'm happy to change this in a follow-up PR.
affine.matrix() = isometry.matrix();

Eigen::Affine3d affine = createTestIsometry();
Eigen::Affine3d result;
for (auto _ : st)
I wonder if there's any compiler optimization happening here and affecting the results. What do you think?
I'm surprised to see that the inverse of a Matrix4d is the fastest with this PR. Originally it was the slowest, which matched your original benchmark too.
This is what I get with this PR:
inverseIsometry3d 16.9 ns 16.9 ns
inverseAffineIsometry 18.6 ns 18.6 ns
inverseAffine 31.4 ns 31.4 ns
inverseMatrix4d 10.5 ns 10.5 ns
And this is before this change, and I believe with your original benchmark too:
InverseIsometry3d/10000000 107 ms 107 ms
InverseAffineIsometry/10000000 23.3 ms 23.3 ms
InverseAffine/10000000 52.7 ms 52.7 ms
InverseMatrix4d/10000000 109 ms 109 ms
Yes, I think that there is some compiler optimization going on. However, I didn't manage to fix that using various variants of DoNotOptimize. Eventually I had to move on to other stuff...
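One more variant that sometimes helps (untested here, just a sketch): also clobber the input, so the compiler cannot treat it as a loop-invariant constant and hoist the inverse out of the loop.

static void inverseMatrix4d(benchmark::State& st)
{
  Eigen::Matrix4d matrix = createTestIsometry().matrix();
  Eigen::Matrix4d result;
  for (auto _ : st)
  {
    benchmark::DoNotOptimize(matrix);  // input is "unknown" in every iteration
    benchmark::DoNotOptimize(result = matrix.inverse());
    benchmark::ClobberMemory();  // the store to result counts as observable
  }
}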
What about keeping the local variable and the internal loop?
It is very suspicious that the inverse of a Matrix4d is way cheaper than the affine inverse.
The internal loop doesn't contribute anything when using Google's benchmark lib, which does exactly that: it performs the same operation in a loop for good statistics.
I can imagine that the affine inverse is slower because it cannot optimize memory access (it needs to skip the entries of the last, fixed matrix row).
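For reference, the four variants being compared map to different amounts of structure Eigen can exploit (a sketch based on the benchmark names):

Eigen::Isometry3d isometry = createTestIsometry();
Eigen::Affine3d affine;
affine.matrix() = isometry.matrix();
// Isometry mode: inverse is a rotation transpose plus a translation flip.
Eigen::Isometry3d inv1 = isometry.inverse();
// Affine storage, but the caller promises the transform is an isometry.
Eigen::Affine3d inv2 = affine.inverse(Eigen::Isometry);
// General affine inverse: the 3x3 linear part is actually inverted.
Eigen::Affine3d inv3 = affine.inverse(Eigen::Affine);
// Plain 4x4 inverse, ignoring all structure.
Eigen::Matrix4d inv4 = isometry.matrix().inverse();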
for (size_t i = 0; i < num; i++)
  result.push_back(i);

std::random_device random_device;
What do you think about using a deterministic random sequence instead?
Not sure if it will change anything, but at least it would create more deterministic conditions.
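A deterministic setup could be as simple as replacing the std::random_device seed with a constant (a sketch; the distribution range is just an example):

#include <random>

// Fixed seed: every benchmark run draws the same sequence of joint values,
// which removes one source of run-to-run variation.
std::mt19937 random_engine(42);
std::uniform_real_distribution<double> distribution(-3.14, 3.14);
const double joint_value = distribution(random_engine);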
BENCHMARK(robotStateConstruct)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateCopy)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateUpdate)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateForwardKinematics)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
See my other comment. I think it still makes sense to keep this one, but maybe rename to robotStateGetGlobalLinkTransform, since that's what it calls?
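The renamed benchmark could look roughly like this (a sketch; loadTestingRobotModel is MoveIt's test helper, and the tip link name is an assumption):

static void robotStateGetGlobalLinkTransform(benchmark::State& st)
{
  const moveit::core::RobotModelPtr robot_model = moveit::core::loadTestingRobotModel(PANDA_TEST_ROBOT);
  moveit::core::RobotState state(robot_model);
  for (auto _ : st)
  {
    state.setToRandomPositions();  // dirties the link transforms
    // getGlobalLinkTransform() triggers the forward-kinematics update.
    benchmark::DoNotOptimize(state.getGlobalLinkTransform("panda_link8"));
  }
}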
Address review comments
Co-authored-by: Mario Prats <[email protected]>
Force-pushed from 512e182 to 5da908f
Is there anything more to do for this to get merged, @marioprats?
I think we can merge this, thanks Robert!
I had a closer look at the benchmarks presented in #2546 today. I noticed that the increasing number of states was not allocated simultaneously, but sequentially (thus actually just freeing and re-allocating the same memory blocks).
This PR changes this to true parallel allocation, which mimics the real behavior in a planner more closely.
This and the additional changes of #2617 yield a much stronger relative slowdown than what was shown in #2546:
not only are allocations affected, but primarily also state updates, which occur much more frequently.
There is also a noticeable non-linearity when scaling up the number of parallel states, which is probably due to more frequent cache misses.
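In code, the difference is roughly this (a sketch, assuming a MoveIt RobotState and robot model):

// Sequential: each state is destroyed before the next one is created,
// so the allocator keeps handing back the same (cache-warm) memory blocks.
for (int i = 0; i < num_states; ++i)
{
  moveit::core::RobotState state(robot_model);
  state.setToRandomPositions();
  state.update();
}

// Parallel: all states are alive simultaneously, as they would be in a
// sampling-based planner, spreading the working set across the heap.
std::vector<moveit::core::RobotState> states;
states.reserve(num_states);
for (int i = 0; i < num_states; ++i)
{
  states.emplace_back(robot_model);
  states.back().setToRandomPositions();
  states.back().update();
}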