Scalable counter #179
base: master
Conversation
Hi. I would like to propose a new implementation of Counter to make it scalable.
Here are some performance measurements with 4 threads and with 16 threads; the measurements also cover the single-threaded performance. Comparing the implementations, the work is shifted to the Collect function, which now takes longer. But I think this is OK, because collecting/scraping the counters shouldn't happen at high frequency. The Collect function can be made faster by reducing the size of the array in counter.h. The size of the array determines the number of supported concurrent threads and could be made a template parameter. What do you think?
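The design described above can be sketched roughly like this (a hedged illustration, not the PR's actual code; `ShardedCounter`, `kMaxThreads`, and `ThreadSlot` are made-up names). Note how `Collect()` must walk the whole array, which is why its cost grows with the array size:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumed array size; in the PR this bounds the number of concurrent threads.
constexpr std::size_t kMaxThreads = 64;

class ShardedCounter {
 public:
  void Increment(double v) {
    std::atomic<double>& slot = slots_[ThreadSlot()].v;
    // Only this thread writes its slot, so a relaxed load/store pair suffices.
    slot.store(slot.load(std::memory_order_relaxed) + v,
               std::memory_order_relaxed);
  }

  double Collect() const {
    // Collection sums every slot; cost is proportional to kMaxThreads.
    double sum = 0.0;
    for (const auto& s : slots_) sum += s.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  static std::size_t ThreadSlot() {
    static std::atomic<std::size_t> next{0};
    // Each thread claims one slot on first use. Wrapping with modulo is a
    // simplification for this sketch; the PR terminates instead when the
    // array is exhausted.
    thread_local std::size_t slot = next.fetch_add(1) % kMaxThreads;
    return slot;
  }

  struct alignas(64) Slot {  // pad to a cache line to avoid false sharing
    std::atomic<double> v{0.0};
  };
  std::array<Slot, kMaxThreads> slots_;
};
```

Making `kMaxThreads` a template parameter, as suggested, would let users trade collection cost against the supported thread count.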
Regarding the implementation:
It bothered me a little that the trivial implementation is still a factor of 2 faster than this one. I found out that the reason is solely the use of a per-thread counter array: the index into the array is only known at run time, and this indexing causes the performance loss. But I think we should stick with the array, because other implementations would involve something like a thread registration by the user. So I think the array is a good compromise, and the performance is nevertheless great.
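For comparison, the registration-based alternative alluded to above could look roughly like this (all names are illustrative; a sketch, not a proposal). Each thread registers a heap-allocated slot once and afterwards increments through a pointer that needs no run-time array indexing; the extra machinery is exactly the registration burden mentioned:

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Global registry of per-thread slots (single-counter sketch).
std::mutex g_mu;
std::vector<std::unique_ptr<std::atomic<double>>> g_slots;

std::atomic<double>& MySlot() {
  // First call from each thread registers a new slot under the lock;
  // subsequent calls return the cached thread-local pointer directly.
  thread_local std::atomic<double>* slot = [] {
    std::lock_guard<std::mutex> lock(g_mu);
    g_slots.push_back(std::make_unique<std::atomic<double>>(0.0));
    return g_slots.back().get();
  }();
  return *slot;
}

void Increment(double v) {
  auto& s = MySlot();
  // Single writer per slot: relaxed load/store is enough.
  s.store(s.load(std::memory_order_relaxed) + v, std::memory_order_relaxed);
}

double Collect() {
  std::lock_guard<std::mutex> lock(g_mu);
  double sum = 0.0;
  for (const auto& s : g_slots) sum += s->load(std::memory_order_relaxed);
  return sum;
}
```

The hot path avoids the array indexing, but slots accumulate for every thread that ever touched the counter, and cleanup on thread exit would need yet more bookkeeping.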
Implementation is based on chapter 5.2.2 of Paul E. McKenney (2017), "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
Sorry, it took a while until I found some time to look into this PR. I have two major comments:
Thanks,
I just realized that code with collisions won't work with:
Thanks for thinking about this. My general thoughts:
The current implementation strikes a balance between memory usage and runtime performance. This change buys some (4x) runtime performance in situations with contention on a counter, at the cost of a large (256x) memory overhead. That might be a really good solution for some workloads, but not a good one for others (think embedded devices). Making that change the default for everybody assumes that everybody is OK with paying the memory cost because they gain performance.
I don't think that assumption is valid. Very few people will have the problem of counters being too slow for them. In addition, the Counter implementation is used in Histogram as well, so the change would have a large impact on histograms' memory footprint.
In general, buying metrics-observation performance with collection performance is very much in the spirit of this library. I think @gjasny made the right suggestion in extracting this into its own class and making it opt-in instead of changing the default behavior. An even better solution would be to combine this with the current behavior and let people make trade-off choices:
- make the thread limit configurable at runtime or compile time
- group threads into hash buckets, busy-wait inside a bucket like the current implementation, and increase performance by decreasing the likelihood of contention
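The hash-bucket idea could be sketched as follows (a rough illustration; `StripedCounter` and `kBuckets` are made-up names): threads hash into a configurable number of striped slots, and threads that collide in a slot fall back to a compare-exchange loop like the current implementation, so memory overhead is bounded by the bucket count while contention drops as buckets are added:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

template <std::size_t kBuckets>
class StripedCounter {
 public:
  void Increment(double v) {
    std::atomic<double>& slot = buckets_[BucketIndex()].v;
    double old = slot.load(std::memory_order_relaxed);
    // Busy-wait CAS loop, as in the existing Counter implementation;
    // only threads hashed to the same bucket contend here.
    while (!slot.compare_exchange_weak(old, old + v,
                                       std::memory_order_relaxed)) {
    }
  }

  double Value() const {
    double sum = 0.0;
    for (const auto& b : buckets_) sum += b.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  static std::size_t BucketIndex() {
    // Hash the thread id into a bucket; collisions are handled by the CAS loop.
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
  }
  struct alignas(64) Bucket {  // one cache line per bucket
    std::atomic<double> v{0.0};
  };
  Bucket buckets_[kBuckets];
};
```

With `kBuckets` as a compile-time (or, with a dynamic array, runtime) knob, users could pick their own point on the memory/contention curve.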
const int id{count_.fetch_add(1)};
if (id >= per_thread_counter_.size()) {
  std::terminate();
}
You could initialize element zero of per_thread_counter_ and use it if the thread count goes beyond the array size. But that does not work with the load/store scheme. Why aren't you using fetch_add with a relaxed memory order?
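One likely reason for the load/store scheme rather than fetch_add: std::atomic&lt;double&gt; only gained fetch_add in C++20, so on earlier standards the portable way to add to a shared atomic double is a compare-exchange loop, sketched below (`AtomicAdd` is an illustrative helper, not part of the PR). The PR's plain load/store is only safe because each array slot has exactly one writer thread:

```cpp
#include <atomic>
#include <cassert>

// Portable pre-C++20 atomic add for doubles: retry until the CAS succeeds.
double AtomicAdd(std::atomic<double>& target, double v) {
  double old = target.load(std::memory_order_relaxed);
  double desired;
  do {
    desired = old + v;
    // On failure, compare_exchange_weak reloads `old` with the current
    // value, so the next iteration retries against fresh state.
  } while (!target.compare_exchange_weak(old, desired,
                                         std::memory_order_relaxed));
  return desired;
}
```

This loop would make slots safe for multiple writers, at the price of the CAS contention the per-thread array was introduced to avoid.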
void IncrementUnchecked(const double v) {
  CacheLine& c = per_thread_counter_[ThreadId()];
  const double new_value{c.v.load(std::memory_order_relaxed) + v};
Is it possible to remove the load there? We could keep a second counter (e.g. CacheLine.localV) that is not atomic (and does not need to be, as it is only accessed from its own thread). Only the store needs to be atomic, right?
(Just passing by. Thanks for the book recommendation!)
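The suggestion could look like this (a sketch using the commenter's hypothetical `localV` field; not the PR's actual code). The owner thread accumulates into a plain double and only the store that publishes the value to the collecting thread is atomic:

```cpp
#include <atomic>
#include <cassert>

struct CacheLine {
  double localV{0.0};          // written and read by the owner thread only
  std::atomic<double> v{0.0};  // read by the collecting thread
};

inline void IncrementUnchecked(CacheLine& c, double value) {
  c.localV += value;  // no atomic load needed on the hot path
  c.v.store(c.localV, std::memory_order_relaxed);  // publish for Collect()
}
```

This removes one relaxed load per increment; whether that is measurable would need benchmarking, since a relaxed load of a value this thread just wrote is typically an L1 hit.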