
Scalable counter #179 (Open)

jerryct wants to merge 3 commits into base: master from scalable_counter
Conversation

jerryct (Contributor) commented Oct 23, 2018

No description provided.

jerryct (Contributor, Author) commented Oct 23, 2018

Hi. I would like to propose a new implementation of Counter that is

  • scalable under multi-threading
  • roughly as fast single-threaded as an implementation with no synchronisation at all

Here are some performance measurements:

4 threads

$ ./benchmarks --benchmark_filter='BM_(Counter|Gauge)'

Run on (4 X 3400 MHz CPU s)
2018-10-23 16:02:33
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_Increment                              3 ns          3 ns  262704325
BM_Counter/ConcurrentIncrement/threads:4          1 ns          3 ns  207105724
BM_Counter_Collect                              264 ns        264 ns    2599009
BM_Gauge_Increment                               14 ns         14 ns   50683315
BM_Gauge/ConcurrentIncrement/threads:4           85 ns        339 ns    2513412
BM_Gauge_Decrement                               14 ns         14 ns   51979167
BM_Gauge_SetToCurrentTime                        18 ns         18 ns   39151309
BM_Gauge_Collect                                  6 ns          6 ns  116305932

16 threads

Run on (16 X 3600 MHz CPU s)
2018-10-23 16:08:53
---------------------------------------------------------------------------------
Benchmark                                          Time           CPU Iterations
---------------------------------------------------------------------------------
BM_Counter_Increment                               3 ns          3 ns  278594326
BM_Counter/ConcurrentIncrement/threads:16          0 ns          3 ns  259262288
BM_Counter_Collect                               195 ns        195 ns    3618464
BM_Gauge_Increment                                14 ns         14 ns   50904293
BM_Gauge/ConcurrentIncrement/threads:16           92 ns       1471 ns     480976
BM_Gauge_Decrement                                14 ns         14 ns   52015530
BM_Gauge_SetToCurrentTime                         15 ns         15 ns   45556675
BM_Gauge_Collect                                   6 ns          6 ns  125183114

All BM_Counter.* measurements use the new implementation; the BM_Gauge.* measurements use the old one. In the following I compare these two implementations.

For single-threaded performance, BM_Counter_Increment is roughly 4 times faster than BM_Gauge_Increment. I also added a baseline, BM_Counter_IncrementBaseline, which is a single-threaded implementation with no synchronisation at all; the baseline is only roughly 2 times faster than the new implementation:

Run on (4 X 3400 MHz CPU s)
2018-10-23 18:14:19
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_IncrementBaseline                      2 ns          2 ns  571664704
BM_Counter_Increment                              3 ns          3 ns  259162781

Comparing BM_Counter/ConcurrentIncrement vs. BM_Gauge/ConcurrentIncrement shows how well the new implementation scales across multiple threads: on a 4-core machine it is roughly 100 times faster; on a 16-core machine, roughly 500 times faster.

The work is shifted into the Collect function, which now takes longer. I think this is acceptable, because collecting/scraping counters should not happen at high frequency. Collect can be made faster by reducing the size of the array in counter.h; the array size bounds the number of concurrent threads and could be made a template parameter.

What do you think?

jerryct (Contributor, Author) commented Oct 23, 2018

Regarding the implementation:

  • I intentionally implemented almost everything as inline functions in the header, because it made the benchmark faster (I can provide numbers if needed).
  • The per-thread counters are of type CacheLine to avoid false sharing between threads; this effect is measurable. The magic number 128 for the alignment could be a (template) parameter (128 bytes is the cache-line size on PowerPC).
  • The one caveat, which makes this slightly less convenient than the previous implementation, is that the per-thread counters live in a fixed-size array. I tried a std::vector, but that does not work without further synchronisation, which would make the whole change pointless. In effect, one Counter instantiation can only be used by 256 different threads over the lifetime of the Counter object. The 256 could be made a parameter, but in general I don't think this is a problem: I expect most systems to use something like a thread pool, so threads are reused instead of created all the time.

jerryct force-pushed the scalable_counter branch 5 times, most recently from 9f2bb20 to 0591e59, on October 24, 2018
gjasny requested a review from jupp0r on October 24, 2018
jerryct (Contributor, Author) commented Nov 1, 2018

It bothered me a little that the trivial implementation is still a factor of 2 faster than this one. I found that the reason is solely the per-thread counter array: the index into the array is only known at run time, and that indexing causes the performance loss. I still think we should stick with the array, because the alternatives would involve something like explicit thread registration by the user. So the array is a good compromise, and the performance is nevertheless great.

jerryct force-pushed the scalable_counter branch 3 times, most recently from 26b630e to bbc6360, on November 1, 2018
Implementation is based on chapter 5.2.2 of
Paul E. McKenney (2017), "Is Parallel Programming
Hard, And, If So, What Can You Do About It?"
gjasny (Collaborator) commented Nov 3, 2018

Sorry, it took a while until I found time to look into this PR. I have two major comments:

  1. AFAIR some platforms (mostly mobile) have problems with thread_local storage. I therefore would not make this Counter implementation the default one. Instead, I could see making this code available as a standalone ConcurrentCounter implementation.
  2. I don't like my daemons crashing just because a thread limit was reached. Instead I'd propose code like in c187436: I use std::hash to hash the thread id into a bucket id. Two busy threads may end up in the same bucket, but the number of threads is unbounded. If you find the probability of hash collisions unacceptable, the bucket-id retrieval function could be made a policy or similar. On my MacBook, replacing thread_local with always retrieving std::this_thread::get_id() was within measuring tolerance.

Thanks,
Gregor

gjasny (Collaborator) commented Nov 3, 2018

I just realized that code with collisions won't work with:

    const double new_value{c.v.load(std::memory_order_relaxed) + v};
    c.v.store(new_value, std::memory_order_relaxed);
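Combining the bucket hashing from c187436 with an atomic read-modify-write would make collisions safe. A sketch follows (illustrative names, not the actual commit; std::atomic&lt;double&gt;::fetch_add only exists since C++20, hence the compare-exchange loop):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Sketch: hash the thread id into a fixed set of buckets (unbounded
// thread count, possible collisions) and use a CAS loop so colliding
// threads never lose an update.
class BucketCounter {
 public:
  void Increment(double v = 1.0) {
    std::atomic<double>& slot = buckets_[BucketId()].v;
    double cur = slot.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads cur on failure, so the retry loop
    // always adds v exactly once, even under contention.
    while (!slot.compare_exchange_weak(cur, cur + v,
                                       std::memory_order_relaxed)) {
    }
  }

  double Value() const {
    double sum = 0.0;
    for (const auto& b : buckets_) sum += b.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  static constexpr std::size_t kBuckets = 64;  // illustrative size

  static std::size_t BucketId() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
  }

  struct alignas(128) Bucket {  // padded to avoid false sharing
    std::atomic<double> v{0.0};
  };
  std::array<Bucket, kBuckets> buckets_;
};
```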

jupp0r (Owner) left a comment
Thanks for thinking about this. My general thoughts:

The current implementation strikes a balance between memory usage and runtime performance. This change buys some (4x) runtime performance in situations with contention on a counter at a large (256x) memory overhead. That might be a really good trade for some workloads, but a bad one for others (think embedded devices). Making this the default for everybody assumes that everybody is willing to pay the memory cost for the performance gain.

I don't think that assumption is valid. Very few people will find counters too slow for them. In addition, the Counter implementation is used in Histogram as well, so the change would have a large impact on Histogram's memory footprint.

In general, buying metrics-observation performance with collection performance is very much in the spirit of this library. I think @gjasny made the right suggestion in extracting this into its own class and making it opt-in instead of changing the default behavior. An even better solution would combine both behaviors and let people make the tradeoff themselves:

  • make the thread limit configurable at runtime or compile time
  • group threads into hash buckets, busy-wait within a bucket like the current implementation does, and gain performance by decreasing the likelihood of contention
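A minimal sketch of the compile-time knob (hypothetical names, not code from the PR): the slot count becomes a template parameter, so embedded users can choose a small memory footprint while servers can choose many slots.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical sketch: the per-thread slot count as a template
// parameter, making the memory/performance tradeoff user-selectable.
template <std::size_t NumSlots = 256>
class SlottedCounter {
 public:
  void Add(std::size_t slot, double v) {
    // The caller (e.g. a thread-id hash) picks the slot; a relaxed CAS
    // loop keeps the add correct even if two threads share a slot.
    std::atomic<double>& a = slots_[slot % NumSlots].v;
    double cur = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(cur, cur + v,
                                    std::memory_order_relaxed)) {
    }
  }

  double Value() const {
    double sum = 0.0;
    for (const auto& s : slots_) sum += s.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  struct alignas(128) Slot {
    std::atomic<double> v{0.0};
  };
  std::array<Slot, NumSlots> slots_;
};
```

Usage might look like SlottedCounter&lt;4&gt; on an embedded target and SlottedCounter&lt;256&gt; on a big server, keeping one code path for both.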

    const int id{count_.fetch_add(1)};

    if (id >= per_thread_counter_.size()) {
      std::terminate();
Inline review comment:

You could initialize the zero element of per_thread_counter_ and fall back to it when the thread count exceeds the array size. But that does not work with the load/store scheme. Why aren't you using fetch_add with a relaxed memory order?


    void IncrementUnchecked(const double v) {
      CacheLine& c = per_thread_counter_[ThreadId()];
      const double new_value{c.v.load(std::memory_order_relaxed) + v};
Inline review comment:

Is it possible to remove the load there? We could keep a second counter (e.g. CacheLine.localV) that is not atomic (and does not need to be, as it is only accessed from this thread). Only the store needs to be atomic, right?

(just passing by. thanks for the book recommendation!)
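The suggestion above could look like this (CacheLine/localV are the reviewer's hypothetical names, not PR code): the owning thread accumulates into a plain double and publishes it with a single relaxed store, so the hot path never loads the atomic. This only stays correct while each slot has exactly one writer.

```cpp
#include <atomic>

// Sketch: a non-atomic shadow counter plus an atomic publish.
struct alignas(128) CacheLine {
  double localV{0.0};          // touched only by the owning thread
  std::atomic<double> v{0.0};  // read by the collecting thread
};

inline void IncrementUnchecked(CacheLine& c, double amount) {
  c.localV += amount;                              // plain, non-atomic add
  c.v.store(c.localV, std::memory_order_relaxed);  // publish for Collect
}
```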
