
Scalable counter #179 (Open)

jerryct wants to merge 3 commits into base: master from scalable_counter
Conversation

jerryct (Contributor) commented Oct 23, 2018

No description provided.

jerryct (Contributor, Author) commented Oct 23, 2018

Hi. I would like to propose a new implementation of Counter that is

  • scalable under multi-threading
  • roughly as fast single-threaded as an implementation with no synchronisation at all

Here are some performance measurements:

4 threads

$ ./benchmarks --benchmark_filter='BM_(Counter|Gauge)'

Run on (4 X 3400 MHz CPU s)
2018-10-23 16:02:33
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_Increment                              3 ns          3 ns  262704325
BM_Counter/ConcurrentIncrement/threads:4          1 ns          3 ns  207105724
BM_Counter_Collect                              264 ns        264 ns    2599009
BM_Gauge_Increment                               14 ns         14 ns   50683315
BM_Gauge/ConcurrentIncrement/threads:4           85 ns        339 ns    2513412
BM_Gauge_Decrement                               14 ns         14 ns   51979167
BM_Gauge_SetToCurrentTime                        18 ns         18 ns   39151309
BM_Gauge_Collect                                  6 ns          6 ns  116305932

16 threads

Run on (16 X 3600 MHz CPU s)
2018-10-23 16:08:53
---------------------------------------------------------------------------------
Benchmark                                          Time           CPU Iterations
---------------------------------------------------------------------------------
BM_Counter_Increment                               3 ns          3 ns  278594326
BM_Counter/ConcurrentIncrement/threads:16          0 ns          3 ns  259262288
BM_Counter_Collect                               195 ns        195 ns    3618464
BM_Gauge_Increment                                14 ns         14 ns   50904293
BM_Gauge/ConcurrentIncrement/threads:16           92 ns       1471 ns     480976
BM_Gauge_Decrement                                14 ns         14 ns   52015530
BM_Gauge_SetToCurrentTime                         15 ns         15 ns   45556675
BM_Gauge_Collect                                   6 ns          6 ns  125183114

All BM_Counter.* measurements use the new implementation; the BM_Gauge.* measurements use the old one. In the following I compare these two implementations.

For single-threaded performance, BM_Counter_Increment is roughly 4 times faster than BM_Gauge_Increment. I also added a baseline, BM_Counter_IncrementBaseline, which is a single-threaded implementation with no synchronisation at all; the baseline is only roughly 2 times faster than the new implementation:

Run on (4 X 3400 MHz CPU s)
2018-10-23 18:14:19
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_IncrementBaseline                      2 ns          2 ns  571664704
BM_Counter_Increment                              3 ns          3 ns  259162781

Comparing BM_Counter/ConcurrentIncrement vs. BM_Gauge/ConcurrentIncrement shows how well the new implementation scales across multiple threads: on a 4-core machine it is roughly 100 times faster; on a 16-core machine, roughly 500 times faster.

The work is shifted into the Collect function, which now takes longer. I think this is acceptable, because collecting/scraping counters should not happen at high frequency. Collect can be made faster by reducing the size of the array in counter.h; the array size bounds the number of concurrent threads and could be made a template parameter.

What do you think?

jerryct (Contributor, Author) commented Oct 23, 2018

Regarding the implementation:

  • I intentionally implemented almost everything as inline functions in the header, because it made the benchmark faster (I can provide numbers if needed).
  • The per-thread counters are of type CacheLine to avoid false sharing between threads; this effect is measurable. The magic number 128 for the alignment could be a (template) parameter (128 bytes is the cache-line size on PowerPC).
  • The one caveat, which makes this slightly less convenient than the previous implementation, is that the per-thread counters live in a fixed-size array. I tried a std::vector, but that does not work without further synchronisation, which would make the whole change pointless. In effect, one Counter instantiation can only be used by 256 different threads over the lifetime of the Counter object. The 256 could be made a parameter, but in general I don't think this is a problem: I expect most systems to use something like a thread pool, so threads are reused instead of created all the time.

jerryct force-pushed the scalable_counter branch 5 times, most recently from 9f2bb20 to 0591e59, on October 24, 2018
gjasny requested a review from jupp0r on October 24, 2018
jerryct (Contributor, Author) commented Nov 1, 2018

It bothered me a little that the trivial implementation is still a factor of 2 faster than this one. I found that the reason is solely the per-thread counter array: the index into the array is only known at run time, and that indexing causes the performance loss. I still think we should stick with the array, because the alternatives would involve something like explicit thread registration by the user. So the array is a good compromise, and the performance is nevertheless great.

jerryct force-pushed the scalable_counter branch 3 times, most recently from 26b630e to bbc6360, on November 1, 2018
Implementation is based on chapter 5.2.2 of
Paul E. McKenney (2017), "Is Parallel Programming
Hard, And, If So, What Can You Do About It?"
gjasny (Collaborator) commented Nov 3, 2018

Sorry, it took a while until I found time to look into this PR. I have two major comments:

  1. AFAIR some platforms (mostly mobile) have problems with thread_local storage. I therefore would not make this Counter implementation the default one. Instead, I could see making this code available as a standalone ConcurrentCounter implementation.
  2. I don't like my daemons crashing just because a thread limit was reached. Instead I'd propose code like in c187436: I use std::hash to hash the thread id into a bucket id. Two busy threads may end up in the same bucket, but the number of threads is unbounded. If you find the probability of hash collisions unacceptable, the bucket-id retrieval function could be made a policy or similar. On my MacBook, replacing thread_local with always retrieving std::this_thread::get_id() was within measuring tolerance.

Thanks,
Gregor

gjasny (Collaborator) commented Nov 3, 2018

I just realized that code with collisions won't work with:

    const double new_value{c.v.load(std::memory_order_relaxed) + v};
    c.v.store(new_value, std::memory_order_relaxed);
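Combining the bucket hashing from c187436 with an atomic read-modify-write would make collisions safe. A sketch follows (illustrative names, not the actual commit; std::atomic&lt;double&gt;::fetch_add only exists since C++20, hence the compare-exchange loop):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Sketch: hash the thread id into a fixed set of buckets (unbounded
// thread count, possible collisions) and use a CAS loop so colliding
// threads never lose an update.
class BucketCounter {
 public:
  void Increment(double v = 1.0) {
    std::atomic<double>& slot = buckets_[BucketId()].v;
    double cur = slot.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads cur on failure, so the retry loop
    // always adds v exactly once, even under contention.
    while (!slot.compare_exchange_weak(cur, cur + v,
                                       std::memory_order_relaxed)) {
    }
  }

  double Value() const {
    double sum = 0.0;
    for (const auto& b : buckets_) sum += b.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  static constexpr std::size_t kBuckets = 64;  // illustrative size

  static std::size_t BucketId() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
  }

  struct alignas(128) Bucket {  // padded to avoid false sharing
    std::atomic<double> v{0.0};
  };
  std::array<Bucket, kBuckets> buckets_;
};
```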

jupp0r (Owner) left a comment
Thanks for thinking about this. My general thoughts:

The current implementation strikes a balance between memory usage and runtime performance. This change buys some (4x) runtime performance in situations with contention on a counter at a large (256x) memory overhead. That might be a really good trade for some workloads, but a bad one for others (think embedded devices). Making this the default for everybody assumes that everybody is willing to pay the memory cost for the performance gain.

I don't think that assumption is valid. Very few people will find counters too slow for them. In addition, the Counter implementation is used in Histogram as well, so the change would have a large impact on Histogram's memory footprint.

In general, buying metrics-observation performance with collection performance is very much in the spirit of this library. I think @gjasny made the right suggestion in extracting this into its own class and making it opt-in instead of changing the default behavior. An even better solution would combine both behaviors and let people make the tradeoff themselves:

  • make the thread limit configurable at runtime or compile time
  • group threads into hash buckets, busy-wait within a bucket like the current implementation does, and gain performance by decreasing the likelihood of contention
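A minimal sketch of the compile-time knob (hypothetical names, not code from the PR): the slot count becomes a template parameter, so embedded users can choose a small memory footprint while servers can choose many slots.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical sketch: the per-thread slot count as a template
// parameter, making the memory/performance tradeoff user-selectable.
template <std::size_t NumSlots = 256>
class SlottedCounter {
 public:
  void Add(std::size_t slot, double v) {
    // The caller (e.g. a thread-id hash) picks the slot; a relaxed CAS
    // loop keeps the add correct even if two threads share a slot.
    std::atomic<double>& a = slots_[slot % NumSlots].v;
    double cur = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(cur, cur + v,
                                    std::memory_order_relaxed)) {
    }
  }

  double Value() const {
    double sum = 0.0;
    for (const auto& s : slots_) sum += s.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  struct alignas(128) Slot {
    std::atomic<double> v{0.0};
  };
  std::array<Slot, NumSlots> slots_;
};
```

Usage might look like SlottedCounter&lt;4&gt; on an embedded target and SlottedCounter&lt;256&gt; on a big server, keeping one code path for both.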

    const int id{count_.fetch_add(1)};

    if (id >= per_thread_counter_.size()) {
      std::terminate();
Inline review comment:

You could initialize the zero element of per_thread_counter_ and fall back to it when the thread count exceeds the array size. But that does not work with the load/store scheme. Why aren't you using fetch_add with a relaxed memory order?


    void IncrementUnchecked(const double v) {
      CacheLine& c = per_thread_counter_[ThreadId()];
      const double new_value{c.v.load(std::memory_order_relaxed) + v};
Inline review comment:

Is it possible to remove the load there? We could keep a second counter (e.g. CacheLine.localV) that is not atomic (and does not need to be, as it is only accessed from this thread). Only the store needs to be atomic, right?

(just passing by. thanks for the book recommendation!)
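The suggestion above could look like this (CacheLine/localV are the reviewer's hypothetical names, not PR code): the owning thread accumulates into a plain double and publishes it with a single relaxed store, so the hot path never loads the atomic. This only stays correct while each slot has exactly one writer.

```cpp
#include <atomic>

// Sketch: a non-atomic shadow counter plus an atomic publish.
struct alignas(128) CacheLine {
  double localV{0.0};          // touched only by the owning thread
  std::atomic<double> v{0.0};  // read by the collecting thread
};

inline void IncrementUnchecked(CacheLine& c, double amount) {
  c.localV += amount;                              // plain, non-atomic add
  c.v.store(c.localV, std::memory_order_relaxed);  // publish for Collect
}
```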
