optimization around get for raw usage #44

Merged: 3 commits, Feb 4, 2025

@danmayer danmayer commented Feb 4, 2025

OK, starting to hit a wall with most of the profile and benchmark data, but we will talk more with some YJIT folks next week.

This is a small win, but it shows up in both the profiles and the benchmarks, and it should matter more with remote than with local memcached servers because it reduces the total bytes written to and read from the socket. The tweak only applies to the raw: true client, which is what we use.

  • multi_get goes from 1.27x slower to 1.20x slower than raw socket
  • get goes from 1.17x slower to 1.14x slower than raw socket

multi_get before:

❯❯❯$ BENCH_TARGET=get_multi RUBY_YJIT_ENABLE=1 bundle exec bin/benchmark                                      <bundler> [faster_get]
yjit: true
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
        get 100 keys    33.000 i/100ms
get 100 keys raw sock
                        39.000 i/100ms
Calculating -------------------------------------
        get 100 keys    333.944 (± 4.8%) i/s    (2.99 ms/i) -      3.333k in  10.003634s
get 100 keys raw sock
                        422.941 (± 7.3%) i/s    (2.36 ms/i) -      4.212k in  10.009880s

Comparison:
get 100 keys raw sock:      422.9 i/s
        get 100 keys:      333.9 i/s - 1.27x  slower

multi_get after:

❯❯❯$ BENCH_TARGET=get_multi RUBY_YJIT_ENABLE=1 bundle exec bin/benchmark                                    <bundler> [faster_get ●]
yjit: true
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
        get 100 keys    33.000 i/100ms
get 100 keys raw sock
                        42.000 i/100ms
Calculating -------------------------------------
        get 100 keys    344.285 (± 5.2%) i/s    (2.90 ms/i) -      3.465k in  10.092004s
get 100 keys raw sock
                        412.262 (± 5.3%) i/s    (2.43 ms/i) -      4.116k in  10.012749s

Comparison:
get 100 keys raw sock:      412.3 i/s
        get 100 keys:      344.3 i/s - 1.20x  slower

and get before:

❯❯❯$ BENCH_TARGET=get RUBY_YJIT_ENABLE=1 bundle exec bin/benchmark                                            <bundler> [faster_get]
yjit: true
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
           get dalli   352.000 i/100ms
            get sock   399.000 i/100ms
get sock non-blocking
                       342.000 i/100ms
Calculating -------------------------------------
           get dalli      3.436k (± 5.1%) i/s  (291.01 μs/i) -     34.496k in  10.066406s
            get sock      4.016k (± 3.0%) i/s  (248.99 μs/i) -     40.299k in  10.042930s
get sock non-blocking
                          3.448k (± 5.9%) i/s  (290.00 μs/i) -     34.542k in  10.046747s

Comparison:
            get sock:     4016.2 i/s
get sock non-blocking:     3448.3 i/s - 1.16x  slower
           get dalli:     3436.3 i/s - 1.17x  slower

and get after:

❯❯❯$ BENCH_TARGET=get RUBY_YJIT_ENABLE=1 bundle exec bin/benchmark                                          <bundler> [faster_get ●]
yjit: true
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
           get dalli   353.000 i/100ms
            get sock   397.000 i/100ms
get sock non-blocking
                       356.000 i/100ms
Calculating -------------------------------------
           get dalli      3.540k (± 3.5%) i/s  (282.45 μs/i) -     35.653k in  10.082393s
            get sock      4.028k (± 3.5%) i/s  (248.25 μs/i) -     40.494k in  10.064542s
get sock non-blocking
                          3.548k (± 4.3%) i/s  (281.86 μs/i) -     35.600k in  10.050622s

Comparison:
            get sock:     4028.3 i/s
get sock non-blocking:     3547.8 i/s - 1.14x  slower
           get dalli:     3540.5 i/s - 1.14x  slower
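
As a quick sanity check, the "x slower" factors in the summary can be recomputed from the iterations-per-second figures in the benchmark output above. This is only an arithmetic sketch; the `RUNS` table and the `slowdown` helper are names made up for illustration and are not part of the benchmark script.

```ruby
# Iterations per second copied from the benchmark output above.
RUNS = {
  "multi_get before" => { raw_socket: 422.941, dalli: 333.944 },
  "multi_get after"  => { raw_socket: 412.262, dalli: 344.285 },
  "get before"       => { raw_socket: 4016.2,  dalli: 3436.3 },
  "get after"        => { raw_socket: 4028.3,  dalli: 3540.5 }
}.freeze

# Hypothetical helper: how many times slower the client is than the raw socket.
def slowdown(raw_socket:, dalli:)
  (raw_socket / dalli).round(2)
end

RUNS.each do |label, rates|
  puts format("%-17s %.2fx slower than raw socket", label, slowdown(**rates))
end
```

The raw-socket baseline also moved between runs, and the reported error margins (roughly ±3% to ±7%) are of a similar size to the improvement, so these factors are best read as indicative.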


post_get_req = optimized_for_raw ? "v k q\r\n" : "v f k q\r\n"
Author

By dropping the f flag we save 2 bytes on each of the 100 get calls, and we also remove the bytes that carry the bitflags in the response, which we do not want or need in this case.
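
To make the request-side saving concrete, here is a minimal sketch that builds a pipelined meta-get request both ways and compares the sizes. `build_multi_get_request` is a hypothetical helper rather than the method in this PR, and the trailing `mn` no-op used to end the batch is an assumption about how the pipeline is terminated.

```ruby
# Hypothetical helper (not the PR's code): builds one pipelined request string
# containing a meta-get ("mg") command per key.
def build_multi_get_request(keys, optimized_for_raw:)
  # Same flag strings as in the diff: v = value, f = client flags,
  # k = key, q = suppress the response on a miss.
  post_get_req = optimized_for_raw ? "v k q\r\n" : "v f k q\r\n"
  request = +""
  keys.each { |key| request << "mg " << key << " " << post_get_req }
  request << "mn\r\n" # assumption: a no-op marks the end of the batch
  request
end

keys = Array.new(100) { |i| "key_#{i}" }
with_flags    = build_multi_get_request(keys, optimized_for_raw: false)
without_flags = build_multi_get_request(keys, optimized_for_raw: true)

# Each command is shorter by "f " (2 bytes), so 100 keys save 200 bytes.
puts with_flags.bytesize - without_flags.bytesize
```

On the response side, each hit should also lose its `f<flags>` token from the `VA` header line, so the read gets shorter as well.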

@@ -68,25 +74,42 @@ def read_multi_req(keys)
# VA value_length flags key
tokens = line.split
value = @connection_manager.read_exact(tokens[1].to_i)
bitflags = optimized_for_raw ? 0 : @response_processor.bitflags_from_tokens(tokens)
Author

Since we didn't ask for bitflags in the response, they will not be there and are treated as 0, so we skip the flag parsing across the whole batch.
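
A rough sketch of what this means for the header line of each hit. `parse_bitflags` and `parse_value_header` are hypothetical stand-ins rather than the real response processor, and the sample header lines assume the server returns the tokens in the order they were requested.

```ruby
# Hypothetical stand-in for the flag lookup: finds the "f<number>" token
# after "VA <size>" and returns it as an integer (0 when absent).
def parse_bitflags(tokens)
  flag_token = tokens.drop(2).find { |t| t.start_with?("f") }
  flag_token ? flag_token[1..-1].to_i : 0
end

# Parses a "VA <size> [f<flags>] k<key>" header line into
# [value_length, bitflags, key].
def parse_value_header(line, optimized_for_raw:)
  tokens = line.split
  value_length = tokens[1].to_i
  # In raw mode no flags were requested, so skip the lookup entirely.
  bitflags = optimized_for_raw ? 0 : parse_bitflags(tokens)
  key_token = tokens.drop(2).find { |t| t.start_with?("k") }
  [value_length, bitflags, key_token && key_token[1..-1]]
end

p parse_value_header("VA 5 f0 kfoo", optimized_for_raw: false)
p parse_value_header("VA 5 kfoo", optimized_for_raw: true)
```

Both calls yield the same tuple for a raw value stored with flags 0; the raw path simply avoids scanning the tokens for a flag that was never requested.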

@grcooper grcooper left a comment

Are there tests for this?

danmayer commented Feb 4, 2025

Mostly it was driven by the benchmark and profile, but I can add a test.

@danmayer danmayer requested a review from grcooper February 4, 2025 17:30
@danmayer danmayer merged commit 2351639 into main Feb 4, 2025
14 checks passed
@danmayer danmayer deleted the faster_get_multi_get branch February 4, 2025 18:01