Performance in benchmarks #386

jtjeferreira opened this issue May 18, 2021 · 12 comments

@jtjeferreira

Hi

I am opening this issue to document some findings about fs2-grpc's performance in this benchmark. I started this journey investigating why the akka-grpc results were so bad (https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/), but then got curious about the numbers for the other implementations...

The fs2-grpc implementation of the benchmark was done in this PR, and these are the results I got:

Benchmark info:
37a7f8b Mon, 17 May 2021 16:06:05 +0100 João Ferreira scala zio-grpc implementatio
Benchmarks run: scala_fs2_bench scala_akka_bench scala_zio_bench java_hotspot_grpc_pgc_bench
GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=5s
GRPC_SERVER_CPUS=3
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211705T162018
--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   59884 |       16.19 ms |       40.65 ms |       54.12 ms |       88.15 ms |  256.21% |     204.7 MiB |
| scala_akka         |    7031 |      141.70 ms |      281.35 ms |      368.74 ms |      592.53 ms |  294.91% |    175.44 MiB |
| scala_fs2          |    7005 |      142.20 ms |      231.57 ms |      266.35 ms |      357.07 ms |  274.57% |    351.34 MiB |
| scala_zio          |    6835 |      145.74 ms |      207.45 ms |      218.25 ms |      266.37 ms |  242.61% |    241.43 MiB |
--------------------------------------------------------------------------------------------------------------------------------

I did some profiling with JFR and wanted to share the results.

The biggest problem is GC:

[JFR screenshot: garbage collection]

Threads look fine:

[JFR screenshot: threads]

Memory:

[JFR screenshot: memory]
And the culprits are scalapb.GeneratedMessageCompanion.parseFrom and fs2.grpc.server.Fs2ServerCall#sendMessage. There is also a lot of cats.effect.* stuff...

@jtjeferreira
Author

So after “wasting” all these hours profiling, I noticed that the heap settings were not being applied. After changing that, the results are a bit better.

https://discuss.lightbend.com/t/akka-grpc-performance-in-benchmarks/8236/14

@jtjeferreira
Author

jtjeferreira commented May 19, 2021

I was doing some more profiling after having fixed the heap settings, and even though the results are much better, I noticed the usage of unsafeRunSync (the pink on the left side).

[profiling screenshot]

I am not very experienced with cats-effect, but my understanding is that we could use the Async FFI without having to call "unsafe" code.
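
For illustration, here is a minimal sketch of that idea, assuming cats-effect 3. This is not fs2-grpc's actual internals; the generic call parameter is just a stand-in for any callback-based grpc-java async stub method. The callback is bridged into F with Async[F].async_, so no "unsafe" run is needed to construct the effect:

import cats.effect.Async
import io.grpc.stub.StreamObserver

def unaryCall[F[_], Req, Res](call: (Req, StreamObserver[Res]) => Unit, request: Req)(
    implicit F: Async[F]
): F[Res] =
  F.async_ { cb =>
    call(
      request,
      new StreamObserver[Res] {
        def onNext(value: Res): Unit = cb(Right(value)) // complete with the response
        def onError(t: Throwable): Unit = cb(Left(t))   // complete with the error
        def onCompleted(): Unit = ()                    // unary call: value already delivered via onNext
      }
    )
  }

(Running effects from inside grpc-java listener callbacks, on the other hand, still needs something like cats.effect.std.Dispatcher rather than raw unsafeRunSync.)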

@jtjeferreira
Author

For reference, here is the flame graph for the java benchmark:

[profiling screenshot]

The netty part is pretty similar (purple, right side), but comparing with the picture from the post above, there we have the cats-effect threads (right side) and here the ServiceBuilder Executor threads (left side).

@ahjohannessen
Collaborator

You could try to see if it makes things faster by using the runtime's compute pool as the Executor, i.e. new Executor { def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd) }. Might make a difference.
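
Spelled out, a minimal sketch of that suggestion, assuming cats-effect 3 (the port number and the service value are placeholders): the IORuntime's compute pool is handed to grpc-java as the server executor, so call events are dispatched on the same threads the IO runtime uses instead of on a separate pool.

import java.util.concurrent.Executor

import cats.effect.unsafe.IORuntime
import io.grpc.{ServerBuilder, ServerServiceDefinition}

def buildServer(runtime: IORuntime, service: ServerServiceDefinition) = {
  val computeExecutor: Executor = new Executor {
    def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd)
  }
  ServerBuilder
    .forPort(50051)            // placeholder port
    .executor(computeExecutor) // grpc-java dispatches call events on the compute pool
    .addService(service)
    .build()
}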

@jtjeferreira
Author

You could try to see if it makes things faster by using the runtime's compute pool as the Executor, i.e. new Executor { def execute(cmd: Runnable): Unit = runtime.compute.execute(cmd) }. Might make a difference.

I tried that, and even new Executor { def execute(cmd: Runnable): Unit = IO.blocking(cmd.run()).unsafeRunSync() }. If I recall correctly, the application was being killed by OOM. I even tried upgrading to the latest cats-effect in case that would make a difference, but it didn't.

@ahjohannessen
Collaborator

I did try it and memory did not go up, and it was around 2k req/s faster than otherwise. However, I suppose there is unnecessary context shifting, but I am not sure what is the best way to avoid that.

@jtjeferreira
Author

I did try it and memory did not go up, and it was around 2k req/s faster than otherwise. However, I suppose there is unnecessary context shifting, but I am not sure what is the best way to avoid that.

Maybe I was doing something wrong, but I will try again later today and will let you know. What were the benchmark settings you were using? Meanwhile, did you have a look at that unsafeRunSync to see if there are ways to avoid it?

@ahjohannessen
Collaborator

ahjohannessen commented May 20, 2021

I cannot remember what I did, but I tried again, allocating more CPUs to see what happened:

--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| scala_fs2          |   37711 |       26.28 ms |       47.02 ms |       72.41 ms |      148.46 ms | 1087.87% |    411.78 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.

and grpc-java

--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   72310 |       12.76 ms |       23.63 ms |       34.47 ms |       80.52 ms |  574.66% |    396.55 MiB |
--------------------------------------------------------------------------------------------------------------------------------
Benchmark Execution Parameters:
b81da51 Wed, 19 May 2021 23:36:38 +0200 GitHub Merge pull request #145 from LesnyRumcajs/harden-analysis-cleanup
- GRPC_BENCHMARK_DURATION=30s
- GRPC_BENCHMARK_WARMUP=10s
- GRPC_SERVER_CPUS=20
- GRPC_SERVER_RAM=1024m
- GRPC_CLIENT_CONNECTIONS=50
- GRPC_CLIENT_CONCURRENCY=1000
- GRPC_CLIENT_QPS=0
- GRPC_CLIENT_CPUS=9
- GRPC_REQUEST_PAYLOAD=100B
All done.

Most likely it is context switching that is killing the performance for fs2-grpc.

@fiadliel
Contributor

Can you have a look at #394 and see if it helps? (should slightly reduce the number of unsafeRun operations per request)

@fiadliel
Contributor

No, never mind. There wasn't actually much to improve there.

But there is another issue: #39 -- flow control. I mentioned subtleties before, but I've completely lost context. I'll start having another look at this. But flow control is important -- right now, the "window size" for data from the client is always 1 or 0, and this could have a major impact on throughput.
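
For context, a minimal sketch of the underlying grpc-java mechanism, not fs2-grpc's code (the PrefetchingHandler name and the prefetch size are arbitrary illustrations): the server controls the inbound window by telling the transport how many messages it is willing to receive via ServerCall#request, so asking for one message at a time keeps the window at 0 or 1, while requesting a batch lets more data stay in flight.

import io.grpc.{ForwardingServerCallListener, Metadata, ServerCall, ServerCallHandler}

class PrefetchingHandler[Req, Res](
    delegate: ServerCallHandler[Req, Res],
    prefetch: Int = 16 // hypothetical batch size
) extends ServerCallHandler[Req, Res] {

  def startCall(call: ServerCall[Req, Res], headers: Metadata): ServerCall.Listener[Req] = {
    val listener = delegate.startCall(call, headers)
    call.request(prefetch) // ask the transport for several messages up front
    new ForwardingServerCallListener.SimpleForwardingServerCallListener[Req](listener) {
      override def onMessage(message: Req): Unit = {
        super.onMessage(message)
        call.request(1) // keep the window topped up as messages are consumed
      }
    }
  }
}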

@fiadliel
Contributor

And that's not an issue in non-streaming scenarios, which is the case in the benchmark 😞
I'd better actually download the benchmark code…

@sideeffffect
Contributor

If it is of any use, Lightbend blogged about how they increased Akka gRPC performance:
https://www.lightbend.com/blog/akka-grpc-update-delivers-1200-percent-performance-improvement
