Streaming after adding nodes causes to existing (not new) nodes run with 100% CPU on all shards #20921
Comments
Do we see what's taking the CPU load? Any metrics for the different groups other than streaming, or is it just streaming? |
@denesb - please assign someone to triage this. |
@raphaelsc please have a look, this could be related to https://github.com/scylladb/scylla-enterprise/issues/4504 |
I was able to correlate discard with latency. The reason why it might be a worse problem with tablets is that topology changes can trigger tons of deletions during tablet cleanup on a node that has many tablets being moved away from it. Given the cheap cost of each cleanup, we're able to do many of them quickly, which didn't happen with vnodes. This reminds me of the async discard discussion causing problems, which led orgs to maintain their own kernel to optionally restore the old sync behavior (https://patchwork.kernel.org/project/linux-block/patch/[email protected]/). Some disks are known to have bad discard performance and the async behavior doesn't help, so we need to throttle it on our end. /cc @pwrobelse @juliayakovlev let's retry this test with XFS mounted without the online discard flag (-o discard). Of course, this is not an actual solution (the throttling might be), but it will shed some latency if discard is the source of those high latencies. |
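For reference, a minimal sketch of what "mount without online discard, fall back to periodic batch discard" could look like. The device name and mount point below are assumptions for illustration, not taken from this test's setup:

```
# /etc/fstab — XFS mounted WITHOUT the "discard" option (no online discard)
/dev/nvme0n1  /var/lib/scylla  xfs  noatime  0 0

# Discard is then done in batches on a schedule instead of inline, e.g.:
#   systemctl enable --now fstrim.timer
# or as a one-off:
#   fstrim -v /var/lib/scylla
```

Note this only moves the discard cost to a scheduled window; it does not throttle the deletions themselves.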
It's also possible that some of this bad write latency can be attributed to shards being overcommitted due to a low tablet count per shard. I found this problem elsewhere in a similar elasticity test. |
@juliayakovlev any updates on this one? |
@amnonh I tried to load the monitoring database:

./start-all.sh -v 6.3 --archive /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53

and got:

scylla.txt not found in /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/. You can use it to start the monitoring stack with a given version.
For example, to start the monitoring stack with version 2024.1 and manager 3.3:

echo VERSIONS="2024.1" > /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/scylla.txt
echo MANAGER_VERSION="3.3" >> /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/scylla.txt

What is scylla.txt? Why is it not there? |
scylla.txt was added in ScyllaDB monitoring 4.7. It's created when there is an external Prometheus directory; it holds some basic information about the last run, such as command line parameters and the Scylla version used. The warning appears because you are using the |
@raphaelsc |
Please file an issue. CC @syuu1228 |
Recent update: I succeeded in running the test with discard disabled. I still have the problem, and I do not know why it happens or how to solve it:
These errors are found on the existing nodes (not the newly added ones) and complain that a newly added node failed to obtain an IP. In the previous runs, Scylla failed to start up on the newly added nodes at step (5).
https://argus.scylladb.com/tests/scylla-cluster-tests/d6f29648-cbe7-4d7e-af33-9b5a06fadf96 NOTE: if online discard is not disabled, startup succeeds and no such errors appear. |
@juliayakovlev could you create a bug about the wait_for_ip problem?
Which issue exactly do you have in mind? I see that when adding a new node, latency still increases to 600 ms, which is maybe better but not optimal. |
@swasik |
@raphaelsc so what do you recommend to do next? Should we implement throttling for deletions? |
I think I should first confirm this issue is indeed related to discard, by running the test with discard disabled. Discard being disabled shouldn't cause the problem seen by @juliayakovlev |
I succeeded in running the test with discard disabled only one time, so I can't say for sure that disabling discard solves the problem. |
Either @raphaelsc or myself will continue the investigation, but right now both of us are busy with other high priority work. |
@michoecho Please take a look and continue the investigation so we can make some progress. |
Okay.
It seems to me that this thread so far was completely off base. For what reason did we even start talking about online discard? The main culprit seems to be bad load balancing of requests to coordinators. Observe what happens to the distribution of coordinator work at 8:10, the moment where 3 new nodes are bootstrapped. Before the bootstrap, the distribution among the original 3 nodes is 33%:33%:33%. Immediately after the topology changes, it becomes 77%:11%:11%. The 77% node is CPU-overloaded, and can't handle the incoming throughput, which fails the test. (Note that this is all very sensitive to the actual numbers. If the incoming throughput was low enough, the test could have survived. If it was high enough, the test could fail at the same stage even if everything works as expected). And I have an impression that this is the third time I'm looking at the same issue, probably even in the same test. |
@michoecho - #19107 (comment) perhaps? |
Yes, I was just about to link #19107. |
Thanks for the update! So let us wait for #19107 to see if it helps. |
We could test with a different workload generator that uses some other driver / some other policy, such as sqlstress and scylla-bench. |
Correct, but I already pinged Wojciech about #19107 and he says the fix should be ready today/tomorrow so probably does not make sense to add extra work. |
@juliayakovlev The potential fix for this is now merged in c-s repo (scylladb/cassandra-stress#32). Could you rerun the test making sure that the fix is included? |
@swasik - I think we need a bit longer cycle here - we need a c-s release, then we need it somehow either in a container or a RPM or DEB or whatnot to be installed. |
You are right, we have to wait for scylladb/scylla-cluster-tests#9296. |
From the looks of it so far, it doesn't seem to be helping. @juliayakovlev you did run it already, right? |
I ran the elasticity test with cassandra-stress 3.16.0 |
I don't think it's in 3.16.0 though, looking @ scylladb/cassandra-stress@v3.16.0...master - can you please verify? |
@mykaul Considering that the relevant commit is physically younger than the release (2024-11-19 vs 2024-11-18), it probably isn't in the release. But, by the way, while the coordinator imbalance is the direct trigger for the failure of the test, it's not the only big issue. The next one is replica imbalance. For example, in run e7d86441-8c7d-4798-9348-ea8924a2f608 (from https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2490768963), look at the time range between 17:00 and 18:00. This is where the test is running a read-only load on a freshly-populated cluster, before any topology operations. Coordinators are balanced, (the coordinator balance only breaks down at 18:00 after topology operations, due to the java driver problem), but replicas are very unbalanced. The difference in load between shards should be negligible, but instead the difference between the most-loaded and least-loaded shard is 14k/s vs 5 k/s. The test survives this (with difficulty), but it's (obviously?) not acceptable. In other words, fixing the client issue might fix the timeouts, but there are more problems to fix. |
It seems that the fix was included in 3.17, not 3.16. @juliayakovlev could you retry with 3.17? |
@michoecho - could the issue you described @ #20921 (comment) be explained by low tablet count? |
No. Low tablet count can cause balance issues proportional to the difference in tablet count. For example, if one shard replicates two tablets and another shard replicates one tablet, it makes sense that the first shard receives 2x as many requests as the second shard. But in this case the inherent imbalance due to "low tablet count" is 18 tablets on the least-loaded shard vs 19 tablets on the most-loaded shard. That can explain a 5% imbalance in replica reads, but not 300%. I'd guess this is a dumber problem. With tablets, does Scylla even try to balance reads across replicas? Perhaps |
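To make the arithmetic explicit, here is a small back-of-the-envelope check. The tablet counts (18 vs 19) and per-shard read rates (5k/s vs 14k/s) are the figures quoted in these comments; treat this as a sketch, not measured data:

```python
# Load ratio expected purely from the unequal tablet counts per shard.
least_tablets, most_tablets = 18, 19
expected_ratio = most_tablets / least_tablets
print(f"expected load ratio from tablet count: {expected_ratio:.2f}x")

# Load ratio actually observed between the least- and most-loaded shard.
least_rps, most_rps = 5_000, 14_000  # reads/s, from the earlier comment
observed_ratio = most_rps / least_rps
print(f"observed load ratio: {observed_ratio:.1f}x")
```

The expected ratio is about 1.06x (a ~5% imbalance), while the observed ratio is 2.8x, so the tablet-count difference cannot account for the measured skew.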
@michoecho note that even if all shards across all nodes have the exact same number of tablets, there can still be imbalance because partition sizes are not uniform and/or their distribution across the tablets is not uniform. A larger number of tablets helps because the odds of a tablet being significantly different in size are reduced. |
@denesb That might be true in general, but it doesn't apply to this test. Here we are dealing with just the usual cassandra-stress load. All partitions are small and equally-sized. In those conditions request load is balanced across tablets near-uniformly. Scylla should translate that to a good balance of replica requests across shards, but it doesn't. |
@swasik asked @juliayakovlev to retest this with the fix to c-stress. |
The original AMI (from this test) does not exist anymore. I ran the test with cassandra-stress |
Test results: https://argus.scylladb.com/tests/scylla-cluster-tests/32cbafca-e613-4545-a649-93798dfc6c05 The existing nodes (not the newly added ones) still run at about 100% CPU. The ReadTimeout error is not reproduced. |
Great, so this is in line with what @michoecho predicted in #20921 (comment) and suggests that at least the problem in the driver is solved. Now we have two more issues to solve, as suggested in https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2499242192. |
@swasik any reason to keep this issue open then? |
I think we can close it and keep tracking the remaining performance issues in https://github.com/scylladb/scylla-enterprise/issues/4504. |
Reproduced again and caused very high latency.

Cluster size: 3 nodes (i4i.2xlarge) |
Packages
Scylla version: 6.3.0~dev-20240921.cd861bc78881 with build-id 66f68fcba94a28c6d0267156866992dd3da6f7a0
Kernel Version: 6.8.0-1016-aws

Issue description
Tablets are enabled.
3 new nodes were added (in parallel).
After Scylla initialisation on the newly added nodes, streaming started.
All shards on the existing (not new) nodes ran at 100% CPU during streaming.
As a result, the query
"SELECT value FROM system.scylla_local WHERE key='enabled_features'"
timed out (reader_concurrency_semaphore on node 2, node 3).
Existing nodes:
Newly added nodes:

Impact
A query from a system table fails.

How frequently does it reproduce?
It happened 3 times:
1 time with 6.3.0~dev-20240921.cd861bc78881 build_id 66f68fcba94a28c6d0267156866992dd3da6f7a0
2 times with 6.3.0~dev-20240929.5a470b2bfbe6 build_id a32aed9ea1b0ecae48e5ddd41ee18cc428a048e9
(https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=af267823-8cf6-4077-98ee-c637700832aa, https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=0874f7db-ab3c-4e09-a17a-0e8a4ade3b13)

Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-01c7a2af0b5876566 (aws: undefined_region)
Test: scylla-master-perf-regression-latency-650gb-elasticity
Test id: 038dd3d5-a7f1-4201-a760-3cd23d2492a2
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_read_with_nemesis
Test config file(s):

Logs and commands
$ hydra investigate show-monitor 038dd3d5-a7f1-4201-a760-3cd23d2492a2
$ hydra investigate show-logs 038dd3d5-a7f1-4201-a760-3cd23d2492a2
Logs:
Jenkins job URL
Argus