
Streaming after adding nodes causes to existing (not new) nodes run with 100% CPU on all shards #20921

Closed

juliayakovlev opened this issue Oct 1, 2024 · 47 comments

@juliayakovlev

Packages

Scylla version: 6.3.0~dev-20240921.cd861bc78881 with build-id 66f68fcba94a28c6d0267156866992dd3da6f7a0

Kernel Version: 6.8.0-1016-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Tablets are enabled.

3 new nodes were added (in parallel)

2024-09-22 08:06:59.016: (InfoEvent Severity.NORMAL) period_type=not-set event_id=bf3102db-8598-48cc-8a8a-e18fa0bdb231: message=Start grow cluster by 3 nodes
2024-09-22 08:10:59.595: (InfoEvent Severity.NORMAL) period_type=not-set event_id=7b303f0a-9c45-4e72-af3f-78c68eee4ad7: message=FinishEvent - New Nodes are up and normal

After Scylla initialised on the newly added nodes, streaming started.
During streaming, all shards on the existing (not new) nodes ran at 100% CPU.
As a result, the query "SELECT value FROM system.scylla_local WHERE key='enabled_features'" timed out:

2024-09-22 08:11:31.304: (DisruptionEvent Severity.ERROR) period_type=end event_id=75f2bf77-319c-4649-9590-8d4c3fc7f94a duration=35m6s: nemesis_name=GrowShrinkCluster target_node=Node elasticity-test-ubuntu-db-node-038dd3d5-3 [3.252.30.111 | 10.4.6.134] errors=Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for system.scylla_local - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5222, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4094, in disrupt_grow_shrink_cluster
    self._grow_cluster(rack=None)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4116, in _grow_cluster
    self.add_new_nodes(count=add_nodes_number, rack=rack)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 199, in wrapped
    res = func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4033, in add_new_nodes
    self._wait_for_tablets_balanced(nodes[0])
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5102, in _wait_for_tablets_balanced
    if not node.raft.is_enabled:
  File "/usr/local/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 740, in raft
    return get_raft_mode(self)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/__init__.py", line 332, in get_raft_mode
    return RaftFeature(node) if is_consistent_cluster_management_feature_enabled(session) else NoRaft(node)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/features.py", line 63, in is_consistent_cluster_management_feature_enabled
    return CONSISTENT_CLUSTER_MANAGEMENT_FEATURE in get_enabled_features(session)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/features.py", line 46, in get_enabled_features
    result = session.execute("SELECT value FROM system.scylla_local WHERE key='enabled_features'").one()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1836, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for system.scylla_local - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
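
For reference, the check that timed out boils down to a single CL=ONE read of system.scylla_local. Below is a minimal sketch of the same query with the Python driver; the contact point and the 30-second client-side timeout are illustrative assumptions, not values taken from the test code:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Illustrative contact point and timeout; the test connects to one of the cluster nodes.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.ONE, request_timeout=30)
cluster = Cluster(["10.4.6.134"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# The same query that get_enabled_features() in sdcm/utils/features.py issues.
row = session.execute("SELECT value FROM system.scylla_local WHERE key='enabled_features'").one()
enabled_features = set(row.value.split(",")) if row and row.value else set()
print(enabled_features)

cluster.shutdown()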

reader_concurrency_semaphore on node 2

Sep 22 08:09:47.075637 elasticity-test-ubuntu-db-node-038dd3d5-2 scylla[6273]:  [shard 3:strm] stream_session - [Stream #08517124-78ba-11ef-9f23-5ee667090e75] Start sending ks=keyspace1, cf=standard1, estimated_partitions=5080448, with new rpc streaming
Sep 22 08:09:47.109627 elasticity-test-ubuntu-db-node-038dd3d5-2 scylla[6273]:  [shard 0:strm] stream_session - [Stream #08517121-78ba-11ef-9f23-5ee667090e75] Start sending ks=keyspace1, cf=standard1, estimated_partitions=5081728, with new rpc streaming
Sep 22 08:10:01.815912 elasticity-test-ubuntu-db-node-038dd3d5-2 scylla[6273]:  [shard 0:stmt] reader_concurrency_semaphore - Semaphore user with 21/100 count and 431892/174986362 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=1, memory=17143, table=keyspace1.standard1, operation=data-query, state=active/await
Identified bottleneck(s): CPU

permits        count        memory        table/operation/state
20        20        406K        keyspace1.standard1/data-query/active/await
1        1        16K        keyspace1.standard1/data-query/waiting_for_execution
43372        0        0B        keyspace1.standard1/data-query/waiting_for_admission

43393        21        422K        total

Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 26497134
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 26497163
reads_enqueued_for_admission: 8887031
reads_enqueued_for_memory: 0
reads_admitted_immediately: 17653504
reads_queued_because_ready_list: 2274231
reads_queued_because_need_cpu_permits: 6612798
reads_queued_because_memory_resources: 2
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 26540535
current_permits: 43393
need_cpu_permits: 20
awaits_permits: 20
disk_reads: 20

node 3

Sep 22 08:09:47.084185 elasticity-test-ubuntu-db-node-038dd3d5-3 scylla[6285]:  [shard 1:strm] stream_session - [Stream #08517122-78ba-11ef-9f23-5ee667090e75] Start sending ks=keyspace1, cf=standard1, estimated_partitions=5076992, with new rpc streaming
Sep 22 08:11:03.332918 elasticity-test-ubuntu-db-node-038dd3d5-3 scylla[6285]:  [shard 0:strm] stream_session - [Stream #0780dab4-78ba-11ef-8e77-51c018b3284a] Streaming plan for Tablet migration-keyspace1-index-0 succeeded, peers={10.4.6.9}, tx=7444938 KiB, 95894.20 KiB/s, rx=0 KiB, 0.00 KiB/s
Sep 22 08:11:08.504189 elasticity-test-ubuntu-db-node-038dd3d5-3 scylla[6285]:  [shard 2:stmt] token_metadata - topology version 153 held for 4.984 [s] past expiry, released at: 0x6180c8e 0x61812b0 0x61815b8 0x3d7278a 0x3d658a6 0x3d10fa3 0x3003e77 0x3003193 0x30d33ae 0x5c82edf 0x5c8445a 0x5c85637 0x5ca99e0 0x5c45c1a /opt/scylladb/libreloc/libc.so.6+0x97506 /opt/scylladb/libreloc/libc.so.6+0x11b40b
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}, seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >::then_wrapped_nrvo<void, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}>(service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&, seastar::future_state<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >&&)#1}, seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >

Existing nodes:
Image
Image
Image

Newly added nodes:
Image
Image
Image

Impact

Query from system table fails

How frequently does it reproduce?

It happened 3 times:
1 time with 6.3.0~dev-20240921.cd861bc78881 build_id 66f68fcba94a28c6d0267156866992dd3da6f7a0
2 times with 6.3.0~dev-20240929.5a470b2bfbe6 build_id a32aed9ea1b0ecae48e5ddd41ee18cc428a048e9 (https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=af267823-8cf6-4077-98ee-c637700832aa, https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=0874f7db-ab3c-4e09-a17a-0e8a4ade3b13)

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • elasticity-test-ubuntu-db-node-038dd3d5-6 (3.249.251.153 | 10.4.6.9) (shards: 7)
  • elasticity-test-ubuntu-db-node-038dd3d5-5 (52.30.199.197 | 10.4.5.105) (shards: 7)
  • elasticity-test-ubuntu-db-node-038dd3d5-4 (3.254.70.224 | 10.4.4.84) (shards: 7)
  • elasticity-test-ubuntu-db-node-038dd3d5-3 (3.252.30.111 | 10.4.6.134) (shards: 7)
  • elasticity-test-ubuntu-db-node-038dd3d5-2 (54.74.58.34 | 10.4.6.149) (shards: 7)
  • elasticity-test-ubuntu-db-node-038dd3d5-1 (18.201.8.14 | 10.4.5.49) (shards: 7)

OS / Image: ami-01c7a2af0b5876566 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-elasticity
Test id: 038dd3d5-a7f1-4201-a760-3cd23d2492a2
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_read_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 038dd3d5-a7f1-4201-a760-3cd23d2492a2
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 038dd3d5-a7f1-4201-a760-3cd23d2492a2

Logs:

Jenkins job URL
Argus

@juliayakovlev changed the title from "Streaming after adding nodes cause to existing (not new) nodes run with 100% CPU on all shards" to "Streaming after adding nodes causes to existing (not new) nodes run with 100% CPU on all shards" on Oct 1, 2024
@mykaul (Contributor) commented Oct 1, 2024

Do we see what's taking the CPU load? Any metrics for the different groups other than streaming, or is it just streaming?

@juliayakovlev (Author)

Do we see what's taking the CPU load? Any metrics for the different groups other than streaming, or is it just streaming?

There are spikes in writes per instance and row merges
Image

Image

There is also a spike in compaction CPU runtime:
Image

@mykaul added the triage/master (Looking for assignee) label and removed the triage/oss label on Oct 6, 2024
@mykaul (Contributor) commented Oct 6, 2024

@denesb - please assign someone to triage this.

@mykaul added the P1 (Urgent) label on Oct 6, 2024
@mykaul added this to the 6.3 milestone on Oct 6, 2024
@denesb (Contributor) commented Oct 10, 2024

@raphaelsc please have a look, this could be related to https://github.com/scylladb/scylla-enterprise/issues/4504

@raphaelsc (Member)

Image

Image

I was able to correlate discard with latency. The reason why it might be a worse problem with tablets is that topology changes can trigger tons of deletions during tablet cleanup on a particular node that has many tablets being moved away from it. Given the cheap cost of each cleanup, we're able to do many of them quickly, which didn't happen with vnodes.

This reminds me of the async discard discussion causing problems, which led orgs to maintain their own kernel to optionally restore the old sync behavior (https://patchwork.kernel.org/project/linux-block/patch/[email protected]/). Some disks are known to have bad discard performance and the async behavior doesn't help, so we need to throttle it on our end. /cc @pwrobelse

@juliayakovlev let's retry this test with XFS mounted without the online discard flag (-o discard). Of course, this is not an actual solution (the throttling might be), but it will shed some latency if discard is the source of those high latencies.
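
For anyone reproducing this, a quick way to confirm whether online discard is actually in effect on a node is to inspect the mount options of the filesystem backing the data directory. A minimal Python sketch; the /var/lib/scylla path is an assumption, adjust it to the actual data directory:

import os

DATA_DIR = "/var/lib/scylla"  # assumption: default Scylla data directory

def backing_mount(path):
    # Pick the longest mount point in /proc/mounts that is a prefix of `path`.
    best = None
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _dev, mnt, fstype, opts = line.split()[:4]
            if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
                if best is None or len(mnt) > len(best[0]):
                    best = (mnt, fstype, opts.split(","))
    return best

mnt, fstype, opts = backing_mount(os.path.realpath(DATA_DIR))
print(f"{DATA_DIR} is on {mnt} ({fstype}); online discard: {'discard' in opts}")

If "discard" shows up in the options, the node is using online discard, which is what the comment above suggests disabling for the experiment.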

@raphaelsc (Member)

It's also possible that some of this bad write latency can be attributed to shards being overcommitted due to a low tablet count per shard. I found this problem elsewhere in a similar elasticity test.

@raphaelsc (Member)

@juliayakovlev any updates on this one?

@avikivity (Member)

@amnonh I tried to load the monitoring database:

./start-all.sh -v 6.3 --archive /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53
scylla.txt not found in /home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/. You can use it to start the monitoring stack with a given version
For example, to start the monitoring stack with version 2014.1 and manager 3.3
echo VERSIONS="2024.1">/home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/scylla.txt
echo MANAGER_VERSION="3.3">>/home/avi/bugs/20921/monitor-set-038dd3d5/elasticity-test-ubuntu-monitor-node-038dd3d5-1/20240922T090639Z-3dbbac818d6aec53/scylla.txt

What is scylla.txt? Why is it not there?

@amnonh (Contributor) commented Nov 4, 2024

scylla.txt was added in ScyllaDB Monitoring 4.7. It's created when there is an external Prometheus directory; it holds some basic information about the last run, such as command line parameters and the Scylla version used.
If it is present, the --archive command line parameter takes the information from that file, so you do not need to supply it yourself.

The warning appears because you are using the --archive command line flag. It's not a problem; it's just a suggestion to create such a file to make the start-all.sh invocation simpler.

@juliayakovlev (Author)

Image

Image

I was able to correlate discard with latency. The reason why it might be a worse problem with tablets is that topology changes can trigger tons of deletions during tablet cleanup on a particular node that has many tablets being moved away from it. Given the cheap cost of each cleanup, we're able to do many of them quickly, which didn't happen with vnodes.

This reminds me of the async discard discussion causing problems, which led orgs to maintain their own kernel to optionally restore the old sync behavior (https://patchwork.kernel.org/project/linux-block/patch/[email protected]/). Some disks are known to have bad discard performance and the async behavior doesn't help, so we need to throttle it on our end. /cc @pwrobelse

@juliayakovlev let's retry this test with XFS mounted without the online discard flag (-o discard). Of course, this is not an actual solution (the throttling might be), but it will shed some latency if discard is the source of those high latencies.

@raphaelsc
Yes, I am working on running such a test. To perform it I tried the --online-discard 0 Scylla setup argument, but it did not disable discard.
So I added code that explicitly disables discard, and that caused a problem with Scylla startup. Checking it.

@mykaul (Contributor) commented Nov 4, 2024

Yes, I am working on running such a test. To perform it I tried the --online-discard 0 Scylla setup argument, but it did not disable discard.

Please file an issue. CC @syuu1228

@juliayakovlev (Author) commented Nov 6, 2024

@raphaelsc

Recent update

I managed to run the test with online discard disabled once.

I still hit a problem that I do not understand and do not know how to solve:

  1. Create a cluster with 3 nodes.
  2. Disable online discard after Scylla installation - succeeded.
  3. Grow the cluster by adding 3 nodes.
  4. Disable online discard after Scylla installation on the newly added nodes - Scylla started.
  5. Continue with node configuration and stop/start Scylla again - then got errors:
!ERR | scylla[6923]:  [shard 0: gms] raft_topology - raft_topology_cmd wait_for_ip failed with: service::wait_for_ip_timeout (failed to obtain an IP for 7dd17e35-793b-4552-b48e-a39400804f9a in 30s)

These errors were found on the existing nodes (not the newly added ones) and complain that a newly added node failed to obtain an IP.
Despite this error the test was able to continue and finished successfully - the issue was not reproduced.
https://argus.scylladb.com/tests/scylla-cluster-tests/37545a1d-70b8-4d64-8fbd-324748bb3313

In the previous runs Scylla failed to start up on the newly added nodes at step (5).
Startup failure:

!ERR | scylla[6930]:  [shard 0:main] init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: wait_for_ip failed, error std::runtime_error (raft topology: exec_global_command(wait_for_ip) failed with std::runtime_error (failed status returned from b73c3ee7-458c-465c-906b-8578359f13b8/10.4.1.157)))

https://argus.scylladb.com/tests/scylla-cluster-tests/d6f29648-cbe7-4d7e-af33-9b5a06fadf96

NOTE: if online discard disabling is not performed, startup succeeds and there is no "failed to obtain an IP" problem.

@swasik commented Nov 6, 2024

@juliayakovlev could you create a bug about the wait_for_ip problem?

These errors were found on the existing nodes (not the newly added ones) and complain that a newly added node failed to obtain an IP. Despite this error the test was able to continue and finished successfully - the issue was not reproduced.

Which issue exactly do you have in mind? I see that when adding new nodes, latency still increases to 600 ms, which is maybe better but not optimal.

@juliayakovlev (Author)

@juliayakovlev could you create a bug about the wait_for_ip problem?

These errors were found on the existing nodes (not the newly added ones) and complain that a newly added node failed to obtain an IP. Despite this error the test was able to continue and finished successfully - the issue was not reproduced.

Which issue exactly do you have in mind? I see that when adding new nodes, latency still increases to 600 ms, which is maybe better but not optimal.

@swasik
I mean the current issue, the scenario described in the description (streaming runs with 100% CPU on all shards).

@swasik commented Nov 6, 2024

@raphaelsc so what do you recommend to do next? Should we implement throttling for deletions?

@raphaelsc (Member)

@raphaelsc so what do you recommend to do next? Should we implement throttling for deletions?

I think I should first confirm this issue is indeed related to discard, by running the test with discard disabled. Discard being disabled shouldn't cause the problem seen by @juliayakovlev

@juliayakovlev (Author)

I only succeeded in running the test with discard disabled once, so I can't say for sure that disabling discard solves the problem.

@denesb (Contributor) commented Nov 15, 2024

Still, the old nodes run with 100% CPU.

@denesb @raphaelsc @juliayakovlev who should be assigned for further investigation?

Either @raphaelsc or myself will continue the investigation, but right now both of us are busy with other high priority work.

@paszkow (Contributor) commented Nov 18, 2024

@michoecho Please take a look and continue the investigation so we can make some progress.

@michoecho (Contributor) commented Nov 18, 2024

@michoecho Please take a look and continue the investigation so we can make some progress.

Okay.

I was able to correlate discard with latency.
...
The test with online discard disabled:

It seems to me that this thread so far was completely off base. For what reason did we even start talking about online discard?

The main culprit seems to be bad load balancing of requests to coordinators. Observe what happens to the distribution of coordinator work at 8:10, the moment where 3 new nodes are bootstrapped.

Before the bootstrap, the distribution among the original 3 nodes is 33%:33%:33%. Immediately after the topology changes, it becomes 77%:11%:11%. The 77% node is CPU-overloaded, and can't handle the incoming throughput, which fails the test.

(Note that this is all very sensitive to the actual numbers. If the incoming throughput was low enough, the test could have survived. If it was high enough, the test could fail at the same stage even if everything works as expected).

Image

And I have an impression that this is the third time I'm looking at the same issue, probably even in the same test.
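
The potential fix discussed later in the thread lives on the cassandra-stress / Java-driver side. Purely to illustrate the balancing idea (this is not the actual fix), here is a sketch with the Python driver; the contact point and keyspace are assumptions:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, RoundRobinPolicy

# Token-aware routing wrapped around round-robin: each request goes to a replica
# of its partition, and replicas are shuffled, so coordinator work is spread
# across nodes instead of piling up on a single one.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(RoundRobinPolicy(), shuffle_replicas=True))
cluster = Cluster(["10.4.6.149"],  # illustrative contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("keyspace1")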

@mykaul (Contributor) commented Nov 18, 2024

@michoecho - #19107 (comment) perhaps?

@michoecho (Contributor)

@michoecho - #19107 (comment) perhaps?

Yes, I was just about to link #19107.

@swasik commented Nov 18, 2024

Thanks for the update! So let us wait for #19107 to see if it helps.

@mykaul (Contributor) commented Nov 18, 2024

Thanks for the update! So let us wait for #19107 to see if it helps.

We could test with a different workload generator that uses some other driver / some other policy, such as sqlstress and scylla-bench.

@swasik commented Nov 18, 2024

Thanks for the update! So let us wait for #19107 to see if it helps.

We could test with a different workload generator that uses some other driver / some other policy, such as sqlstress and scylla-bench.

Correct, but I already pinged Wojciech about #19107 and he says the fix should be ready today/tomorrow, so it probably does not make sense to add extra work.

@swasik commented Nov 19, 2024

@juliayakovlev The potential fix for this is now merged in c-s repo (scylladb/cassandra-stress#32). Could you rerun the test making sure that the fix is included?

@mykaul (Contributor) commented Nov 19, 2024

@juliayakovlev The potential fix for this is now merged in c-s repo (scylladb/cassandra-stress#32). Could you rerun the test making sure that the fix is included?

@swasik - I think we need a bit longer cycle here - we need a c-s release, then we need it somehow either in a container or a RPM or DEB or whatnot to be installed.

@swasik commented Nov 20, 2024

You are right, we have to wait for scylladb/scylla-cluster-tests#9296.

@fruch (Contributor) commented Nov 21, 2024

You are right, we have to wait for scylladb/scylla-cluster-tests#9296.

From the looks of it so far, it doesn't seem to be helping.

@juliayakovlev you did run it already, right?

@juliayakovlev (Author)

You are right, we have to wait for scylladb/scylla-cluster-tests#9296.

From the looks of it so far, it doesn't seem to be helping.

@juliayakovlev you did run it already, right?

I ran elasticity test with 3.16.0 c-s version: https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2490768963

@mykaul (Contributor) commented Nov 21, 2024

You are right, we have to wait for scylladb/scylla-cluster-tests#9296.

From the looks of it so far, it doesn't seem to be helping.
@juliayakovlev you did run it already, right?

I ran elasticity test with 3.16.0 c-s version: scylladb/scylla-enterprise#4504 (comment)

I don't think it's in 3.16.0 though, looking @ scylladb/cassandra-stress@v3.16.0...master - can you please verify?

@michoecho (Contributor) commented Nov 21, 2024

I don't think it's in 3.16.0 though, looking @ scylladb/cassandra-stress@v3.16.0...master - can you please verify?

@mykaul Considering that the relevant commit is physically younger than the release (2024-11-19 vs 2024-11-18), it probably isn't in the release.

But, by the way, while the coordinator imbalance is the direct trigger for the failure of the test, it's not the only big issue. The next one is replica imbalance. For example, in run e7d86441-8c7d-4798-9348-ea8924a2f608 (from https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2490768963), look at the time range between 17:00 and 18:00. This is where the test is running a read-only load on a freshly-populated cluster, before any topology operations. Coordinators are balanced, (the coordinator balance only breaks down at 18:00 after topology operations, due to the java driver problem), but replicas are very unbalanced. The difference in load between shards should be negligible, but instead the difference between the most-loaded and least-loaded shard is 14k/s vs 5 k/s. The test survives this (with difficulty), but it's (obviously?) not acceptable.

Image

In other words, fixing the client issue might fix the timeouts, but there are more problems to fix.

@swasik commented Nov 21, 2024

I don't think it's in 3.16.0 though, looking @ scylladb/cassandra-stress@v3.16.0...master - can you please verify?

It seems that the fix was included in 3.17, not 3.16. @juliayakovlev could you retry with 3.17?

@mykaul (Contributor) commented Nov 21, 2024

@michoecho - could the issue you described @ #20921 (comment) be explained by low tablet count?

@michoecho (Contributor) commented Nov 21, 2024

@michoecho - could the issue you described @ #20921 (comment) be explained by low tablet count?

No.

Low tablet count can cause balance issues proportional to the difference in tablet count. For example, if one shard replicates two tablets and another shard replicates one tablet, it makes sense that the first shard receives 2x as many requests as the second shard.

But in this case the inherent imbalance due to "low tablet count" is 18 tablets on the least-loaded shard vs 19 tablets on the most-loaded shard. That can explain a 5% imbalance in replica reads, but not 300%.

I'd guess this is a dumber problem. With tablets, does Scylla even try to balance reads across replicas?

Perhaps storage_proxy just directs the requests to the first two replicas listed for the given key. With vnodes, the listed replicas are naturally "shuffled" across different token ranges with the same replica set, so taking the first two replicas naturally results in good balance, so storage_proxy's method was good enough. But perhaps with tablets the order of replicas isn't shuffled, but sorted or completely accidental, so replica requests aren't balanced at all?
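
A back-of-the-envelope check of the argument above, using the ~14k/s vs ~5k/s per-shard rates quoted earlier in the thread (plain arithmetic, nothing Scylla-specific):

# Imbalance explainable by tablet count alone (19 vs 18 tablets per shard)...
tablet_count_imbalance = 19 / 18 - 1
print(f"imbalance explainable by tablet count: {tablet_count_imbalance:.1%}")  # ~5.6%

# ...versus the observed replica-read imbalance (~14k/s vs ~5k/s per shard).
observed_imbalance = 14_000 / 5_000 - 1
print(f"observed replica-read imbalance: {observed_imbalance:.0%}")  # 180%, i.e. roughly 3x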

@denesb (Contributor) commented Nov 21, 2024

@michoecho - could the issue you described @ #20921 (comment) be explained by low tablet count?

No.

Low tablet count can cause balance issues proportional to the difference in tablet count. For example, if one shard replicates two tablets and another shard replicates one tablet, it makes sense that the first shard receives 2x as many requests as the second shard.

But in this case the inherent imbalance due to "low tablet count" is 18 tablets on the least-loaded shard vs 19 tablets on the most-loaded shard. That can explain a 5% imbalance in replica reads, but not 300%.

I'd guess this is a dumber problem. With tablets, does Scylla even try to balance reads across replicas?

Perhaps storage_proxy just directs the requests to the first two replicas listed for the given key. With vnodes, the listed replicas are naturally "shuffled" across different token ranges with the same replica set, so taking the first two replicas naturally results in good balance, so storage_proxy's method was good enough. But perhaps with tablets the order of replicas isn't shuffled, but sorted or completely accidental, so replica requests aren't balanced at all?

@michoecho note that even if all shards across all nodes have the exact same number of tablets, there can still be imbalance due to partition sizes not being uniform and/or their distribution across the tablets not being uniform.
I investigated an issue today where the tablet imbalance was +/- 1 like here, but one node was still suffering because it had tablets with more data in them. The imbalance was obvious once we looked at the data size on disk.

A larger number of tablets helps because the odds of a tablet being significantly different in size are reduced.
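
A toy illustration of that last point, with made-up random tablet sizes (nothing Scylla-specific): the more tablets each shard holds, the smaller the relative spread between shard totals.

import random
import statistics

random.seed(0)

def shard_loads(tablets_per_shard, shards=7):
    # Each tablet gets a random "size"; a shard's load is the sum of its tablets.
    return [sum(random.lognormvariate(0, 0.5) for _ in range(tablets_per_shard))
            for _ in range(shards)]

for n in (2, 18, 180):
    loads = shard_loads(n)
    spread = (max(loads) - min(loads)) / statistics.mean(loads)
    print(f"{n:4d} tablets/shard -> (max - min) / mean ~= {spread:.0%}")

With only a couple of tablets per shard, a single oversized tablet dominates its shard's load; with hundreds, the differences average out.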

@michoecho (Contributor) commented Nov 21, 2024

@michoecho - could the issue you described @ #20921 (comment) be explained by low tablet count?

No.
Low tablet count can cause balance issues proportional to the difference in tablet count. For example, if one shard replicates two tablets and another shard replicates one tablet, it makes sense that the first shard receives 2x as many requests as the second shard.
But in this case the inherent imbalance due to "low tablet count" is 18 tablets on the least-loaded shard vs 19 tablets on the most-loaded shard. That can explain a 5% imbalance in replica reads, but not 300%.
I'd guess this is a dumber problem. With tablets, does Scylla even try to balance reads across replicas?
Perhaps storage_proxy just directs the requests to the first two replicas listed for the given key. With vnodes, the listed replicas are naturally "shuffled" across different token ranges with the same replica set, so taking the first two replicas naturally results in good balance, so storage_proxy's method was good enough. But perhaps with tablets the order of replicas isn't shuffled, but sorted or completely accidental, so replica requests aren't balanced at all?

@michoecho note that even if all shards across all nodes have the exact same number of tablets, there can still be imbalance due to partition sizes not being uniform and/or their distribution across the tablets not being uniform. I investigated an issue today where the tablet imbalance was +/- 1 like here, but one node was still suffering because it had tablets with more data in them. The imbalance was obvious once we looked at the data size on disk.

A larger number of tablets helps because the odds of a tablet being significantly different in size are reduced.

@denesb That might be true in general, but it doesn't apply to this test. Here we are dealing with just the usual cassandra-stress load. All partitions are small and equally-sized. In those conditions request load is balanced across tablets near-uniformly. Scylla should translate that to a good balance of replica requests across shards, but it doesn't.

@bhalevy (Member) commented Nov 26, 2024

@swasik asked @juliayakovlev to retest this with the fix to c-stress.

@juliayakovlev (Author)

@swasik asked @juliayakovlev to retest this with the fix to c-stress.

The original AMI (from this test) does not exist anymore. I ran the test with cassandra-stress 3.17.0 and Scylla version 6.3.0~dev-20241020

@juliayakovlev (Author)

@swasik asked @juliayakovlev to retest this with the fix to c-stress.

@swasik @bhalevy

Test results with 3.17.0 c-s version

https://argus.scylladb.com/tests/scylla-cluster-tests/32cbafca-e613-4545-a649-93798dfc6c05

The existing (not newly added) nodes still run at about 100% CPU. The ReadTimeout error is not reproduced.

Image

@swasik commented Dec 2, 2024

The existing (not newly added) nodes still run at about 100% CPU. The ReadTimeout error is not reproduced.

Great, so this is in line with what @michoecho predicted in #20921 (comment) and suggests that at least the problem in the driver is solved. Now we have two more issues to solve, as suggested in https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2499242192.

@bhalevy (Member) commented Dec 3, 2024

@swasik any reason to keep this issue open then?
(while we have scylladb/scylla-enterprise#4504 to track the scylla-side issue)

@swasik commented Dec 3, 2024

@swasik any reason to keep this issue open then? (while we have scylladb/scylla-enterprise#4504 to track the scylla-side issue)

I think we can close it and keep tracking the remaining performance issues in https://github.com/scylladb/scylla-enterprise/issues/4504.

@swasik closed this as completed on Dec 3, 2024
@juliayakovlev (Author)

Reproduced again, causing very high latency:

Image

Image

Packages

Scylla version: 6.3.0~dev-20241129.65949ce60780 with build-id d0921e78443678667ebaf5d8cdfda19428d03e6c

Kernel Version: 6.8.0-1019-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.


Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • elasticity-test-ubuntu-db-node-75894e25-6 (34.207.193.230 | 10.12.7.157) (shards: 7)
  • elasticity-test-ubuntu-db-node-75894e25-5 (3.95.168.5 | 10.12.7.200) (shards: 7)
  • elasticity-test-ubuntu-db-node-75894e25-4 (54.204.221.148 | 10.12.5.123) (shards: 7)
  • elasticity-test-ubuntu-db-node-75894e25-3 (44.207.8.144 | 10.12.7.105) (shards: 7)
  • elasticity-test-ubuntu-db-node-75894e25-2 (54.234.247.48 | 10.12.7.147) (shards: 7)
  • elasticity-test-ubuntu-db-node-75894e25-1 (3.95.225.180 | 10.12.6.41) (shards: 7)

OS / Image: ami-09535e6a78b3a7b32 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-elasticity
Test id: 75894e25-15cb-4919-a57c-78cf6756fdb8
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 75894e25-15cb-4919-a57c-78cf6756fdb8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 75894e25-15cb-4919-a57c-78cf6756fdb8

Logs:

Jenkins job URL
Argus
