Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow watchers impact PUT latency #18109

Open
4 tasks done
chaochn47 opened this issue Jun 2, 2024 · 11 comments
Open
4 tasks done

Slow watchers impact PUT latency #18109

chaochn47 opened this issue Jun 2, 2024 · 11 comments

Comments

@chaochn47
Copy link
Member

chaochn47 commented Jun 2, 2024

Bug report criteria

What happened?

Debugging a k8s scale test failures and found that applying mutation requests (write transaction) could be delayed up to 1 - 2 seconds, which breaches the upstream SLO in the clusterloader2 test SLO measurement.

cc @hakuna-matatah

Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes

In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day1 <= 1s

{
    "level": "info",
    "ts": "2024-05-30T07:25:09.038516Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[734230629] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" mod_revision:66816841 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" value_size:577 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" > >; read_only:false; response_revision:66831091; number_of_response:1; applied-index:66997644; }",
    "duration": "1.146904799s",
    "start": "2024-05-30T07:25:07.891574Z",
    "end": "2024-05-30T07:25:09.038479Z",
    "steps": [
        "trace[734230629] 'register request to wait'  (duration: 14.096µs)",
        "trace[734230629] 'leader propose request'  (duration: 4.609235ms)",
        "trace[734230629] 'follower receive request'  (duration: 3.222613ms)",
        "trace[734230629] 'apply start'  (duration: 8.69631ms)",
        "trace[734230629] 'compare'  (duration: 22.423µs)",
        "trace[734230629] 'check requests'  (duration: 2.474µs)",
        "trace[734230629] 'get key's previous created_revision and leaseID' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 12.374µs)",
        "trace[734230629] 'marshal mvccpb.KeyValue' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 1.589µs)",
        "trace[734230629] 'store kv pair into bolt db' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 8.567µs)",
        "trace[734230629] 'attach lease to kv pair' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 243ns)",
        "trace[734230629] 'applied transaction'  (duration: 1.466µs)",
        "trace[734230629] 'end transaction'  (duration: 1.129944403s)"
    ],
    "step_count": 12
}

The slow end transaction step was caused by the watchableStore.mutex lock was accquired by the syncWatchers process.

{
    "level": "info",
    "ts": "2024-05-31T08:06:25.099259Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[1699572871] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" mod_revision:117957991 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" value_size:565 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" > >; read_only:false; response_revision:117967637; number_of_response:1; applied-index:118404076; }",
    "duration": "1.291462327s",
    "start": "2024-05-31T08:06:23.807769Z",
    "end": "2024-05-31T08:06:25.099231Z",
    "steps": [
        "trace[1699572871] 'apply start'  (duration: 389.561483ms)",
        "trace[1699572871] 'after lock watchableStore mutex'  (duration: 897.589474ms)"
    ],
    "step_count": 2
}
{
    "level": "warn",
    "ts": "2024-05-31T08:06:23.086653Z",
    "caller": "mvcc/watchable_store.go:230",
    "msg": "slow sync watchers process",
    "took": "905.879353ms",
    "expected-duration-threshold": "100ms"
}

What did you expect to happen?

I would like the syncWatcher to be completed within 100ms and not holding the lock too long.

How can we reproduce it (as minimally and precisely as possible)?

I can work on a new benchmark cmd to simulate. As long as put enough writes (qps and throughput) to etcd and have a watch established in this key prefix, the reproduce could be archived.

Anything else we need to know?

Exploring options

  1. Increase the hardcoded chanBufLen = 128 which would put events in the sending buffer, so the watchers won't land into as unsyncd watchers in the first place.
  2. Optimize syncWatcher latency for each run, for example, limit the number of events to read from backend.
  3. Lower the frequency of running syncWatcher so the e2e latency would have less jitters and p99 value would be better.
  4. Optimize the throughput of func (sws *serverWatchStream) sendLoop() strategically.

Option 1 and 2 is helpful to cut down the mutating requests latency to 0.2s with the same k8s scale test repeated runs. Option 3 is not.

Etcd version (please run commands below)

All supported etcd versions

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@chaochn47
Copy link
Member Author

Reading trace logs is difficult if the timestamp is disordered. It was improved by #15239 but we could do better with #18108.

@serathius
Copy link
Member

serathius commented Jun 3, 2024

Please link the scalability test failure you were debugging, I'm on this SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proved to be a regression this cannot be treated as bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for Kubernetes 5k node scalability testing results.

Performance improvements to slow watchers are discussed in #16839

@serathius serathius removed the type/bug label Jun 3, 2024
@chaochn47
Copy link
Member Author

chaochn47 commented Jun 3, 2024

Thanks for looking into it. The cluster set up is unique which does not segregate events from the main etcd cluster. So the existing upstream GCE and AWS 5k nodes scale tests are passing now consistently. I believe once the flag --etcd-servers-overrides is removed, test should start to fail intermittently.

I think it's worth diving into the root cause of performance degradation as it would still be helpful when etcd write qps / throughput needs to be lifted up in the future. For example, increasing the default work queue concurrent sync numbers in kube-controller-manager as it would create more mutating requests to etcd iiuc.

Please also note that until this is proved to be a regression this cannot be treated as bug

Agreed +1 That's not a bug but a report in which circumstances etcd mutating requests could take more than 1s instead. I would come up with a benchmark test to simulate the traffic with etcd test framework only.

@hakuna-matatah
Copy link

Please link the scalability test failure you were debugging, I'm on this SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proved to be a regression this cannot be treated as bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for Kubernetes 5k node scalability testing results.

@serathius Yes, these tests are run internally not in upstream in a different setup mode ^^^, so which is why you wouldn't be seeing them on test-grid. I was running into performance issue from etcd on upstream load tests AWS/KOPS IIRC in the past when i didn't split the events as a separate etcd cluster. After events was split here, I didn't see this appearing on AWS/KOPS load tests which are run here periodically. I believe we should be able to reproduce performance bottleneck on upstream tests as well if we don't split events in case if we want.

The meta point here is to improve the performance/throughput so we can stretch the single cluster bit more than what it can do today.

@serathius
Copy link
Member

Thanks for looking into it. The cluster set up is unique which does not segregate events from the main etcd cluster. So the existing upstream GCE and AWS 5k nodes scale tests are passing now consistently. I believe once the flag --etcd-servers-overrides is removed, test should start to fail intermittently.

I expect this is an issue of events not using watch cache, however I don't see a reason to invest into an area that K8s has a official mitigation. If you want to remove --etcd-servers-overrides, please start form discussion in K8s.

@chaochn47
Copy link
Member Author

If you want to remove --etcd-servers-overrides, please start form discussion in K8s.

Just to clarify, remove --etcd-servers-overrides is not the intention.

We are trying to figure out the root cause how etcd handling mutating requests becomes slower than 1s with high write qps/throughput. It just happened on events and it could also happen on other resources / key prefixes.

@serathius
Copy link
Member

Right, the reason is simple, the event resource has watch cache disabled, meaning it is still vulnerable to kubernetes/kubernetes#123448. If your are not sharding K8s events out, they pollute your other resources.

@chaochn47
Copy link
Member Author

chaochn47 commented Jun 3, 2024

Right, the reason is simple, the event resource has watch cache disabled, meaning it is still vulnerable to kubernetes/kubernetes#123448. If your are not sharding K8s events out, they pollute your other resources.

Enabling watch cache only protects from N direct etcd watches polluting other resources. Problem identified in this investigation is there is only 1 direct etcd watch on events and with enough write events throughput, it pollute other resources.

Hence the following statement was raised.

It just happened on events and it could also happen on other resources / key prefixes.

I think the debate could be closed with a reproduce because it's easier for us to understand our arguments.

@chaochn47
Copy link
Member Author

chaochn47 commented Jun 4, 2024

@serathius could you please take a look at the reproduce #18121 since the work is based on your watch latency perf benchmark tool?

Edit: This is not a proper reproduce of the issue we have seen in the clusterloader2 test.

Thanks!!

/cc @hakuna-matatah @mengqiy

@serathius serathius changed the title Performance degradation with slow watchers Slow watchers impact PUT latency Jun 4, 2024
@serathius
Copy link
Member

serathius commented Jun 4, 2024

I don't think there is anything surprising about slow watchers impacting PUT, etcd notifies synchronized watchers as part of committing transaction in apply loop. I don't we change it anytime soon as it's an assumption very ingrained in the code.

@jokerwyt
Copy link

jokerwyt commented Dec 12, 2024

Just for your information, I encountered a similar issue when creating pods at a rate of 10,000 per second. Upon using ten watchers that solely monitored events, I observed a 50% decrease in ETCD PUT throughput (20000->10000).

EDIT: I am curious about the assumption @serathius mentioned. Are all watchers treated as "synchronized watchers" and should they be informed on the critical path of PUT operation? I am not familiar with that concept. Thank you!

EDIT AGAIN: I found that the throughput rises back to 18000 after I use seperate etcd client (i.e. seperated conn pools) to make PUT operations. (before that, my many go routinues use one etcd client with default setting) Seems like the issues may be related to connection isolation, or some tricky behaviors of go etcd client inner connection pool...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants