unable to create a repair plan #4220

Open
redimp opened this issue Jan 20, 2025 · 3 comments

Comments

redimp commented Jan 20, 2025

Dear ScyllaDB team, thanks for the awesome work!

We are facing a problem: scylla-operator (scylladb/scylla-operator:1.15.0) tries to create the repair task configured via Helm with the following values

repairs:
- name: "name repair"
  keyspace: ["name"]
  cron: "05 23 * * *"
  intensity: "1"
  parallel: 3
  timezone: "UTC"

but scylladb/scylla-manager:3.4.0 fails to create it:

{
  "L": "INFO",
  "T": "2025-01-20T07:42:59.417Z",
  "N": "http",
  "M": "POST /api/v1/cluster/a56ee9b2-b094-4ff1-b5f6-2e881f13d432/tasks",
  "from": "10.42.22.218:38446",
  "status": 500,
  "bytes": 184,
  "duration": "15063ms",
  "error": "create repair target: create repair plan: calculate max host intensity: 10.43.145.249: get shard count: context canceled",
  "_trace_id": "D4Bd8m52SFyZCdRSYQ8X0g"
}

Manually creating the repair task via

sctool repair --cluster scylla/scylla

works as expected.
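
For comparison, the same options as in the Helm values above can presumably be passed on the command line as well (flag names as in sctool 3.x; please double-check against sctool repair --help for your version):

sctool repair --cluster scylla/scylla \
  --keyspace "name" \
  --intensity 1 \
  --parallel 3 \
  --cron "05 23 * * *"

The scheduled task should then show up in sctool tasks --cluster scylla/scylla.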

sctool status looks happy to me, too:

+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
|    | CQL        | REST     | Address       | Uptime     | CPUs | Memory     | Scylla | Agent | Host ID                              |
+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
| UN | UP (8ms)   | UP (0ms) | 10.43.142.221 | 308h55m31s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | c7de0c8c-4490-42ce-a42e-088b7a19a346 |
| UN | UP (105ms) | UP (0ms) | 10.43.145.249 | 289h36m45s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | d1784a53-2e45-4735-bf63-0cae13ee8cc6 |
| UN | UP (17ms)  | UP (0ms) | 10.43.145.51  | 289h36m21s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | ec0f1c78-5fb3-4fac-89cd-1e03c1c355e0 |
| UN | UP (8ms)   | UP (0ms) | 10.43.188.44  | 289h8m14s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 0895f1b8-2fc9-456f-924c-46c5bf80c178 |
| UN | UP (9ms)   | UP (0ms) | 10.43.210.186 | 306h14m21s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 767ff78c-1162-456b-a1e1-757bb1acef25 |
| UN | UP (8ms)   | UP (0ms) | 10.43.223.83  | 285h19m0s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 585ea5d4-fda1-4951-8f98-0bacfd0dbfc9 |
| UN | UP (11ms)  | UP (0ms) | 10.43.227.115 | 289h36m45s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 1108c32a-741f-4bc6-9e50-4c250c4acb22 |
| UN | UP (7ms)   | UP (0ms) | 10.43.248.47  | 289h8m12s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | b4311c62-6203-4e0a-a2a1-0e3f9220ad74 |
| UN | UP (8ms)   | UP (0ms) | 10.43.59.192  | 290h0m2s   | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 98526246-bf13-4697-9ae3-fe977b6f8c8a |
+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+

How can we find the cause of this error?

Michal-Leszczynski (Collaborator) commented

It looks like node 10.43.145.249 is slow to respond: it does not return its shard count within 15s, which is probably the timeout set on the Operator side.
The default timeout set by Scylla Manager is 30s, which is perhaps why scheduling the repair with sctool works fine.
The node also has a high CQL ping time (105ms compared to the ~10ms average), so maybe something is going on with the node itself.
In any case, if you increase the timeout on the Operator side, it should behave the same as scheduling the repair via sctool.
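
To illustrate the suspected mechanism with a minimal, generic Go sketch (none of the names below come from scylla-operator or scylla-manager): when the client gives up on the POST /tasks request after its timeout, the server-side request context is canceled, and the in-flight shard-count call then fails with "context canceled".

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowShardCount stands in for the Manager asking a node for its shard count.
func slowShardCount(ctx context.Context) (int, error) {
	select {
	case <-time.After(30 * time.Second): // pretend the node answers very slowly
		return 96, nil
	case <-ctx.Done():
		return 0, fmt.Errorf("get shard count: %w", ctx.Err())
	}
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Context() is canceled as soon as the client aborts the request.
		if _, err := slowShardCount(r.Context()); err != nil {
			fmt.Println("server:", err) // "get shard count: context canceled"
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
	}))
	defer srv.Close()

	// Hypothetical 15s client-side timeout, mirroring the ~15s durations in the logs above.
	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Post(srv.URL+"/api/v1/cluster/some-id/tasks", "application/json", nil)
	if err != nil {
		fmt.Println("client:", err) // the client sees a timeout, the server logs the cancellation
		return
	}
	resp.Body.Close()
}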

cc: @rzetelskik

redimp commented Jan 21, 2025

Thanks for looking into this.

> In any case, if you increase the timeout on the Operator side, it should behave the same as scheduling the repair via sctool.

Happy to do so, but I could not find where to configure this in the documentation. I would appreciate a pointer.

Side note: The higher CQL ping time seems to have been a hiccup.

+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
|    | CQL      | REST     | Address       | Uptime     | CPUs | Memory     | Scylla | Agent | Host ID                              |
+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
| UN | UP (7ms) | UP (0ms) | 10.43.142.221 | 332h24m39s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | c7de0c8c-4490-42ce-a42e-088b7a19a346 |
| UN | UP (7ms) | UP (0ms) | 10.43.145.249 | 313h5m53s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | d1784a53-2e45-4735-bf63-0cae13ee8cc6 |
| UN | UP (7ms) | UP (0ms) | 10.43.145.51  | 313h5m29s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | ec0f1c78-5fb3-4fac-89cd-1e03c1c355e0 |
| UN | UP (8ms) | UP (0ms) | 10.43.188.44  | 312h37m22s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 0895f1b8-2fc9-456f-924c-46c5bf80c178 |
| UN | UP (8ms) | UP (0ms) | 10.43.210.186 | 329h43m29s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 767ff78c-1162-456b-a1e1-757bb1acef25 |
| UN | UP (7ms) | UP (0ms) | 10.43.223.83  | 308h48m8s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 585ea5d4-fda1-4951-8f98-0bacfd0dbfc9 |
| UN | UP (7ms) | UP (0ms) | 10.43.227.115 | 313h5m53s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 1108c32a-741f-4bc6-9e50-4c250c4acb22 |
| UN | UP (8ms) | UP (0ms) | 10.43.248.47  | 312h37m20s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | b4311c62-6203-4e0a-a2a1-0e3f9220ad74 |
| UN | UP (9ms) | UP (0ms) | 10.43.59.192  | 313h29m10s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 98526246-bf13-4697-9ae3-fe977b6f8c8a |
+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+

Despite the lower ping times, the error persists, and it happens for different nodes as well, e.g.:

{
  "L": "INFO",
  "T": "2025-01-21T10:47:00.623Z",
  "N": "http",
  "M": "POST /api/v1/cluster/a56ee9b2-b094-4ff1-b5f6-2e881f13d432/tasks",
  "from": "10.42.22.218:58408",
  "status": 500,
  "bytes": 183,
  "duration": "15106ms",
  "error": "create repair target: create repair plan: calculate max host intensity: 10.43.145.51: get shard count: context canceled",
  "_trace_id": "9S5Tft7qTEiC_o4H6nciPA"
}

rzetelskik (Member) commented Jan 22, 2025

> Happy to do so, but I could not find where to configure this in the documentation. I would appreciate a pointer.

We don't expose it; see https://github.com/scylladb/scylla-operator/blob/09076d5f1bc819a30aae71bc1c7ce07a12b877bc/pkg/cmd/operator/manager_controller.go#L128-L131 and https://github.com/scylladb/scylla-operator/blob/09076d5f1bc819a30aae71bc1c7ce07a12b877bc/pkg/controller/manager/controller.go#L35-L41.
Manager API calls are synchronous and retried internally, which doesn't go well with controllers, since they are not supposed to wait for long-running operations. I don't think raising the timeout is the way to go here; it's already set to a very high value. What's worth looking into is why it would take more than 15s in the first place. @Michal-Leszczynski

@redimp, in the meantime please collect a must-gather archive so we can have some more context.
