unable to create a repair plan #4220

Open
redimp opened this issue Jan 20, 2025 · 3 comments

Comments

redimp commented Jan 20, 2025

Dear ScyllaDB team, thanks for the awesome work!

We are facing a problem: scylla-operator (scylladb/scylla-operator:1.15.0) tries to create the repair task configured via Helm with the following values

repairs:
- name: "name repair"
  keyspace: ["name"]
  cron: "05 23 * * *"
  intensity: "1"
  parallel: 3
  timezone: "UTC"

but scylladb/scylla-manager:3.4.0 fails to create it:

{
  "L": "INFO",
  "T": "2025-01-20T07:42:59.417Z",
  "N": "http",
  "M": "POST /api/v1/cluster/a56ee9b2-b094-4ff1-b5f6-2e881f13d432/tasks",
  "from": "10.42.22.218:38446",
  "status": 500,
  "bytes": 184,
  "duration": "15063ms",
  "error": "create repair target: create repair plan: calculate max host intensity: 10.43.145.249: get shard count: context canceled",
  "_trace_id": "D4Bd8m52SFyZCdRSYQ8X0g"
}

Manually creating the repair task via

sctool repair --cluster scylla/scylla

works as expected.
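
For comparison, the same options as in the Helm values above can presumably be passed on the command line as well (flag names as in sctool 3.x; please double-check against sctool repair --help for your version):

sctool repair --cluster scylla/scylla \
  --keyspace "name" \
  --intensity 1 \
  --parallel 3 \
  --cron "05 23 * * *"

The scheduled task should then show up in sctool tasks --cluster scylla/scylla.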

sctool status looks happy to me, too:

+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
|    | CQL        | REST     | Address       | Uptime     | CPUs | Memory     | Scylla | Agent | Host ID                              |
+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
| UN | UP (8ms)   | UP (0ms) | 10.43.142.221 | 308h55m31s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | c7de0c8c-4490-42ce-a42e-088b7a19a346 |
| UN | UP (105ms) | UP (0ms) | 10.43.145.249 | 289h36m45s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | d1784a53-2e45-4735-bf63-0cae13ee8cc6 |
| UN | UP (17ms)  | UP (0ms) | 10.43.145.51  | 289h36m21s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | ec0f1c78-5fb3-4fac-89cd-1e03c1c355e0 |
| UN | UP (8ms)   | UP (0ms) | 10.43.188.44  | 289h8m14s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 0895f1b8-2fc9-456f-924c-46c5bf80c178 |
| UN | UP (9ms)   | UP (0ms) | 10.43.210.186 | 306h14m21s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 767ff78c-1162-456b-a1e1-757bb1acef25 |
| UN | UP (8ms)   | UP (0ms) | 10.43.223.83  | 285h19m0s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 585ea5d4-fda1-4951-8f98-0bacfd0dbfc9 |
| UN | UP (11ms)  | UP (0ms) | 10.43.227.115 | 289h36m45s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 1108c32a-741f-4bc6-9e50-4c250c4acb22 |
| UN | UP (7ms)   | UP (0ms) | 10.43.248.47  | 289h8m12s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | b4311c62-6203-4e0a-a2a1-0e3f9220ad74 |
| UN | UP (8ms)   | UP (0ms) | 10.43.59.192  | 290h0m2s   | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 98526246-bf13-4697-9ae3-fe977b6f8c8a |
+----+------------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+

How can we find the cause of this error?

Michal-Leszczynski (Collaborator) commented

It looks like node 10.43.145.249 is slow to respond: it does not return its shard count within 15s, which is probably the timeout set on the Operator side.
The default timeout set by Scylla Manager is 30s, which is perhaps why scheduling the repair with sctool works fine.
The node also has a high CQL ping time (105ms compared to the ~10ms average), so maybe something is going on with the node itself.
In any case, if you increase the timeout on the Operator side, it should behave the same as scheduling the repair via sctool.
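
To illustrate the suspected mechanism with a minimal, generic Go sketch (none of the names below come from scylla-operator or scylla-manager): when the client gives up on the POST /tasks request after its timeout, the server-side request context is canceled, and the in-flight shard-count call then fails with "context canceled".

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowShardCount stands in for the Manager asking a node for its shard count.
func slowShardCount(ctx context.Context) (int, error) {
	select {
	case <-time.After(30 * time.Second): // pretend the node answers very slowly
		return 96, nil
	case <-ctx.Done():
		return 0, fmt.Errorf("get shard count: %w", ctx.Err())
	}
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Context() is canceled as soon as the client aborts the request.
		if _, err := slowShardCount(r.Context()); err != nil {
			fmt.Println("server:", err) // "get shard count: context canceled"
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
	}))
	defer srv.Close()

	// Hypothetical 15s client-side timeout, mirroring the ~15s durations in the logs above.
	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Post(srv.URL+"/api/v1/cluster/some-id/tasks", "application/json", nil)
	if err != nil {
		fmt.Println("client:", err) // the client sees a timeout, the server logs the cancellation
		return
	}
	resp.Body.Close()
}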

cc: @rzetelskik

redimp commented Jan 21, 2025

Thanks for looking into this.

> In any case, if you increase the timeout on the Operator side, it should behave the same as scheduling the repair via sctool.

Happy to do so, but I could not find where to configure this in the documentation. I would appreciate a pointer.

Side note: The higher CQL ping time seems to have been a hiccup.

+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
|    | CQL      | REST     | Address       | Uptime     | CPUs | Memory     | Scylla | Agent | Host ID                              |
+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+
| UN | UP (7ms) | UP (0ms) | 10.43.142.221 | 332h24m39s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | c7de0c8c-4490-42ce-a42e-088b7a19a346 |
| UN | UP (7ms) | UP (0ms) | 10.43.145.249 | 313h5m53s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | d1784a53-2e45-4735-bf63-0cae13ee8cc6 |
| UN | UP (7ms) | UP (0ms) | 10.43.145.51  | 313h5m29s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | ec0f1c78-5fb3-4fac-89cd-1e03c1c355e0 |
| UN | UP (8ms) | UP (0ms) | 10.43.188.44  | 312h37m22s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 0895f1b8-2fc9-456f-924c-46c5bf80c178 |
| UN | UP (8ms) | UP (0ms) | 10.43.210.186 | 329h43m29s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 767ff78c-1162-456b-a1e1-757bb1acef25 |
| UN | UP (7ms) | UP (0ms) | 10.43.223.83  | 308h48m8s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 585ea5d4-fda1-4951-8f98-0bacfd0dbfc9 |
| UN | UP (7ms) | UP (0ms) | 10.43.227.115 | 313h5m53s  | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 1108c32a-741f-4bc6-9e50-4c250c4acb22 |
| UN | UP (8ms) | UP (0ms) | 10.43.248.47  | 312h37m20s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | b4311c62-6203-4e0a-a2a1-0e3f9220ad74 |
| UN | UP (9ms) | UP (0ms) | 10.43.59.192  | 313h29m10s | 96   | 125.480GiB | 6.2.0  | 3.4.1 | 98526246-bf13-4697-9ae3-fe977b6f8c8a |
+----+----------+----------+---------------+------------+------+------------+--------+-------+--------------------------------------+

Despite the lower ping times, the error persists, and it happens for different nodes as well, e.g.:

{
  "L": "INFO",
  "T": "2025-01-21T10:47:00.623Z",
  "N": "http",
  "M": "POST /api/v1/cluster/a56ee9b2-b094-4ff1-b5f6-2e881f13d432/tasks",
  "from": "10.42.22.218:58408",
  "status": 500,
  "bytes": 183,
  "duration": "15106ms",
  "error": "create repair target: create repair plan: calculate max host intensity: 10.43.145.51: get shard count: context canceled",
  "_trace_id": "9S5Tft7qTEiC_o4H6nciPA"
}

rzetelskik (Member) commented Jan 22, 2025

> Happy to do so, but I could not find where to configure this in the documentation. I would appreciate a pointer.

We don't expose it; see https://github.com/scylladb/scylla-operator/blob/09076d5f1bc819a30aae71bc1c7ce07a12b877bc/pkg/cmd/operator/manager_controller.go#L128-L131 and https://github.com/scylladb/scylla-operator/blob/09076d5f1bc819a30aae71bc1c7ce07a12b877bc/pkg/controller/manager/controller.go#L35-L41.
Manager API calls are synchronous and retried internally, which doesn't go well with controllers, since they are not supposed to wait for long-running operations. I don't think raising the timeout is the way to go here; it's already set to a very high value. What's worth looking into is why it would take more than 15s in the first place. @Michal-Leszczynski

@redimp, in the meantime please collect a must-gather archive so we can have some more context.
