Celery worker fix liveness check #148
I tried adding this probe to OAB:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - celery --workdir src --app openarchiefbeheer.celery inspect active
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```

This did not seem to interrupt the long-running tasks, and when I shut down Redis from the redis-cli inside the pod, the worker was also restarted after the Redis pod came back up. The command …
@alextreme Let's discuss.
Discussed: both the inspect active and the ping healthchecks are necessary, as they monitor and avoid two different failure scenarios (whether the task queue is active and reachable vs. whether tasks can actually be performed). I'd like to have both liveness probes in each chart with Celery, and to use higher timeouts for the ping healthcheck if there are long-running Celery tasks.
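For illustration only: a Kubernetes container accepts a single livenessProbe, so one way to get both checks is to chain the two commands in one exec probe. A rough sketch, reusing the OAB app module from the example above; all timings are made-up values, not chart defaults:

```yaml
# Sketch: both healthchecks chained in a single exec liveness probe.
# The --app module is the OAB one from the earlier example; swap it per
# component. Timings are illustrative and should be generous when
# long-running tasks are expected.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >-
        celery --workdir src --app openarchiefbeheer.celery inspect active &&
        celery --workdir src --app openarchiefbeheer.celery inspect ping -d celery@$HOSTNAME
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
  successThreshold: 1
```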
I am not sure I understand why the extra ping is needed. If the worker isn't working properly, the …
No, you are not missing something; this is exactly the scenario that I have in mind. In these cases there should be a Celery soft/hard time limit on the task (https://docs.celeryq.dev/en/stable/userguide/workers.html#time-limits), but just like setting a timeout on requests calls, it is easy to forget and has led to issues in the past where Celery workers required manual restarting. And even if you neatly put time limits on your own tasks, third-party Celery tasks can also be enabled, and we don't vet those tasks for having time limits set.
I see 🤔 but if it is forgotten, it is not difficult to add to the projects? And with the project-level settings you can set the time limit for all tasks (so I think this would also take care of the third-party libraries that run tasks?). I am just a bit reluctant to add the ping to all projects as a blanket rule, because projects like OAB, which have long-running tasks, would then struggle with the ping probe 🤔
Well, I guess it depends on how long the tasks are expected to run. With OIP we have a 15-minute hard task limit (https://github.com/maykinmedia/open-inwoner/blob/fbae2b82d82107a9e1e615c7ff801f851533e8c7/src/open_inwoner/conf/base.py#L687), although that limit is configurable via an envvar. Then again, because of this the need for a ping healthcheck is indeed zero. So let's add the inspect active healthcheck to all Celery-using components, and decide on a component-per-component basis which of these situations applies: a global hard task limit, or long-running tasks whose duration can't be properly determined a priori. If either of these is applicable, then the ping healthcheck is to be removed.
Sounds good, I will add them 👍
Done in the context of issue maykinmedia/charts#148
Done in the context of maykinmedia/charts#148. This prevents a task from hanging indefinitely: a hanging task gets killed by the Celery time limit.
OAB has long-running tasks (>2h), so only the celery inspect active probe is added.
The Objects API has a Celery task limit of 5 min (soft) and 15 min (hard), so no ping was added.
Open Forms has task limits of 5 minutes (soft) and 15 minutes (hard), so no ping was added.
OIP has a hard task limit of 15 min, so no ping probe was added.
Timeouts added in maykinmedia/open-klant#316
Open Zaak has one long-running task for which we don't know how long it will take, so no ping probe was added. The task: https://github.com/open-zaak/open-zaak/blob/70449f11a31a0e8a06d9e47d49e095b31d1068f6/src/openzaak/components/documenten/tasks.py#L333
Create a celery worker liveness probe that, when celery commands fail with `Error: No nodes replied within time constraint`, should definitely NOT update the `/app/tmp/celery_worker_heartbeat` file.

Problem description:
There have been multiple incidents relating to the issue where celery stops processing tasks, see: celery/celery#7276
The issue was first reported for openforms: open-formulieren/open-forms#2927
A liveness probe health check was created to work around the issue: open-formulieren/open-forms#3014
This health check has now been implemented in all components using celery.
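For reference, the check from the linked PR roughly follows the heartbeat-file pattern from the article linked further below: the worker side keeps touching a heartbeat file, and the liveness probe only tests that the file is fresh. A minimal sketch of the probe side of that pattern, with an assumed 2-minute freshness window (the actual script, path handling, and threshold in the components may differ):

```yaml
# Sketch of a file-freshness liveness probe (heartbeat pattern). It only
# verifies that the heartbeat file was touched recently; it cannot tell
# whether the worker is actually consuming tasks. The 2-minute window and
# the timings are assumptions for illustration.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -n "$(find /app/tmp/celery_worker_heartbeat -mmin -2)"
  initialDelaySeconds: 60
  periodSeconds: 30
```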
Unfortunately the health check does not always work, deeming the celery worker "healthy" while it is not. We have been unable to replicate consistently when this occurs. It seems to happen during cluster reboots or full re-deployments. This seems to be happening quite regularly for objecten.
What we have observed is that, while the celery worker is no longer processing tasks (confirmed by running a celery inspect command; all celery commands fail with this error), the health check keeps updating the `celery_worker_heartbeat` file in `/app/tmp/`. The celery worker is therefore deemed healthy and the pod is not restarted, requiring manual intervention once the issue is detected by the user.

Second, the health check also does not seem to honor long-running tasks, causing it to restart a pod while it is processing them. This is the main reason the health check was created in the first place and we are not simply using `celery inspect ping -d celery@$HOSTNAME`, as described in https://medium.com/ambient-innovation/health-checks-for-celery-in-kubernetes-cf3274a3e106.

Therefore we now advise using the native celery ping with a high threshold / periodSeconds so that the celery worker has time to process longer-running tasks.
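A sketch of such a native-ping probe; the --app module here is a placeholder and the generous timings are only an example, to be tuned per component:

```yaml
# Sketch of the advised native celery ping liveness probe. Replace the
# placeholder --app module per component; the high periodSeconds and
# failureThreshold give a worker that is busy with longer tasks more time
# before the pod is restarted. All values are illustrative.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - celery --app myproject.celery inspect ping -d celery@$HOSTNAME
  initialDelaySeconds: 60
  periodSeconds: 120
  timeoutSeconds: 30
  failureThreshold: 5
```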