Celery worker fix liveness check #148
I tried adding this probe to OAB:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - celery --workdir src --app openarchiefbeheer.celery inspect active
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```

This did not seem to interrupt the long-running tasks, and when I shut down Redis from the redis-cli inside the pod, the worker was also restarted after the Redis pod came back up. The command …
@alextreme Let's discuss.
Discussed: both the inspect active and the ping healthchecks are necessary, as they monitor and avoid two different failure scenarios (whether the task queue is active and reachable vs. whether tasks can actually be performed). I'd like to have both liveness probes in each chart with Celery, and to use higher timeouts for the ping healthcheck if there are long-running Celery tasks.
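For illustration only: a Kubernetes container accepts a single livenessProbe, so one way to get both checks is to chain the two commands in one exec probe. A rough sketch, reusing the OAB app module from the example above; all timings are made-up values, not chart defaults:

```yaml
# Sketch: both healthchecks chained in a single exec liveness probe.
# The --app module is the OAB one from the earlier example; swap it per
# component. Timings are illustrative and should be generous when
# long-running tasks are expected.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >-
        celery --workdir src --app openarchiefbeheer.celery inspect active &&
        celery --workdir src --app openarchiefbeheer.celery inspect ping -d celery@$HOSTNAME
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
  successThreshold: 1
```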
I am not sure I understand why the extra ping is needed. If the worker isn't working properly, the …
No, you are not missing something; this is exactly the scenario that I have in mind. In these cases there should be a Celery soft/hard time limit on the task (https://docs.celeryq.dev/en/stable/userguide/workers.html#time-limits), but just like setting a timeout on requests calls, it is easy to forget and has led to issues in the past where Celery workers required manual restarting. And even if you neatly put time limits on your own tasks, third-party Celery tasks can also be enabled, and we don't vet those tasks for having time limits set.
I see 🤔 but if it is forgotten, it is not difficult to add to the projects? And with the project-level settings you can set the time limit for all tasks (so I think this would also take care of the third-party libraries that run tasks?). I am just a bit reluctant to add the ping to all projects as a blanket rule, because projects like OAB, which have long-running tasks, would then struggle with the ping probe 🤔
Well, I guess it depends on how long the tasks are expected to run. With OIP we have a 15-minute hard task limit (https://github.com/maykinmedia/open-inwoner/blob/fbae2b82d82107a9e1e615c7ff801f851533e8c7/src/open_inwoner/conf/base.py#L687), although that limit is configurable via an envvar. Then again, because of this the need for a ping healthcheck is indeed zero. So let's add the inspect active healthcheck to all Celery-using components, and decide on a component-per-component basis which of these situations applies: a global hard task limit, or long-running tasks whose duration can't be properly determined a priori. If either of these is applicable, then the ping healthcheck is to be removed.
Sounds good, I will add them 👍
Done in the context of issue maykinmedia/charts#148
Done in the context of maykinmedia/charts#148. This prevents a task from hanging indefinitely: a hanging task gets killed by the Celery time limit.
OAB has long-running tasks (>2h), so only the celery inspect active probe is added.
The Objects API has a Celery task limit of 5 min (soft) and 15 min (hard), so no ping was added.
Open Forms has task limits of 5 minutes (soft) and 15 minutes (hard), so no ping was added.
OIP has a hard task limit of 15 min, so no ping probe was added.
Timeouts added in maykinmedia/open-klant#316
Open Zaak has one long-running task for which we don't know how long it will take, so no ping probe was added. The task: https://github.com/open-zaak/open-zaak/blob/70449f11a31a0e8a06d9e47d49e095b31d1068f6/src/openzaak/components/documenten/tasks.py#L333
Create a celery worker liveness probe that, when celery commands fail with `Error: No nodes replied within time constraint`, should definitely NOT update the `/app/tmp/celery_worker_heartbeat` file.

Problem description:
There have been multiple incidents relating to the issue where celery stops processing tasks, see: celery/celery#7276
The issue was first reported for openforms: open-formulieren/open-forms#2927
A liveness probe health check was created to work around the issue: open-formulieren/open-forms#3014
This health check has now been implemented in all components using celery.
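For reference, the check from the linked PR roughly follows the heartbeat-file pattern from the article linked further below: the worker side keeps touching a heartbeat file, and the liveness probe only tests that the file is fresh. A minimal sketch of the probe side of that pattern, with an assumed 2-minute freshness window (the actual script, path handling, and threshold in the components may differ):

```yaml
# Sketch of a file-freshness liveness probe (heartbeat pattern). It only
# verifies that the heartbeat file was touched recently; it cannot tell
# whether the worker is actually consuming tasks. The 2-minute window and
# the timings are assumptions for illustration.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -n "$(find /app/tmp/celery_worker_heartbeat -mmin -2)"
  initialDelaySeconds: 60
  periodSeconds: 30
```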
Unfortunately the health check does not always work, deeming the celery worker "healthy" while it is not. We have been unable to replicate consistently when this occurs. It seems to happen during cluster reboots or full re-deployments. This seems to be happening quite regularly for objecten.
What we have observed is that, while the celery worker is no longer processing tasks (confirmed by running a celery inspect command; all celery commands fail with this error), the health check keeps updating the `celery_worker_heartbeat` file in `/app/tmp/`. The celery worker is therefore deemed healthy and the pod is not restarted, requiring manual intervention once the issue is detected by the user.

Second, the health check also does not seem to honor long-running tasks, causing it to restart a pod while it is processing them. This is the main reason the health check was created in the first place and we are not simply using `celery inspect ping -d celery@$HOSTNAME`, as described in https://medium.com/ambient-innovation/health-checks-for-celery-in-kubernetes-cf3274a3e106.

Therefore we now advise using the native celery ping with a high threshold / periodSeconds so that the celery worker has time to process longer-running tasks.
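A sketch of such a native-ping probe; the --app module here is a placeholder and the generous timings are only an example, to be tuned per component:

```yaml
# Sketch of the advised native celery ping liveness probe. Replace the
# placeholder --app module per component; the high periodSeconds and
# failureThreshold give a worker that is busy with longer tasks more time
# before the pod is restarted. All values are illustrative.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - celery --app myproject.celery inspect ping -d celery@$HOSTNAME
  initialDelaySeconds: 60
  periodSeconds: 120
  timeoutSeconds: 30
  failureThreshold: 5
```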