Why is RescueStuckJobsAfter so high? #687
-
Hi @rgalanakis, so the issue is that in the current version of River, this If your jobs are always responsive to context cancellation and you're cleanly shutting down on exit, you shouldn't encounter this situation except when there's a crash. There's currently no configurability for this other than the one |
-
Hello, we have had some OOM crashes on our server (unrelated to River) and we ended up having stuck jobs that were rescued after an hour.
I noticed that the default job timeout is 1 minute, which is very fast (a good default, it encourages fast jobs!). But the default 'rescue stuck jobs after' is 1 hour, which seems very high. We want to lower the rescue value so we recover more quickly from OOMs, but I want to make sure I'm not missing how this is designed to work.
If our jobs are designed to take no more than a minute, and our job timeout is a minute, is it safe to have stuck jobs rescued after, say, 2 minutes? If not, I must be misunderstanding the interaction between the timeout and the rescue timing. Can you explain how it works? And if a job timeout of 1 minute does mean no job runs for longer than 1 minute, so that rescuing jobs that have been running for 2 minutes is safe, why is the default rescue value so high?
Is the rescue time so high in order to handle per-worker timeout customizations? If so, and we had a worker with a job timeout of 2 hours, would the stuck job watcher cause a problem and think the job is stuck?
Thanks for helping me sort this out!