Why is RescueStuckJobsAfter so high? #687

rgalanakis · 2024-12-06T01:19:06Z

Hello, we have had some OOM crashes on our server (unrelated to River) and we ended up having stuck jobs that were rescued after an hour.

I noticed that the default Job Timeout is 1 minute, which is very fast (a good default, encourages fast jobs!). But the default 'rescue stuck jobs after' is 1 hour, which seems very high. We want to lower the rescue value (so we recover more quickly from OOMs), but I want to make sure I'm not missing how this is designed to work.

If our jobs are designed to take no more than a minute, and our job timeout is a minute, is it safe to have stuck jobs rescued after, say, 2 minutes? If not, I must be misunderstanding the interaction between the timeout and the rescue timing. Can you maybe explain how it works? If it is the case that, with a job timeout of 1 minute, no job will run for longer than 1 minute, and we can therefore rescue jobs that have been running for 2 minutes, why is the default rescue value so high?

Is the rescue time so high in order to handle per-worker timeout customizations? If so, and we had a worker with a job timeout of 2 hours, would the stuck job watcher cause a problem and think the job is stuck?

Thanks for helping me sort this out!

bgentry · 2024-12-06T04:05:54Z

Maintainer

Hi @rgalanakis, so the issue is that in the current version of River, this RescueStuckJobsAfter acts as an upper limit on how long jobs can run before they're considered "stuck" and are retried elsewhere. A background maintenance process uses this setting to determine when jobs are in this state. The reason it's fairly high by default is to allow long-running jobs out-of-the-box.

If your jobs are always responsive to context cancellation and you're cleanly shutting down on exit, you shouldn't encounter this situation except when there's a crash.

There's currently no configurability for this other than the one time.Duration setting, and no way to vary the behavior by individual kinds of workers, per job, etc. Neither of these is great, and I'm hoping we can provide some better options in the future. But for now, this is what we have.

2 replies

rgalanakis Dec 6, 2024
Author

Thanks Blake. We do only encounter this when there's a crash, which is not uncommon for this service because it's pretty esoteric in what it's doing and can see memory ballooning. So OOM kills are rare but still part of life we can't really get around.

So based on your answer, and assuming our jobs respond to cancellation, we should be safe to lower RescueStuckJobsAfter to a bit over JobTimeout (we're not customizing it for individual jobs)?

For the sake of my understanding: if I set an individual worker's JobTimeout to -1, and it runs for > 1 hour, will the rescue maintenance process think it's stuck and start a new job?

bgentry Dec 6, 2024
Maintainer

Yes, it's the JobRescuer that's responsible for this. First it fetches a batch of stuck jobs and then either retries with backoff or discards them, depending on whether they have remaining attempts available.
