When the DB is down for even a few seconds, it can completely hose the job runner, losing job logs and leaving jobs stuck "in process" instead of moving to "success". It's rare, but it does happen, and manually fixing an orphaned job is very annoying.
The fix is probably to replicate Go's built-in `DB.retry` functionality, but with a delay between retries and a larger retry threshold than Go's current `maxBadConnRetries` value.
This should probably only affect background jobs - the HTTP server is relatively safe even if there's an outage. Worst case for HTTP requests, somebody gets an immediate error and has to try again. When it happens in the job runner, though, there's no real-time way to deal with it, so it just ends up losing logs or getting a job stuck.
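A minimal sketch of what such a wrapper might look like - all names here (`withRetry`, `markJobSuccess`, `maxRetries`, `retryDelay`) are hypothetical rather than from the actual codebase, and the `$1` placeholder assumes a Postgres driver:

```go
// Package jobrunner: an illustrative retry wrapper for background-job DB
// calls. database/sql's internal bad-connection retry isn't configurable,
// so the delay and the larger threshold live in this wrapper instead.
package jobrunner

import (
	"database/sql"
	"log"
	"time"
)

const (
	maxRetries = 10              // well above database/sql's internal maxBadConnRetries
	retryDelay = 2 * time.Second // give a brief DB outage time to clear
)

// withRetry runs op until it succeeds or maxRetries attempts have failed,
// sleeping retryDelay between attempts. A real version would likely only
// retry connection-level errors rather than every error.
func withRetry(op func() error) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		log.Printf("db call failed (attempt %d/%d), retrying in %s: %v",
			attempt, maxRetries, retryDelay, err)
		time.Sleep(retryDelay)
	}
	return err
}

// markJobSuccess is a hypothetical job-runner update wrapped in withRetry,
// so a short outage doesn't leave the job stuck "in process".
func markJobSuccess(db *sql.DB, jobID int64) error {
	return withRetry(func() error {
		_, err := db.Exec(`UPDATE jobs SET status = 'success' WHERE id = $1`, jobID)
		return err
	})
}
```

Since this only wraps the job runner's own queries, the HTTP path keeps its current fail-fast behavior.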
This seems like it should almost never happen, but we're finding that the more things we put behind HAProxy, the more likely it is for a configuration change to briefly stop all services, even if only for a second or two. A local database would solve this, but that takes away the value of having redundancy at the HAProxy level. Losing one DB head in our current setup doesn't stop the app from continuing - we'd have to lose all three heads.