When the DB is down for even a few seconds, it can completely hose the job runner, losing job logs and leaving jobs stuck "in process" instead of moving to "success". It's rare, but it does happen, and manually fixing an orphaned job is very annoying.
The fix is probably to replicate Go's built-in `DB.retry` functionality, but with a delay between retries and a larger retry threshold than Go's current `maxBadConnRetries` value.
This should probably only affect background jobs - the HTTP server is relatively safe even if there's an outage. Worst case for HTTP requests, somebody gets an immediate error and has to try again. When it happens in the job runner, though, there's no real-time way to deal with it, so it just ends up losing logs or getting a job stuck.
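A minimal sketch of what such a wrapper might look like - all names here (`withRetry`, `markJobSuccess`, `maxRetries`, `retryDelay`) are hypothetical rather than from the actual codebase, and the `$1` placeholder assumes a Postgres driver:

```go
// Package jobrunner: an illustrative retry wrapper for background-job DB
// calls. database/sql's internal bad-connection retry isn't configurable,
// so the delay and the larger threshold live in this wrapper instead.
package jobrunner

import (
	"database/sql"
	"log"
	"time"
)

const (
	maxRetries = 10              // well above database/sql's internal maxBadConnRetries
	retryDelay = 2 * time.Second // give a brief DB outage time to clear
)

// withRetry runs op until it succeeds or maxRetries attempts have failed,
// sleeping retryDelay between attempts. A real version would likely only
// retry connection-level errors rather than every error.
func withRetry(op func() error) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		log.Printf("db call failed (attempt %d/%d), retrying in %s: %v",
			attempt, maxRetries, retryDelay, err)
		time.Sleep(retryDelay)
	}
	return err
}

// markJobSuccess is a hypothetical job-runner update wrapped in withRetry,
// so a short outage doesn't leave the job stuck "in process".
func markJobSuccess(db *sql.DB, jobID int64) error {
	return withRetry(func() error {
		_, err := db.Exec(`UPDATE jobs SET status = 'success' WHERE id = $1`, jobID)
		return err
	})
}
```

Since this only wraps the job runner's own queries, the HTTP path keeps its current fail-fast behavior.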
This seems like it should almost never happen, but we're finding that the more things we put behind HAProxy, the more likely it is for a configuration change to briefly stop all services, even if only for a second or two. A local database would solve this, but that takes away the value of having redundancy at the HAProxy level. Losing one DB head in our current setup doesn't stop the app from continuing - we'd have to lose all three heads.