Add very basic version of job unstuck-ing for non-txn jobs that hang … #57

bretthoerner · 2024-01-18T22:02:10Z

…in 'running'

This is a v1 implementation so that we have something in place. The query finds jobs that have been running for over 2 minutes and puts them back to available. It does not add an attempt, because the action that made them running already did.

Porting all of our retry time calculation logic into SQL didn't seem worth it, nor did SELECTing all the rows, doing the exact retry-time calculation per row in Rust, and updating them individually. We shouldn't have jobs getting stuck unless pods crash, so this isn't the common path for retries.
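As a rough illustration of the rule described above, here is a minimal in-memory sketch. It is not the query from this PR: the `Job` struct, the `JobStatus` variants, the `started_at` field, and the `unstick_jobs` function are all hypothetical names chosen for the example. It only shows the intended state change: jobs running longer than the threshold go back to available, and the attempt count is left alone.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical job states; the real queue's status names may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Available,
    Running,
    Completed,
}

// Hypothetical in-memory stand-in for a row in the job queue table.
#[derive(Debug, Clone)]
struct Job {
    id: u64,
    status: JobStatus,
    attempt: u32,
    // When the job was moved to `Running` (assumed field, for illustration).
    started_at: SystemTime,
}

/// Puts jobs that have been `Running` for longer than `threshold` back to
/// `Available`. The attempt count is deliberately not touched: the step that
/// moved the job to `Running` already counted the attempt.
/// Returns the number of jobs that were reset.
fn unstick_jobs(jobs: &mut [Job], now: SystemTime, threshold: Duration) -> usize {
    let mut reset = 0;
    for job in jobs.iter_mut() {
        // A start time in the future is treated as "running for zero time".
        let running_for = now
            .duration_since(job.started_at)
            .unwrap_or(Duration::ZERO);
        if job.status == JobStatus::Running && running_for > threshold {
            job.status = JobStatus::Available;
            reset += 1;
        }
    }
    reset
}

fn main() {
    let now = SystemTime::now();
    let mut jobs = vec![
        // Running for 5 minutes: considered stuck.
        Job { id: 1, status: JobStatus::Running, attempt: 1, started_at: now - Duration::from_secs(300) },
        // Running for 30 seconds: left alone.
        Job { id: 2, status: JobStatus::Running, attempt: 1, started_at: now - Duration::from_secs(30) },
        // Already finished: never touched, however old it is.
        Job { id: 3, status: JobStatus::Completed, attempt: 1, started_at: now - Duration::from_secs(600) },
    ];

    let reset = unstick_jobs(&mut jobs, now, Duration::from_secs(120));
    println!("reset {reset} job(s)");
    for job in &jobs {
        println!("job {} -> {:?}, attempt {}", job.id, job.status, job.attempt);
    }
}
```

Doing this as a single set-based update, rather than computing an exact retry time per row, matches the trade-off described above: stuck jobs are expected only when pods crash, so a coarse fixed threshold is assumed to be good enough here.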


xvello · 2024-01-26T09:05:51Z

hook-janitor/src/webhooks.rs

+            .acquire()
+            .await
+            .map_err(|e| WebhookCleanerError::AcquireConnError { error: e })?;
+


Worth adding a Harry-comment about:

- why we don't increment attempt when unlocking rows. My first instinct was to do so, but it's not desired

- what happens if for some reason the previous pod comes back to life and finishes it: I'm assuming that we'll duplicate the output, and record the success once or twice (depending on whether janitor runs between both updates?). Again, it's fine if that's the case for now (the events already have dupes from processing), just would love a quick list of tradeoffs while it's fresh in your head.
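To make the second scenario concrete, here is a toy simulation of a stalled worker coming back after the janitor has unlocked its job. Everything in it is an assumption for illustration (the `Sim` struct, `claim`, `janitor_unlock`, and `finish` are hypothetical, and the real workers may guard completion differently). It models only the duplicated output, not how successes are recorded.

```rust
// Hypothetical job states; the real queue's status names may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Available,
    Running,
    Completed,
}

#[derive(Debug)]
struct Job {
    status: JobStatus,
    attempt: u32,
}

/// Hypothetical in-memory stand-in for one queued job plus a counter of how
/// many times its output was sent.
#[derive(Debug)]
struct Sim {
    job: Job,
    deliveries: u32,
}

impl Sim {
    fn new() -> Self {
        Sim {
            job: Job { status: JobStatus::Available, attempt: 0 },
            deliveries: 0,
        }
    }

    /// A worker claims the job. This is the step that adds an attempt.
    fn claim(&mut self) -> bool {
        if self.job.status != JobStatus::Available {
            return false;
        }
        self.job.status = JobStatus::Running;
        self.job.attempt += 1;
        true
    }

    /// The janitor unlocks a job it considers stuck; the attempt count is
    /// left as-is.
    fn janitor_unlock(&mut self) {
        if self.job.status == JobStatus::Running {
            self.job.status = JobStatus::Available;
        }
    }

    /// A worker sends the output and marks the job completed. In this toy
    /// model it does not check that it still owns the job (an assumption).
    fn finish(&mut self) {
        self.deliveries += 1;
        self.job.status = JobStatus::Completed;
    }
}

fn simulate() -> Sim {
    let mut sim = Sim::new();
    assert!(sim.claim()); // worker A claims the job, then stalls
    sim.janitor_unlock(); // janitor assumes A's pod crashed
    assert!(sim.claim()); // worker B claims the same job
    sim.finish(); // B sends the output
    sim.finish(); // A comes back to life and sends it again
    sim
}

fn main() {
    let sim = simulate();
    println!(
        "deliveries = {}, attempts = {}, status = {:?}",
        sim.deliveries, sim.job.attempt, sim.job.status
    );
}
```

Under these assumptions the output is sent twice while the attempt count reflects the two claims, which is the duplicate the comment above treats as acceptable for now.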

Add very basic version of job unstuck-ing for non-txn jobs that hang …

26fa4f4

…in 'running'

bretthoerner requested a review from a team January 18, 2024 22:05

xvello approved these changes Jan 26, 2024


bretthoerner added 2 commits January 30, 2024 07:37

Merge remote-tracking branch 'origin/main' into brett/unstuck-v0

4492e48

comment

9db077f

bretthoerner enabled auto-merge (squash) January 30, 2024 16:39

bretthoerner merged commit 6729401 into main Jan 30, 2024
4 checks passed

bretthoerner deleted the brett/unstuck-v0 branch January 30, 2024 16:40
