Skip to content
This repository has been archived by the owner on Feb 8, 2024. It is now read-only.

Add very basic version of job unstuck-ing for non-txn jobs that hang … #57

Merged
merged 3 commits into from
Jan 30, 2024

Conversation

bretthoerner
Copy link
Contributor

@bretthoerner bretthoerner commented Jan 18, 2024

…in 'running'

This is a v1 implementation so we don't forget to have something. The query checks for jobs that have been running for over 2 minutes and puts them back to available (the action that made them running already added an attempt).

Porting all of our retry time calculation logic into SQL didn't seem worth it, nor did SELECTing all the rows, doing the exact retry-time calculation per row in Rust, and updating them individually. We shouldn't have jobs getting stuck unless pods crash, so this isn't the common path for retries.

@bretthoerner bretthoerner requested a review from a team January 18, 2024 22:05
.acquire()
.await
.map_err(|e| WebhookCleanerError::AcquireConnError { error: e })?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a Harry-comment about:

  • why we don't increment attempt when unlocking rows. My first instinct was to do so, but it's not desired
  • what happens if for some reason the previous pod comes back to life and finishes it: I'm assuming that we'll duplicate the output, and record the success one or twice (depending on whether janitor runs between both updates?). Again, it's fine if that's the case for now (the events already have dupes from processing), just would love a quick list of tradeoffs while it's fresh in your head.

@bretthoerner bretthoerner enabled auto-merge (squash) January 30, 2024 16:39
@bretthoerner bretthoerner merged commit 6729401 into main Jan 30, 2024
4 checks passed
@bretthoerner bretthoerner deleted the brett/unstuck-v0 branch January 30, 2024 16:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants