
Fix the job jitter/delay around sidekiq retries #695

Open · wants to merge 5 commits into main

Conversation

gitstart-app[bot] (Contributor) commented Dec 12, 2024

What this PR achieves

  • Updates these jobs:
    app/jobs/store_submissions/test_flight/find_build_job.rb
    app/jobs/store_submissions/app_store/find_build_job.rb

  • Removes the `sidekiq_retry_in` and `sidekiq_retries_exhausted` blocks from them (see the sketch below for the shape of the declarations being removed)
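For context, a minimal sketch of the kind of declarations being removed (illustrative only, assuming these jobs include Sidekiq::Job directly; this is not the exact code in the repo):

```ruby
# Illustrative only; not the exact code from find_build_job.rb.
class FindBuildJob
  include Sidekiq::Job

  sidekiq_options retry: 8

  # Custom backoff curve; Sidekiq's default formula adds random jitter on top.
  sidekiq_retry_in { |count, _exception| 120 * (count + 1) }

  # Terminal handling once all Sidekiq-level retries are used up.
  sidekiq_retries_exhausted do |msg, _exception|
    Rails.logger.error("Build not found after retries: #{msg["args"]}")
  end

  def perform(workflow_run_id)
    # ...locate the store build and move the submission along...
  end
end
```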

gitstart-tramline commented Dec 12, 2024

Hi @kitallis

We’d like your guidance on testing this PR. Could you help validate the scenario provided in the attached screenshot, or are there any additional cases we should consider to ensure the fix is thoroughly tested?

[Screenshot: 2024-12-11 at 6:59:37 PM]

Additionally, are there specific benchmarks or indicators we should monitor to confirm the changes are working as intended?

kitallis (Member) commented:

> Hi @kitallis
>
> We’d like your guidance on testing this PR. Could you help validate the scenario provided in the attached screenshot, or are there any additional cases we should consider to ensure the fix is thoroughly tested?
>
> [Screenshot: 2024-12-11 at 6:59:37 PM]
>
> Additionally, are there specific benchmarks or indicators we should monitor to confirm the changes are working as intended?

I think this cannot be tested reliably at the unit level; we'll have to do some sort of real-time test. What we can do is issue hundreds of retries for a particular job, perhaps StoreRollouts::AppStore::FindLiveReleaseJob since that has seen the most issues due to jitter, and ensure that it runs on the correct ticks of time. Under controlled circumstances (like a local env), with the correct backoffs set, the variability error should not be off by more than a minute or so.
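A rough sketch of that kind of check (not part of the PR; the failing job, Redis key names, and fixed backoff below are made up for illustration):

```ruby
# Enqueue a job that always fails with a fixed, jitter-free backoff, record when each
# attempt actually ran, and later compare consecutive attempts against the expected gap.
class AlwaysFailingJob
  include Sidekiq::Job

  BACKOFF = 60 # seconds

  sidekiq_options retry: 5
  sidekiq_retry_in { |_count, _exception| BACKOFF }

  def perform(id)
    Sidekiq.redis { |r| r.rpush("retry_check:#{id}", Time.now.to_f) }
    raise "forced failure"
  end
end

100.times { |i| AlwaysFailingJob.perform_async(i) }

# After the retries have drained, inspect the drift for each job:
Sidekiq.redis do |r|
  100.times do |i|
    times = r.lrange("retry_check:#{i}", 0, -1).map(&:to_f)
    gaps = times.each_cons(2).map { |a, b| b - a }
    worst = gaps.map { |g| (g - AlwaysFailingJob::BACKOFF).abs }.max
    puts "job #{i}: worst drift #{worst&.round(1)}s"
  end
end
```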

kitallis (Member) commented:

Separately, I think the idea for this change is generally fine, but the implementation seems too complicated. If this is your final approach, we should stop and re-evaluate because I think there are issues with it that won't scale.

gitstart-tramline commented:

> Separately, I think the idea for this change is generally fine, but the implementation seems too complicated. If this is your final approach, we should stop and re-evaluate because I think there are issues with it that won't scale.

Thank you for your feedback. We are exploring other strategies to reduce the implementation complexity.

gitstart-app[bot] marked this pull request as ready for review on January 3, 2025, 10:08
gitstart-tramline commented:

Hi @kitallis, we have pushed changes with the new approach we discussed here.

We migrated two jobs.

kitallis (Member) commented Jan 3, 2025

Thanks for the PR, but please look at V2::TriggerSubmissionsJob (as I'd mentioned in the issue as well).

It's a very simple system: it recurses by incrementing a retry count in the params, so there's no need to maintain the jid or count state separately in redis. It's really straightforward: if we fail, retry; if we run out of retries, fail and exit.
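For illustration, the pattern being described is roughly the following (this is not the actual V2::TriggerSubmissionsJob; the class, predicate, and failure handling are invented):

```ruby
# Sketch of the recursion-by-params retry pattern: the retry count travels in the job
# arguments and the job re-enqueues itself with an explicit delay until it gives up.
class ExampleFindBuildJob < ApplicationJob
  MAX_RETRIES = 8
  BACKOFF = 2.minutes

  def perform(submission_id, retry_count = 0)
    submission = StoreSubmission.find(submission_id)
    return if submission.build_found? # hypothetical "we're done" predicate

    if retry_count < MAX_RETRIES
      # fail => retry: re-enqueue with the incremented count and a fixed backoff
      self.class.set(wait: BACKOFF).perform_later(submission_id, retry_count + 1)
    else
      # fail + exit: out of retries, mark the submission as failed (hypothetical helper)
      submission.fail_with!(:build_not_found)
    end
  end
end
```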

I won't be able to merge this as it's unnecessarily complicated, imo.

Am I missing some innate complexity here?

gitstart-tramline commented:

@kitallis
The approach in the above file might not be the best solution because it may lead to issues in the future.

For example:
If we launch jobs on the production server, multiple jobs will run in parallel. Now, suppose the server goes down for some reason in the middle of processing. In that case, the retry mechanism will fail or crash.

On the other hand, with our approach, we can easily track where the jobs stopped and resume them from that exact point.

Apart from that, if we follow the straightforward approach there will be scalability issues, because without a centralized system our setup cannot scale efficiently. Currently, our approach allows us to add features like real-time tracking, dynamic retries, or custom handling for different error types without impacting the core job logic.
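(For illustration only, and not the PR's actual code: the kind of centralized tracking we mean is roughly of this shape.)

```ruby
# Sketch of a Redis-backed attempt tracker: each attempt is recorded under a stable
# key so a restarted or resumed job can see how far a given unit of work got.
require "redis"

class RetryTracker
  KEY_TTL = 86_400 # seconds

  def initialize(redis: Redis.new)
    @redis = redis
  end

  def record_attempt(job_class, job_key)
    key = "retries:#{job_class}:#{job_key}"
    count = @redis.incr(key)
    @redis.expire(key, KEY_TTL)
    count
  end

  def attempts(job_class, job_key)
    @redis.get("retries:#{job_class}:#{job_key}").to_i
  end
end
```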

Additionally, if we implement a straightforward retry mechanism, we won't be able to cover all edge cases in our RSpec tests, especially the ones related to job failures, because jobs run in the background.

kitallis (Member) commented Jan 6, 2025

> The approach in the above file might not be the best solution because it may lead to issues in the future.

I think your line of thinking is fair, but the implementation doesn't do that much more to help that cause.

I've responded inline:

> If we launch jobs on the production server, multiple jobs will run in parallel. Now, suppose the server goes down for some reason in the middle of processing.

If a worker fails in the middle of processing, Sidekiq will push the job back onto the queue, because it uses BRPOPLPUSH underneath, which blocks on the task until it's either completed or timed out.
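For context, the reliable-queue pattern being referenced looks roughly like this at the Redis level (a conceptual sketch, not Sidekiq's internals; `process` and the key names are made up):

```ruby
require "redis"

redis = Redis.new

# Atomically move a payload from the pending queue to a "working" list,
# blocking for up to 5 seconds if the queue is empty.
payload = redis.brpoplpush("queue:default", "queue:default:working", timeout: 5)

if payload
  process(payload)                                 # hypothetical work method
  redis.lrem("queue:default:working", 1, payload)  # ack: drop it from the working list
  # If the worker died before the LREM, a reaper process could re-enqueue anything
  # still sitting in the working list.
end
```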

> Currently, our approach allows us to add features like real-time tracking, dynamic retries, or custom handling for different error types without impacting the core job logic.

This is fair and worth considering, but I think that's a whole other can of worms: since this is a new mechanism, we won't be able to piggyback on the Sidekiq dashboard easily and will have to extend that too. Additionally, this current implementation would suffer from the same worker-death problem: if the worker dies before you can write the job ID to redis, we won't be able to recover or track retries anyway. The recoverability is probabilistic.

> If we implement a straightforward retry mechanism, we won't be able to cover all edge cases in our RSpec tests, especially the ones related to job failures, because jobs run in the background.

This should be possible, since in tests the jobs aren't actually enqueued in the background.
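For example, with the ActiveJob test adapter the re-enqueue is captured synchronously and can be asserted on directly (a sketch against the hypothetical job sketched earlier; the factory and predicate are also invented):

```ruby
require "rails_helper"

RSpec.describe ExampleFindBuildJob do
  it "re-enqueues itself with an incremented retry count while the build is missing" do
    submission = create(:store_submission) # assumes a FactoryBot factory exists
    allow_any_instance_of(StoreSubmission).to receive(:build_found?).and_return(false)

    # perform_now runs the job inline; the retry goes through the test adapter's queue
    described_class.perform_now(submission.id, 0)

    expect(described_class).to have_been_enqueued.with(submission.id, 1)
  end

  it "marks the submission failed once retries are exhausted" do
    submission = create(:store_submission)
    allow_any_instance_of(StoreSubmission).to receive(:build_found?).and_return(false)
    allow_any_instance_of(StoreSubmission).to receive(:fail_with!)

    described_class.perform_now(submission.id, described_class::MAX_RETRIES)

    expect(described_class).not_to have_been_enqueued
  end
end
```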


In summary,

  • Since Sidekiq re-enqueues incomplete jobs on worker-death, the job is unlikely to get lost.
  • If the worker dies mid-processing, there's no guarantee that your retry tracking gets written, since it could die before the write.
  • When we do a "manual retry" like in the example I mentioned, instead of a separate ZADD onto the retry set (the way Sidekiq does it for sidekiq_retry_in), it just does an explicit enqueue onto the scheduled queue via perform_later. I think the integrity guarantees of the two approaches are roughly similar.
  • It is true that the DSL provides a better dev experience, since it separates retry from core job logic, but I was hoping we'd do that without any external deps, just pure ruby. I'm not certain whether that's possible or not (one possible shape is sketched after this list).
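One possible pure-Ruby shape (a sketch under the assumption that the jobs are ActiveJob-based; the mixin and names are invented, not code from the repo):

```ruby
# A dependency-free retry DSL as a plain Ruby mixin: the class declares its retry
# policy, and the job calls retry_or_give_up when the work isn't done yet.
module Retryable
  def self.included(base) = base.extend(ClassMethods)

  module ClassMethods
    def retries(max:, backoff:)
      @max_retries = max
      @backoff = backoff
    end

    attr_reader :max_retries, :backoff
  end

  def retry_or_give_up(*args, retry_count:)
    if retry_count < self.class.max_retries
      self.class.set(wait: self.class.backoff).perform_later(*args, retry_count + 1)
    else
      yield if block_given? # terminal handling, e.g. mark the submission as failed
    end
  end
end

class ExampleFindBuildJob < ApplicationJob
  include Retryable
  retries max: 8, backoff: 2.minutes

  def perform(submission_id, retry_count = 0)
    submission = StoreSubmission.find(submission_id)
    return if submission.build_found? # hypothetical predicate

    retry_or_give_up(submission_id, retry_count: retry_count) do
      submission.fail_with!(:build_not_found) # hypothetical terminal handler
    end
  end
end
```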

gitstart-tramline commented:

> It is true that the DSL provides a better dev experience, since it separates retry from core job logic, but I was hoping we'd do that without any external deps, just pure ruby. I'm not certain whether that's possible or not.

Thank you for the clarification. We will update the implementation to follow the approach used in V2::TriggerSubmissionsJob.

kitallis (Member) commented Jan 6, 2025

> If a worker fails in the middle of processing, Sidekiq will push the job back onto the queue, because it uses BRPOPLPUSH underneath, which blocks on the task until it's either completed or timed out.

On second thought, I was partly wrong about the above behavior: it only happens on Sidekiq Pro. See super_fetch.

However, that doesn't change the core point.

We can later fix the reliability issue by using a third-party gem like gitlab-reliable-fetch.

gitstart-tramline commented:

Hi @kitallis, we have now implemented the solution using plain Ruby code. Please review.
