
Fix the job jitter/delay around sidekiq retries #695

Open · wants to merge 5 commits into main

Conversation

gitstart-app[bot] (Contributor) commented Dec 12, 2024

What this PR achieves

  • Updates these jobs:
    app/jobs/store_submissions/test_flight/find_build_job.rb
    app/jobs/store_submissions/app_store/find_build_job.rb

  • Removes the `sidekiq_retry_in` and `sidekiq_retries_exhausted` blocks from them (see the sketch below for the shape of the declarations being removed)
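For context, a minimal sketch of the kind of declarations being removed (illustrative only, assuming these jobs include Sidekiq::Job directly; this is not the exact code in the repo):

```ruby
# Illustrative only; not the exact code from find_build_job.rb.
class FindBuildJob
  include Sidekiq::Job

  sidekiq_options retry: 8

  # Custom backoff curve; Sidekiq's default formula adds random jitter on top.
  sidekiq_retry_in { |count, _exception| 120 * (count + 1) }

  # Terminal handling once all Sidekiq-level retries are used up.
  sidekiq_retries_exhausted do |msg, _exception|
    Rails.logger.error("Build not found after retries: #{msg["args"]}")
  end

  def perform(workflow_run_id)
    # ...locate the store build and move the submission along...
  end
end
```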

gitstart-tramline commented Dec 12, 2024

Hi @kitallis

We’d like your guidance on testing this PR. Could you help validate the scenario provided in the attached screenshot, or are there any additional cases we should consider to ensure the fix is thoroughly tested?

[Screenshot: 2024-12-11 at 6:59:37 PM]

Additionally, are there specific benchmarks or indicators we should monitor to confirm the changes are working as intended?

kitallis (Member) commented:

> Hi @kitallis
>
> We’d like your guidance on testing this PR. Could you help validate the scenario provided in the attached screenshot, or are there any additional cases we should consider to ensure the fix is thoroughly tested?
>
> [Screenshot: 2024-12-11 at 6:59:37 PM]
>
> Additionally, are there specific benchmarks or indicators we should monitor to confirm the changes are working as intended?

I think this cannot be tested reliably at the unit level; we'll have to do some sort of real-time test. What we can do is issue hundreds of retries for a particular job, perhaps StoreRollouts::AppStore::FindLiveReleaseJob since that has seen the most issues due to jitter, and ensure that it runs on the correct ticks of time. Under controlled circumstances (like a local env), with the correct backoffs set, the variability error should not be off by more than a minute or so.
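A rough sketch of that kind of check (not part of the PR; the failing job, Redis key names, and fixed backoff below are made up for illustration):

```ruby
# Enqueue a job that always fails with a fixed, jitter-free backoff, record when each
# attempt actually ran, and later compare consecutive attempts against the expected gap.
class AlwaysFailingJob
  include Sidekiq::Job

  BACKOFF = 60 # seconds

  sidekiq_options retry: 5
  sidekiq_retry_in { |_count, _exception| BACKOFF }

  def perform(id)
    Sidekiq.redis { |r| r.rpush("retry_check:#{id}", Time.now.to_f) }
    raise "forced failure"
  end
end

100.times { |i| AlwaysFailingJob.perform_async(i) }

# After the retries have drained, inspect the drift for each job:
Sidekiq.redis do |r|
  100.times do |i|
    times = r.lrange("retry_check:#{i}", 0, -1).map(&:to_f)
    gaps = times.each_cons(2).map { |a, b| b - a }
    worst = gaps.map { |g| (g - AlwaysFailingJob::BACKOFF).abs }.max
    puts "job #{i}: worst drift #{worst&.round(1)}s"
  end
end
```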

kitallis (Member) commented:

Separately, I think the idea for this change is generally fine, but the implementation seems too complicated. If this is your final approach, we should stop and re-evaluate because I think there are issues with it that won't scale.

gitstart-tramline commented:

> Separately, I think the idea for this change is generally fine, but the implementation seems too complicated. If this is your final approach, we should stop and re-evaluate because I think there are issues with it that won't scale.

Thank you for your feedback. We are exploring other strategies to reduce the implementation complexity.

gitstart-app[bot] marked this pull request as ready for review on January 3, 2025, 10:08
gitstart-tramline commented:

Hi @kitallis, we have pushed changes with the new approach we discussed here.

We migrated two jobs.

kitallis (Member) commented Jan 3, 2025

Thanks for the PR, but please look at V2::TriggerSubmissionsJob (as I'd mentioned in the issue as well).

It's a very simple system: it recurses by incrementing a retry count in the params, so there's no need to maintain the jid or count state separately in redis. It's really straightforward: if we fail, retry; if we run out of retries, fail and exit.
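For illustration, the pattern being described is roughly the following (this is not the actual V2::TriggerSubmissionsJob; the class, predicate, and failure handling are invented):

```ruby
# Sketch of the recursion-by-params retry pattern: the retry count travels in the job
# arguments and the job re-enqueues itself with an explicit delay until it gives up.
class ExampleFindBuildJob < ApplicationJob
  MAX_RETRIES = 8
  BACKOFF = 2.minutes

  def perform(submission_id, retry_count = 0)
    submission = StoreSubmission.find(submission_id)
    return if submission.build_found? # hypothetical "we're done" predicate

    if retry_count < MAX_RETRIES
      # fail => retry: re-enqueue with the incremented count and a fixed backoff
      self.class.set(wait: BACKOFF).perform_later(submission_id, retry_count + 1)
    else
      # fail + exit: out of retries, mark the submission as failed (hypothetical helper)
      submission.fail_with!(:build_not_found)
    end
  end
end
```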

I won't be able to merge this as it's unnecessarily complicated, imo.

Am I missing some innate complexity here?

gitstart-tramline commented:

@kitallis
The approach in the above file might not be the best solution because it may lead to issues in the future.

For example:
If we launch jobs on the production server, multiple jobs will run in parallel. Now, suppose the server goes down for some reason in the middle of processing. In that case, the retry mechanism will fail or crash.

On the other hand, with our approach, we can easily track where the jobs stopped and resume them from that exact point.

Apart from that, if we follow the straightforward approach there will be scalability issues, because without a centralized system our setup cannot scale efficiently. Currently, our approach allows us to add features like real-time tracking, dynamic retries, or custom handling for different error types without impacting the core job logic.
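(For illustration only, and not the PR's actual code: the kind of centralized tracking we mean is roughly of this shape.)

```ruby
# Sketch of a Redis-backed attempt tracker: each attempt is recorded under a stable
# key so a restarted or resumed job can see how far a given unit of work got.
require "redis"

class RetryTracker
  KEY_TTL = 86_400 # seconds

  def initialize(redis: Redis.new)
    @redis = redis
  end

  def record_attempt(job_class, job_key)
    key = "retries:#{job_class}:#{job_key}"
    count = @redis.incr(key)
    @redis.expire(key, KEY_TTL)
    count
  end

  def attempts(job_class, job_key)
    @redis.get("retries:#{job_class}:#{job_key}").to_i
  end
end
```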

Additionally, if we implement a straightforward retry mechanism, we won't be able to cover all edge cases in our RSpec tests, especially the ones related to job failures, because jobs run in the background.

kitallis (Member) commented Jan 6, 2025

> The approach in the above file might not be the best solution because it may lead to issues in the future.

I think your line of thinking is fair, but the implementation doesn't do that much more to help that cause.

I've responded inline:

> If we launch jobs on the production server, multiple jobs will run in parallel. Now, suppose the server goes down for some reason in the middle of processing.

If a worker fails in the middle of processing, Sidekiq will push the job back onto the queue, because it uses BRPOPLPUSH underneath, which blocks on the task until it's either completed or timed out.
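For context, the reliable-queue pattern being referenced looks roughly like this at the Redis level (a conceptual sketch, not Sidekiq's internals; `process` and the key names are made up):

```ruby
require "redis"

redis = Redis.new

# Atomically move a payload from the pending queue to a "working" list,
# blocking for up to 5 seconds if the queue is empty.
payload = redis.brpoplpush("queue:default", "queue:default:working", timeout: 5)

if payload
  process(payload)                                 # hypothetical work method
  redis.lrem("queue:default:working", 1, payload)  # ack: drop it from the working list
  # If the worker died before the LREM, a reaper process could re-enqueue anything
  # still sitting in the working list.
end
```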

> Currently, our approach allows us to add features like real-time tracking, dynamic retries, or custom handling for different error types without impacting the core job logic.

This is fair and worth considering, but I think that's a whole other can of worms: since this is a new mechanism, we won't be able to piggyback on the Sidekiq dashboard easily and will have to extend that too. Additionally, this current implementation would suffer from the same worker-death problem: if the worker dies before you can write the job ID to redis, we won't be able to recover or track retries anyway. The recoverability is probabilistic.

> If we implement a straightforward retry mechanism, we won't be able to cover all edge cases in our RSpec tests, especially the ones related to job failures, because jobs run in the background.

This should be possible, since in tests the jobs aren't actually enqueued in the background.
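For example, with the ActiveJob test adapter the re-enqueue is captured synchronously and can be asserted on directly (a sketch against the hypothetical job sketched earlier; the factory and predicate are also invented):

```ruby
require "rails_helper"

RSpec.describe ExampleFindBuildJob do
  it "re-enqueues itself with an incremented retry count while the build is missing" do
    submission = create(:store_submission) # assumes a FactoryBot factory exists
    allow_any_instance_of(StoreSubmission).to receive(:build_found?).and_return(false)

    # perform_now runs the job inline; the retry goes through the test adapter's queue
    described_class.perform_now(submission.id, 0)

    expect(described_class).to have_been_enqueued.with(submission.id, 1)
  end

  it "marks the submission failed once retries are exhausted" do
    submission = create(:store_submission)
    allow_any_instance_of(StoreSubmission).to receive(:build_found?).and_return(false)
    allow_any_instance_of(StoreSubmission).to receive(:fail_with!)

    described_class.perform_now(submission.id, described_class::MAX_RETRIES)

    expect(described_class).not_to have_been_enqueued
  end
end
```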


In summary,

  • Since Sidekiq re-enqueues incomplete jobs on worker-death, the job is unlikely to get lost.
  • If the worker dies mid-processing, there's no guarantee that your retry tracking gets written, since it could die before the write.
  • When we do a "manual retry" like in the example I mentioned, instead of a separate ZADD onto the retry set (the way Sidekiq does it for sidekiq_retry_in), it just does an explicit enqueue onto the scheduled queue via perform_later. I think the integrity guarantees of the two approaches are roughly similar.
  • It is true that the DSL provides a better dev experience, since it separates retry from core job logic, but I was hoping we'd do that without any external deps, just pure ruby. I'm not certain whether that's possible or not (one possible shape is sketched after this list).
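One possible pure-Ruby shape (a sketch under the assumption that the jobs are ActiveJob-based; the mixin and names are invented, not code from the repo):

```ruby
# A dependency-free retry DSL as a plain Ruby mixin: the class declares its retry
# policy, and the job calls retry_or_give_up when the work isn't done yet.
module Retryable
  def self.included(base) = base.extend(ClassMethods)

  module ClassMethods
    def retries(max:, backoff:)
      @max_retries = max
      @backoff = backoff
    end

    attr_reader :max_retries, :backoff
  end

  def retry_or_give_up(*args, retry_count:)
    if retry_count < self.class.max_retries
      self.class.set(wait: self.class.backoff).perform_later(*args, retry_count + 1)
    else
      yield if block_given? # terminal handling, e.g. mark the submission as failed
    end
  end
end

class ExampleFindBuildJob < ApplicationJob
  include Retryable
  retries max: 8, backoff: 2.minutes

  def perform(submission_id, retry_count = 0)
    submission = StoreSubmission.find(submission_id)
    return if submission.build_found? # hypothetical predicate

    retry_or_give_up(submission_id, retry_count: retry_count) do
      submission.fail_with!(:build_not_found) # hypothetical terminal handler
    end
  end
end
```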

gitstart-tramline commented:

> It is true that the DSL provides a better dev experience, since it separates retry from core job logic, but I was hoping we'd do that without any external deps, just pure ruby. I'm not certain whether that's possible or not.

Thank you for the clarification. We will update the implementation to follow the approach used in V2::TriggerSubmissionsJob.

kitallis (Member) commented Jan 6, 2025

> If a worker fails in the middle of processing, Sidekiq will push the job back onto the queue, because it uses BRPOPLPUSH underneath, which blocks on the task until it's either completed or timed out.

On second thought, I was partly wrong about the above behavior: it only happens on Sidekiq Pro. See super_fetch.

However, that doesn't change the core point.

We can later fix the reliability issue by using a third-party gem like gitlab-reliable-fetch.

gitstart-tramline commented:

Hi @kitallis, we have now implemented the solution using plain Ruby code. Please review.
