Fix the job jitter/delay around sidekiq retries #695
Conversation
Hi @kitallis, we’d like your guidance on testing this PR. Could you help validate the scenario provided in the attached screenshot, or are there additional cases we should consider to ensure the fix is thoroughly tested? Additionally, are there specific benchmarks or indicators we should monitor to confirm the changes are working as intended?
I think this cannot be tested reliably at the unit level. We'll have to do some sort of real-time test. What we can do is issue 100s of retries for a particular job, perhaps.
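For instance, a burst like the following from a Rails console could exercise the retry path at volume, and the resulting delays would show up in the Scheduled/Retries tabs of the Sidekiq dashboard. This is only a sketch; the job class and argument are placeholders, not the exact invocation used here.

```ruby
# Enqueue many jobs that are expected to hit their retry path, then observe
# the scheduling behaviour (delay/jitter) in the Sidekiq dashboard.
100.times do |i|
  StoreSubmissions::TestFlight::FindBuildJob.perform_async("missing-build-#{i}")
end
```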
Separately, I think the idea for this change is generally fine, but the implementation seems too complicated. If this is your final approach, we should stop and re-evaluate, because I think there are issues with it that won't scale.
Thank you for your feedback. We are exploring other strategies to reduce the implementation complexity.
Thanks for the PR, but please look at V2::TriggerSubmissionsJob (as I'd mentioned in the issue as well). It's a very simple system: it recurses by incrementing retry counts in the params, so there's no need to maintain the jid or count state separately in Redis. It's really straightforward: if we fail, retry, and fail+exit if we run out of retries. I won't be able to merge this as it's unnecessarily complicated imo. Am I missing some inherent complexity here?
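A minimal sketch of that param-based retry pattern, assuming a Sidekiq job class; the names (FindBuildJob, BuildNotFoundYet, do_the_work, fail_and_exit, MAX_RETRIES) and backoff values are illustrative, not the actual V2::TriggerSubmissionsJob code:

```ruby
require "sidekiq"

# Hypothetical error raised while the store hasn't surfaced the build yet.
BuildNotFoundYet = Class.new(StandardError)

class FindBuildJob
  include Sidekiq::Job

  # Disable Sidekiq's built-in retry machinery; retries are handled explicitly below.
  sidekiq_options retry: 0

  MAX_RETRIES = 5

  def perform(build_id, retry_count = 0)
    do_the_work(build_id)
  rescue BuildNotFoundYet
    if retry_count < MAX_RETRIES
      # Re-enqueue ourselves with an incremented count and a delay; all retry
      # state travels in the job params, nothing is kept separately in Redis.
      self.class.perform_in(backoff_for(retry_count), build_id, retry_count + 1)
    else
      # Out of retries: record the failure and exit.
      fail_and_exit(build_id)
    end
  end

  private

  def backoff_for(retry_count)
    # Linear backoff in seconds; tune as needed.
    (retry_count + 1) * 120
  end

  def do_the_work(build_id)
    # Placeholder for the actual store lookup; raises BuildNotFoundYet when
    # the build hasn't appeared yet.
  end

  def fail_and_exit(build_id)
    # Placeholder for marking the submission as failed.
  end
end
```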
@kitallis
On the other hand, with our approach we can easily track where the jobs stopped and resume them from that exact point. Apart from that, the straightforward approach may run into scalability issues, because without a centralized record of retry state the system cannot scale tracking efficiently. Our current approach also lets us add features like real-time tracking, dynamic retries, or custom handling for different error types without touching the core job logic. Additionally, with a straightforward retry mechanism we won't be able to cover all edge cases in our RSpec tests, especially the ones related to job failures, because the jobs run in the background.
I think your line of thinking is fair, but the implementation doesn't do that much more to help that cause. I've responded inline:
If a worker fails in the middle of processing, Sidekiq will push the job back to the queue, because it uses super_fetch.
This is fair and worth considering, but I think that's a whole other can of worms: since this is a new mechanism, we won't be able to piggyback on the Sidekiq dashboard easily and will have to extend that too. Additionally, this current implementation suffers from the same worker-death problem: if the worker dies before you can write the job ID to Redis, we won't be able to recover or track retries anyway. The recoverability is probabilistic.
This should be possible, since in tests the jobs aren't actually enqueued in the background. In summary,
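For reference, a minimal sketch of how that retry path could be asserted in specs, assuming Sidekiq's fake testing mode and reusing the hypothetical do_the_work/BuildNotFoundYet names from the earlier sketch (the real job's internals will differ):

```ruby
require "sidekiq/testing"

RSpec.describe StoreSubmissions::TestFlight::FindBuildJob do
  around { |example| Sidekiq::Testing.fake! { example.run } }

  it "re-enqueues itself with an incremented retry count when the build is not found yet" do
    # do_the_work and BuildNotFoundYet are hypothetical stand-ins for the real
    # store lookup and the error raised while the build hasn't appeared yet.
    allow_any_instance_of(described_class)
      .to receive(:do_the_work).and_raise(BuildNotFoundYet)

    described_class.new.perform("build-id", 0)

    # In fake mode the re-enqueued job sits in an in-memory queue, so the
    # retry path (and the exhausted path) can be asserted synchronously.
    expect(described_class.jobs.size).to eq(1)
    expect(described_class.jobs.last["args"]).to eq(["build-id", 1])
  end
end
```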
Thank you for the clarification. We will update the implementation to follow the approach used in V2::TriggerSubmissionsJob. |
On second thought, I was partly wrong about the above behavior. The above only happens on Sidekiq Pro. See super_fetch. However, that doesn't change the core point. We can later fix the reliability issue by using a third-party gem like gitlab-reliable-fetch.
Hi @kitallis, we have now implemented the solution using plain Ruby code. Please review.
What this PR achieves:
Update these jobs:
app/jobs/store_submissions/test_flight/find_build_job.rb
app/jobs/store_submissions/app_store/find_build_job.rb
Remove the sidekiq_retry_in, sidekiq_retries_exhausted and sidekiq_retry_in_block hooks in favor of handling retries in plain Ruby inside the jobs (see the sketch below).
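For context, a rough sketch of the kind of class-level retry hooks being removed here; the class name stands in for the real FindBuildJob classes and the backoff values are illustrative, not the exact ones in this PR:

```ruby
require "sidekiq"

# Before: retry behaviour driven by Sidekiq's class-level hooks
# (stand-in for e.g. StoreSubmissions::TestFlight::FindBuildJob).
class FindBuildJob
  include Sidekiq::Job

  sidekiq_options retry: 5

  sidekiq_retry_in do |count, _exception|
    # Roughly exponential backoff plus jitter, controlled by Sidekiq's scheduler.
    (count**2) * 60 + rand(60)
  end

  sidekiq_retries_exhausted do |job, _exception|
    # Runs once Sidekiq gives up; failure handling lives outside perform.
    Rails.logger.error("could not find build for #{job["args"]}")
  end
end

# After: these hooks are gone and perform re-enqueues itself with an explicit
# retry count and delay (as in the earlier sketch), so the delay/jitter is
# fully controlled by the job itself.
```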