
Children of Parallel Superworkers may enqueue Parent's Next Subjob to Sidekiq multiple times #14

Open
thoughtchad opened this issue Mar 12, 2014 · 0 comments

Let me start with a thank-you to @socialpanda for Superworkers and Monitor! They make a sweet combo for what I am doing with Sidekiq.

The Issue:
If a superworker has sequential groups of parallel subjobs whose children all finish at roughly the same time (within milliseconds), the parent's next subjob can be enqueued multiple times. This could probably manifest in other ways, but this was the core symptom I was seeing. The root cause is in descendants_are_complete >> enqueue(next_subjob).

Example:

```ruby
Superworker
  parallel do
    quick_job
    quick_job
    quick_job
    quick_job
    quick_job
  end
  parallel do
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
  end
end
```

Sometimes, though not always, the second parallel set gets sent to Sidekiq multiple times. The more quick_jobs there are (or, generally, the more child subjobs that finish quickly and around the same time), the higher the probability this occurs.
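To make the race concrete, here is a minimal, self-contained simulation of the check-then-act pattern in the completion callback. The `ParallelGroup` class and its method names are illustrative stand-ins for sidekiq-superworker's internals, not its actual API; the point is only that nothing stops two children from both observing "all siblings complete" before either enqueues:

```ruby
# Minimal simulation of the descendants_are_complete >> enqueue race.
# Class and method names are hypothetical, not sidekiq-superworker's.
class ParallelGroup
  attr_reader :enqueue_count

  def initialize(child_count)
    @statuses = Array.new(child_count, 'running')
    @enqueue_count = 0
  end

  # Step 1 of a child's completion callback: mark itself complete.
  def mark_complete(i)
    @statuses[i] = 'complete'
  end

  # Step 2: check siblings and enqueue the parent's next subjob.
  # Nothing prevents two children from both passing this check.
  def check_and_enqueue
    @enqueue_count += 1 if @statuses.all? { |s| s == 'complete' }
  end
end

group = ParallelGroup.new(2)
# The interleaving seen when both children finish within milliseconds:
group.mark_complete(0)
group.mark_complete(1)   # second child completes before the first enqueues
group.check_and_enqueue  # first child: sees all complete, enqueues
group.check_and_enqueue  # second child: also sees all complete, enqueues again
group.enqueue_count # => 2
```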

The whole approach of this fix was to change as little code as possible; overall it is a band-aid. Let me know if you want a pull request.

thoughtchad/sidekiq-superworker@373d06a

The quick fix was to lock the next subjob's row and update it to an interim status of 'queued', so that multiple completing subjobs cannot pick it up, and then pass along the original next_subjob still carrying its original status of 'initialized'. I passed the original subjob because I did not want to change the enqueue logic, which only processes 'initialized' subjobs, and potentially break other areas.
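The pattern can be sketched without Rails. The `Subjob` class below is a stand-in for the ActiveRecord model, and a Ruby `Mutex` stands in for the database row lock used in the actual patch; only the first completing child wins the 'initialized' → 'queued' transition, so only it enqueues:

```ruby
# Sketch of the band-aid fix: atomically claim the next subjob with an
# interim 'queued' status. A Mutex stands in for the DB row lock here.
class Subjob
  def initialize
    @status = 'initialized'
    @lock = Mutex.new
  end

  # Returns true for exactly one caller; later callers see 'queued'.
  def claim_for_enqueue
    @lock.synchronize do
      return false unless @status == 'initialized'
      @status = 'queued'
      true
    end
  end
end

next_subjob = Subjob.new
enqueues = 0
counter_lock = Mutex.new

threads = 10.times.map do
  Thread.new do
    if next_subjob.claim_for_enqueue
      counter_lock.synchronize { enqueues += 1 } # stands in for the Sidekiq push
    end
  end
end
threads.each(&:join)
enqueues # => 1
```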

I think a better fix would involve thoroughly decoupling the subjob complete process from the enqueue process.

  1. Refactor the code so that each subjob only updates its own subjob record. Don't touch parents, relatives, children, etc.
  2. Create a separate process, in its own thread, that is responsible for monitoring parallel, batch, and superworker job states. This process would determine what's next, properly update subjob records, and send subjobs to Sidekiq. Overall this might entail using a proper queuing mechanism designed for highly parallel environments, and it could better support multi-node Sidekiq clients... it probably makes sense to move a portion of this into Redis.
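The two steps above can be sketched as follows. `SuperworkerMonitor` and its methods are hypothetical names for the proposed design, not anything in sidekiq-superworker today: workers report only their own completion, and a single monitor loop is the only code path that enqueues, so duplicate enqueues are impossible by construction:

```ruby
# Sketch of the proposed decoupling: workers update only their own
# record; one monitor decides what to enqueue next. All names are
# hypothetical.
class SuperworkerMonitor
  attr_reader :enqueued_groups

  def initialize(group_sizes)
    @group_sizes = group_sizes   # subjobs per sequential parallel block
    @done = Array.new(group_sizes.size, 0)
    @current = 0
    @enqueued_groups = [0]       # group 0 is enqueued up front
  end

  # Workers call only this; they never touch sibling or parent records.
  def subjob_finished(group)
    @done[group] += 1
  end

  # Run in the monitor's own loop: the only place that enqueues.
  def tick
    return if @current >= @group_sizes.size - 1
    if @done[@current] == @group_sizes[@current]
      @current += 1
      @enqueued_groups << @current   # stands in for pushing the group to Sidekiq
    end
  end
end

monitor = SuperworkerMonitor.new([5, 5])
5.times { monitor.subjob_finished(0) }
monitor.tick
monitor.tick   # extra ticks are harmless: group 1 is enqueued exactly once
monitor.enqueued_groups # => [0, 1]
```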

Side notes:
A. During debugging and testing, I implemented optimistic locking on the sidekiq_superworker_subjobs table by adding a lock_version column and setting ActiveRecord::Base.lock_optimistically = true. The overall result is that there are several places where the code is updating subjob records that are stale. Even though most of the time it is just the updated_at field that is stale, the results show that "completes" by many subjobs end up touching a wide swath of subjob records.
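What optimistic locking surfaces can be shown without ActiveRecord. `VersionedRow` below simulates a row with a lock_version column, and `StaleRecord` mimics ActiveRecord::StaleObjectError; two subjobs load the same record, the first write wins, and the second (stale) write raises instead of silently clobbering:

```ruby
# Simulation of optimistic locking: StaleRecord stands in for
# ActiveRecord::StaleObjectError; the real setup is a lock_version
# column plus ActiveRecord::Base.lock_optimistically = true.
class StaleRecord < StandardError; end

class VersionedRow
  def initialize
    @store = { status: 'initialized', lock_version: 0 }
  end

  def load
    @store.dup
  end

  def save(copy)
    raise StaleRecord if copy[:lock_version] != @store[:lock_version]
    copy[:lock_version] += 1
    @store = copy.dup
  end
end

row = VersionedRow.new
a = row.load
b = row.load              # a second subjob loads the same record
a[:status] = 'complete'
row.save(a)               # first write wins and bumps lock_version
b[:status] = 'complete'
begin
  row.save(b)             # stale copy: raises, like StaleObjectError
rescue StaleRecord
  # the caller would have to reload and retry
end
```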

B. Another tidbit I noticed was mixed use of subjob.update_attribute and subjob.update_attributes. These two methods differ quite a bit in what they do, and there are some differences between Rails 3 and 4 as well. I would suggest using update_attributes consistently. Also, those methods just return true/false, even when an ActiveRecord error occurs. Perhaps the update_attributes! approach, to be safe?
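For reference, the Rails 3/4 semantics the note refers to, simulated here with a plain Ruby class (FakeModel is not ActiveRecord, and ArgumentError stands in for ActiveRecord::RecordInvalid): update_attribute skips validations entirely, update_attributes runs them and returns false on failure, and update_attributes! raises instead:

```ruby
# Simulation of the three update methods' semantics (Rails 3/4):
#   update_attribute(name, value)  - skips validations, writes directly
#   update_attributes(hash)        - validates, returns false on failure
#   update_attributes!(hash)       - validates, raises on failure
class FakeModel
  attr_reader :status

  VALID_STATUSES = %w[initialized queued running complete]

  def initialize
    @status = 'initialized'
  end

  def valid?(attrs)
    VALID_STATUSES.include?(attrs[:status])
  end

  def update_attribute(name, value)   # no validation: bad values slip through
    instance_variable_set("@#{name}", value)
    true
  end

  def update_attributes(attrs)        # validates; false on failure
    return false unless valid?(attrs)
    attrs.each { |k, v| instance_variable_set("@#{k}", v) }
    true
  end

  def update_attributes!(attrs)       # validates; raises on failure
    raise ArgumentError, 'record invalid' unless update_attributes(attrs)
    true
  end
end

m = FakeModel.new
m.update_attribute(:status, 'bogus')    # => true, invalid value written anyway
m.update_attributes(status: 'queued')   # => true
m.update_attributes(status: 'bogus')    # => false, change silently dropped
```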

C. FYI, our environment:
JDK7_25, TorqueBox 3.0, JRuby 1.7.11, Rails 4.0.3, ActiveRecord on MySQL 5.1.69
Sidekiq 2.17.7, Sidetiq 0.5.0, Sidekiq-benchmark, Sidekiq-failures, Sidekiq-unique-jobs
Sidekiq Client concurrency 30
