
Children of Parallel Superworkers may enqueue Parent's Next Subjob to Sidekiq multiple times #14

Open
thoughtchad opened this issue Mar 12, 2014 · 0 comments

Let me start with a thank-you to @socialpanda for Superworkers and Monitor! They make a sweet combo for what I am doing with Sidekiq.

The Issue:
If a superworker has sequential groups of parallel subjobs whose children all finish at roughly the same time (within milliseconds), the parent's next subjob can be enqueued multiple times. This could probably manifest in other ways, but this was the core symptom I was seeing. The root cause is in descendants_are_complete >> enqueue(next_subjob).

Example:

```ruby
Superworker
  parallel do
    quick_job
    quick_job
    quick_job
    quick_job
    quick_job
  end
  parallel do
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
  end
end
```

Sometimes, though not always, the second parallel set gets sent to Sidekiq multiple times. The more quick_jobs there are (or, generally, the more child subjobs that finish quickly and around the same time), the higher the probability this occurs.
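To make the race concrete, here is a minimal, self-contained simulation of the check-then-act pattern in the completion callback. The `ParallelGroup` class and its method names are illustrative stand-ins for sidekiq-superworker's internals, not its actual API; the point is only that nothing stops two children from both observing "all siblings complete" before either enqueues:

```ruby
# Minimal simulation of the descendants_are_complete >> enqueue race.
# Class and method names are hypothetical, not sidekiq-superworker's.
class ParallelGroup
  attr_reader :enqueue_count

  def initialize(child_count)
    @statuses = Array.new(child_count, 'running')
    @enqueue_count = 0
  end

  # Step 1 of a child's completion callback: mark itself complete.
  def mark_complete(i)
    @statuses[i] = 'complete'
  end

  # Step 2: check siblings and enqueue the parent's next subjob.
  # Nothing prevents two children from both passing this check.
  def check_and_enqueue
    @enqueue_count += 1 if @statuses.all? { |s| s == 'complete' }
  end
end

group = ParallelGroup.new(2)
# The interleaving seen when both children finish within milliseconds:
group.mark_complete(0)
group.mark_complete(1)   # second child completes before the first enqueues
group.check_and_enqueue  # first child: sees all complete, enqueues
group.check_and_enqueue  # second child: also sees all complete, enqueues again
group.enqueue_count # => 2
```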

The whole approach of this fix was to change as little code as possible; overall it is a band-aid. Let me know if you want a pull request.

thoughtchad/sidekiq-superworker@373d06a

The quick fix was to lock the next subjob's row and update it to an interim status of 'queued', so that multiple completing subjobs cannot pick it up, and then pass along the original next_subjob still carrying its original status of 'initialized'. I passed the original subjob because I did not want to change the enqueue logic, which only processes 'initialized' subjobs, and potentially break other areas.
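The pattern can be sketched without Rails. The `Subjob` class below is a stand-in for the ActiveRecord model, and a Ruby `Mutex` stands in for the database row lock used in the actual patch; only the first completing child wins the 'initialized' → 'queued' transition, so only it enqueues:

```ruby
# Sketch of the band-aid fix: atomically claim the next subjob with an
# interim 'queued' status. A Mutex stands in for the DB row lock here.
class Subjob
  def initialize
    @status = 'initialized'
    @lock = Mutex.new
  end

  # Returns true for exactly one caller; later callers see 'queued'.
  def claim_for_enqueue
    @lock.synchronize do
      return false unless @status == 'initialized'
      @status = 'queued'
      true
    end
  end
end

next_subjob = Subjob.new
enqueues = 0
counter_lock = Mutex.new

threads = 10.times.map do
  Thread.new do
    if next_subjob.claim_for_enqueue
      counter_lock.synchronize { enqueues += 1 } # stands in for the Sidekiq push
    end
  end
end
threads.each(&:join)
enqueues # => 1
```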

I think a better fix would involve thoroughly decoupling the subjob complete process from the enqueue process.

  1. Refactor the code so that each subjob only updates its own subjob record. Don't touch parents, relatives, children, etc.
  2. Create a separate process, in its own thread, that is responsible for monitoring parallel, batch, and superworker job states. This process would determine what's next, properly update subjob records, and send subjobs to Sidekiq. Overall this might entail using a proper queuing mechanism designed for highly parallel environments, and it could better support multi-node Sidekiq clients... it probably makes sense to move a portion of this into Redis.
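The two steps above can be sketched as follows. `SuperworkerMonitor` and its methods are hypothetical names for the proposed design, not anything in sidekiq-superworker today: workers report only their own completion, and a single monitor loop is the only code path that enqueues, so duplicate enqueues are impossible by construction:

```ruby
# Sketch of the proposed decoupling: workers update only their own
# record; one monitor decides what to enqueue next. All names are
# hypothetical.
class SuperworkerMonitor
  attr_reader :enqueued_groups

  def initialize(group_sizes)
    @group_sizes = group_sizes   # subjobs per sequential parallel block
    @done = Array.new(group_sizes.size, 0)
    @current = 0
    @enqueued_groups = [0]       # group 0 is enqueued up front
  end

  # Workers call only this; they never touch sibling or parent records.
  def subjob_finished(group)
    @done[group] += 1
  end

  # Run in the monitor's own loop: the only place that enqueues.
  def tick
    return if @current >= @group_sizes.size - 1
    if @done[@current] == @group_sizes[@current]
      @current += 1
      @enqueued_groups << @current   # stands in for pushing the group to Sidekiq
    end
  end
end

monitor = SuperworkerMonitor.new([5, 5])
5.times { monitor.subjob_finished(0) }
monitor.tick
monitor.tick   # extra ticks are harmless: group 1 is enqueued exactly once
monitor.enqueued_groups # => [0, 1]
```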

Side notes:
A. During debugging and testing, I implemented optimistic locking on the sidekiq_superworker_subjobs table by adding a lock_version column and setting ActiveRecord::Base.lock_optimistically = true. The overall result is that there are several places where the code is updating subjob records that are stale. Even though most of the time it is just the updated_at field that is stale, the results show that "completes" by many subjobs end up touching a wide swath of subjob records.
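What optimistic locking surfaces can be shown without ActiveRecord. `VersionedRow` below simulates a row with a lock_version column, and `StaleRecord` mimics ActiveRecord::StaleObjectError; two subjobs load the same record, the first write wins, and the second (stale) write raises instead of silently clobbering:

```ruby
# Simulation of optimistic locking: StaleRecord stands in for
# ActiveRecord::StaleObjectError; the real setup is a lock_version
# column plus ActiveRecord::Base.lock_optimistically = true.
class StaleRecord < StandardError; end

class VersionedRow
  def initialize
    @store = { status: 'initialized', lock_version: 0 }
  end

  def load
    @store.dup
  end

  def save(copy)
    raise StaleRecord if copy[:lock_version] != @store[:lock_version]
    copy[:lock_version] += 1
    @store = copy.dup
  end
end

row = VersionedRow.new
a = row.load
b = row.load              # a second subjob loads the same record
a[:status] = 'complete'
row.save(a)               # first write wins and bumps lock_version
b[:status] = 'complete'
begin
  row.save(b)             # stale copy: raises, like StaleObjectError
rescue StaleRecord
  # the caller would have to reload and retry
end
```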

B. Another tidbit I noticed was mixed use of subjob.update_attribute and subjob.update_attributes. These two methods differ quite a bit in what they do, and there are some differences between Rails 3 and 4 as well. I would suggest using update_attributes consistently. Also, those methods just return true/false, even when an ActiveRecord error occurs. Perhaps the update_attributes! approach, to be safe?
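For reference, the Rails 3/4 semantics the note refers to, simulated here with a plain Ruby class (FakeModel is not ActiveRecord, and ArgumentError stands in for ActiveRecord::RecordInvalid): update_attribute skips validations entirely, update_attributes runs them and returns false on failure, and update_attributes! raises instead:

```ruby
# Simulation of the three update methods' semantics (Rails 3/4):
#   update_attribute(name, value)  - skips validations, writes directly
#   update_attributes(hash)        - validates, returns false on failure
#   update_attributes!(hash)       - validates, raises on failure
class FakeModel
  attr_reader :status

  VALID_STATUSES = %w[initialized queued running complete]

  def initialize
    @status = 'initialized'
  end

  def valid?(attrs)
    VALID_STATUSES.include?(attrs[:status])
  end

  def update_attribute(name, value)   # no validation: bad values slip through
    instance_variable_set("@#{name}", value)
    true
  end

  def update_attributes(attrs)        # validates; false on failure
    return false unless valid?(attrs)
    attrs.each { |k, v| instance_variable_set("@#{k}", v) }
    true
  end

  def update_attributes!(attrs)       # validates; raises on failure
    raise ArgumentError, 'record invalid' unless update_attributes(attrs)
    true
  end
end

m = FakeModel.new
m.update_attribute(:status, 'bogus')    # => true, invalid value written anyway
m.update_attributes(status: 'queued')   # => true
m.update_attributes(status: 'bogus')    # => false, change silently dropped
```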

C. FYI, our environment:
JDK7_25, TorqueBox 3.0, JRuby 1.7.11, Rails 4.0.3, ActiveRecord on MySQL 5.1.69
Sidekiq 2.17.7, Sidetiq 0.5.0, Sidekiq-benchmark, Sidekiq-failures, Sidekiq-unique-jobs
Sidekiq Client concurrency 30
