Let me start this with a thank you to @socialpanda for Superworkers and Monitor! They make a sweet combo for what I am doing with Sidekiq.
The Issue:
If a superworker has sequential groups of parallel subjobs whose children all finish at nearly the same time (within milliseconds), it is possible for the next subjob after the group to be enqueued multiple times. This could probably manifest itself in other ways, but this was the core symptom I was seeing. The root cause is in descendants_are_complete >> enqueue(next_subjob).
Example:
```ruby
Superworker
  parallel do
    quick_job
    quick_job
    quick_job
    quick_job
    quick_job
  end
  parallel do
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
  end
end
```
Sometimes (not always) the second parallel set will get sent to Sidekiq multiple times. The more quick_jobs there are (or, generally, the more child subjobs that finish quickly and/or around the same time), the higher the probability that this occurs.
The whole approach of this fix was to change as little code as possible; overall the fix is a band-aid. Let me know if you want a pull request.
thoughtchad/sidekiq-superworker@373d06a
The quick fix was to lock the next subjob's row and update it to an interim status of 'queued' to ensure that multiple completing subjobs would not pick it up, then pass along the original next_subjob object with its original status of 'initialized'. I sent the original subjob because I did not want to change the enqueue logic, which only processes 'initialized' subjobs, and potentially break other areas of the logic.
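The race and the locking fix can be sketched without ActiveRecord. This is a simplified stand-in, not the gem's real code: the Mutex plays the role of the DB row lock (SELECT ... FOR UPDATE), and NextSubjob plays the role of the subjob row:

```ruby
# Simplified stand-in for a subjob row; not the gem's real class.
NextSubjob = Struct.new(:status)

next_subjob   = NextSubjob.new("initialized")
enqueue_count = 0
row_lock      = Mutex.new # stands in for the DB row lock

# Ten "completing" subjobs all observe that their siblings are done and
# race to enqueue the next subjob. The check-and-set under the lock
# flips the status to the interim 'queued', so only one thread wins.
threads = 10.times.map do
  Thread.new do
    won = row_lock.synchronize do
      if next_subjob.status == "initialized"
        next_subjob.status = "queued" # interim status blocks the others
        true
      else
        false
      end
    end
    # Only the winner "enqueues" (here, bumps a counter). In the real fix
    # the winner passes along a subjob still marked 'initialized', since
    # the existing enqueue logic only processes 'initialized' subjobs.
    enqueue_count += 1 if won
  end
end
threads.each(&:join)

puts enqueue_count # => 1
```

Without the check-and-set, every thread that sees the last sibling complete fires its own enqueue, which is exactly the duplicate-enqueue symptom above.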
I think a better fix would involve thoroughly decoupling the subjob-complete process from the enqueue process:
1. Refactor the code so that each subjob only updates its own subjob record. Don't touch parents, relatives, children, etc.
2. Create a separate process, in its own thread, that is responsible for monitoring parallel, batch, and superworker job states. This process would determine what's next, properly update subjob records, and send subjobs to Sidekiq. Overall this might entail using a proper queuing mechanism designed for highly parallel environments, and it could better support multi-node Sidekiq clients... it probably makes sense to move a portion of this into Redis.
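As a rough in-process sketch of that direction (all names here are made up for illustration; a production version would keep this state in Redis rather than in Ruby objects), a single monitor thread can own every state transition, so completing subjobs never inspect or update each other's records:

```ruby
completions = Queue.new # completing subjobs only report themselves here
enqueued    = []        # stand-in for what the monitor hands to Sidekiq

BATCH_SIZE = 5

# Single monitor thread: the only place that inspects job state and
# decides what runs next, so there is no check-then-act race between
# completing subjobs.
monitor = Thread.new do
  done = 0
  while done < BATCH_SIZE
    completions.pop # blocks until some subjob reports completion
    done += 1
  end
  enqueued << :next_parallel_group # fires exactly once per batch
end

# Five parallel subjobs finish in any order; each one only reports
# its own completion and touches nothing else.
workers = BATCH_SIZE.times.map { Thread.new { completions << :done } }
workers.each(&:join)
monitor.join

puts enqueued.size # => 1
```

The design point is that "who is complete" and "what to enqueue next" live in one consumer, so no amount of simultaneous completions can double-fire the next stage.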
Side notes:
A. During debugging and testing, I implemented optimistic locking on the sidekiq_superworker_subjobs table by adding a lock_version column and setting ActiveRecord::Base.lock_optimistically = true. The overall result is that there are several places where the code is actually updating subjob records that are stale. Even though most of the time it is just the updated_at field that is stale, the results show that the "complete" handling of many subjobs ends up touching a wide swath of subjob records.
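For anyone unfamiliar with how that surfaces: optimistic locking makes stale writes raise instead of silently clobbering each other. A toy compare-and-swap version of ActiveRecord's lock_version check (this is an illustration of the mechanism, not ActiveRecord itself):

```ruby
# Toy model of ActiveRecord's lock_version behaviour: an update only
# succeeds if the version the caller loaded still matches the "row".
StaleObjectError = Class.new(StandardError)

row = { status: "initialized", lock_version: 0 }

def update(row, loaded_version, attrs)
  raise StaleObjectError if row[:lock_version] != loaded_version
  row.merge!(attrs)
  row[:lock_version] += 1
end

copy_a = row[:lock_version] # two code paths load the same subjob row
copy_b = row[:lock_version]

update(row, copy_a, status: "complete") # first writer wins

begin
  update(row, copy_b, status: "complete") # second writer is now stale
rescue StaleObjectError
  puts "stale update rejected"
end
```

With lock_optimistically on, every one of those "touching a stale record" paths shows up as a raised error, which is how the wide fan-out of updates became visible.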
B. Another tidbit I noticed was mixed use of subjob.update_attribute and subjob.update_attributes. These two methods differ quite a bit in what they do (update_attribute skips validations, update_attributes runs them), and there are some differences between Rails 3 and 4 as well. I would suggest just using update_attributes for consistency. Also, those methods only return true/false, even when the update fails. Perhaps use update_attributes!, which raises on failure, just to be safe?
C. FYI, our environment:
JDK 7u25, TorqueBox 3.0, JRuby 1.7.11, Rails 4.0.3, ActiveRecord on MySQL 5.1.69
Sidekiq 2.17.7, Sidetiq 0.5.0, sidekiq-benchmark, sidekiq-failures, sidekiq-unique-jobs
Sidekiq client concurrency: 30