Fix cpool_delete() #8627
Conversation
When setting the modification marker on the 'prev' field of a carrier to be deleted from a pool, we back off and wait for the content of the field to reach the expected value if it did not have that value from the beginning. Due to a copy-paste bug, when this happened we waited on a completely different memory location, which caused the scheduler thread doing this to get stuck forever. This is obviously a very rare scenario, since the bug has been present for 11 years without being reported.
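To illustrate the failure mode described above, here is a minimal, self-contained C11 sketch of the general pattern, assuming a simplified lock-free list node; the names (carrier_t, MOD_MARKER, the two functions) are hypothetical and this is not the actual erl_alloc_util.c code. It shows how marking the prev field while backing off on the next field makes the spin-wait target a memory location that may never take the expected value.

```c
#include <stdint.h>
#include <stdatomic.h>
#include <sched.h>

#define MOD_MARKER ((uintptr_t)1)  /* low tag bit used as modification marker */

/* Hypothetical, simplified carrier node with tagged prev/next pointers. */
typedef struct carrier {
    _Atomic uintptr_t next;
    _Atomic uintptr_t prev;
} carrier_t;

/* Buggy pattern (copy-paste error): the CAS targets crr->prev, but on
 * failure the back-off loop waits on crr->next. That field may never
 * take the expected value, so the calling scheduler thread spins forever. */
static void delete_mark_prev_buggy(carrier_t *crr, uintptr_t expected_prev)
{
    uintptr_t exp = expected_prev;
    while (!atomic_compare_exchange_strong(&crr->prev, &exp,
                                           expected_prev | MOD_MARKER)) {
        /* BUG: should re-check crr->prev, not crr->next */
        while (atomic_load(&crr->next) != expected_prev)
            sched_yield();
        exp = expected_prev;
    }
}

/* Fixed pattern: back off on the same field that is being marked. */
static void delete_mark_prev_fixed(carrier_t *crr, uintptr_t expected_prev)
{
    uintptr_t exp = expected_prev;
    while (!atomic_compare_exchange_strong(&crr->prev, &exp,
                                           expected_prev | MOD_MARKER)) {
        while (atomic_load(&crr->prev) != expected_prev)
            sched_yield();
        exp = expected_prev;
    }
}
```

The fix amounts to re-checking the same field that the compare-and-swap operates on, as in delete_mark_prev_fixed above.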
CT Test Results: 3 files, 141 suites, 49m 1s ⏱️. For more details on these failures, see this check. Results for commit e176896. ♻️ This comment has been updated with latest results.
Merged to maint and master for OTP-27.1 and OTP-28.0.
On 21st June we had an issue on a single node in a non-production cluster (running Riak with Erlang 24.3.4.17). On this node, 2 CPU cores (numbers 5 and 12 of 16) went immediately to 100% usage, despite a relatively small amount of background traffic (one would expect about 5% utilisation at that time). The two cores then remained locked at 100% CPU usage (all user time, no sys or wait time) and memory escalated. 23 minutes later a third core suddenly went to 100% utilisation. Eventually memory usage escalated to such an extent that the OOM killer intervened. As this was a single node in the cluster, no operator intervened (the cluster overall continued to operate). The question we have is whether this is the same issue as #8613 and might be resolved by this PR? Our investigation has ruled out known causes within Riak (but not unknowns!) and the other potential issue we've seen with the VM (such as the 100% core utilisation caused by a hanging remote shell, #4343). We don't have any detailed debugging from the node at the time, so I appreciate that might be an impossible question to answer. So primarily I'm interested to know: if we see these symptoms again, what should we do to try and grab information relevant to determining whether this correlates to the issue (and so would be fixed by the PR)? Note that this is the first known instance we've seen of these conditions, but we recently moved from OTP 22 to OTP 24.