Fix cpool_delete() #8627
Conversation
When setting the modification marker on the 'prev' field of a carrier to be deleted from a pool, we back off and wait for the content of the field to reach the expected value if it did not have that value from the beginning. Due to a copy-paste bug, when this happened we waited on a completely different memory location, which caused the scheduler thread doing this to get stuck forever. This is obviously a very rare scenario, since the bug has been present for 11 years without being reported.
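To illustrate the failure mode described above, here is a minimal, self-contained C11 sketch of the general pattern, assuming a simplified lock-free list node; the names (carrier_t, MOD_MARKER, the two functions) are hypothetical and this is not the actual erl_alloc_util.c code. It shows how marking the prev field while backing off on the next field makes the spin-wait target a memory location that may never take the expected value.

```c
#include <stdint.h>
#include <stdatomic.h>
#include <sched.h>

#define MOD_MARKER ((uintptr_t)1)  /* low tag bit used as modification marker */

/* Hypothetical, simplified carrier node with tagged prev/next pointers. */
typedef struct carrier {
    _Atomic uintptr_t next;
    _Atomic uintptr_t prev;
} carrier_t;

/* Buggy pattern (copy-paste error): the CAS targets crr->prev, but on
 * failure the back-off loop waits on crr->next. That field may never
 * take the expected value, so the calling scheduler thread spins forever. */
static void delete_mark_prev_buggy(carrier_t *crr, uintptr_t expected_prev)
{
    uintptr_t exp = expected_prev;
    while (!atomic_compare_exchange_strong(&crr->prev, &exp,
                                           expected_prev | MOD_MARKER)) {
        /* BUG: should re-check crr->prev, not crr->next */
        while (atomic_load(&crr->next) != expected_prev)
            sched_yield();
        exp = expected_prev;
    }
}

/* Fixed pattern: back off on the same field that is being marked. */
static void delete_mark_prev_fixed(carrier_t *crr, uintptr_t expected_prev)
{
    uintptr_t exp = expected_prev;
    while (!atomic_compare_exchange_strong(&crr->prev, &exp,
                                           expected_prev | MOD_MARKER)) {
        while (atomic_load(&crr->prev) != expected_prev)
            sched_yield();
        exp = expected_prev;
    }
}
```

The fix amounts to re-checking the same field that the compare-and-swap operates on, as in delete_mark_prev_fixed above.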
CT Test Results: 3 files, 141 suites, 49m 1s ⏱️. For more details on these failures, see this check. Results for commit e176896. ♻️ This comment has been updated with latest results.
Merged to maint and master for OTP-27.1 and OTP-28.0.
On 21st June we had an issue on a single node in a non-production cluster (running Riak with Erlang 24.3.4.17). On this node, 2 CPU cores (numbers 5 and 12 of 16) went immediately to 100% usage, despite a relatively small amount of background traffic (one would expect about 5% utilisation at that time). The two cores then remained locked at 100% CPU usage (all user time, no sys or wait time) and memory escalated. 23 minutes later a third core suddenly went to 100% utilisation. Eventually memory usage escalated to such an extent that the OOM killer intervened. As this was a single node in the cluster, no operator intervened (the cluster overall continued to operate). The question we have is whether this is the same issue as #8613 and might be resolved by this PR? Our investigation has ruled out known causes within Riak (but not unknowns!) and the other potential issue we've seen with the VM (such as the 100% core utilisation caused by a hanging remote shell, #4343). We don't have any detailed debugging from the node at the time, so I appreciate that might be an impossible question to answer. So primarily I'm interested to know: if we see these symptoms again, what should we do to try and grab information relevant to determining whether this correlates to the issue (and so would be fixed by the PR)? Note that this is the first known instance we've seen of these conditions, but we recently moved from OTP 22 to OTP 24.