[CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory #3018

leixm · 2024-12-20T10:12:13Z

What changes were proposed in this pull request?

Congestion and MemoryManager should use pinnedDirectMemory instead of usedDirectMemory

Why are the changes needed?

In our production environment, after a worker pauses, usedDirectMemory stays high and does not decrease. The worker node is permanently blacklisted and cannot be used.

This problem has been bothering us for a long time. Even when the thread cache is turned off, Netty still holds some ByteBufs after ctx.channel().config().setAutoRead(false). These ByteBufs prevent a lot of PoolChunks from being released.

In Netty, if a chunk is 16M and 8k of it has been allocated, then pinnedMemory is 8k and activeMemory is 16M. The remaining (16M-8k) is still available for allocation but has not been allocated yet. Because Netty allocates and releases memory in chunk units, the 8k that is in use keeps the whole 16M from being returned to the operating system.
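To make the pinned/active distinction concrete, here is a minimal sketch of a toy chunk pool. It is not Netty's implementation; the class and method names (`ChunkPoolSketch`, `allocateFromNewChunk`, `activeBytes`, `pinnedBytes`) are made up for illustration, and the model only tracks the two counters discussed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a chunk-based pool; hypothetical names, NOT Netty's implementation. */
final class ChunkPoolSketch {
    /** One pooled chunk: a fixed reservation with some bytes handed out to live buffers. */
    static final class Chunk {
        final long capacity;
        long pinned; // bytes currently handed out from this chunk

        Chunk(long capacity) {
            this.capacity = capacity;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();

    /** Reserves a new chunk of chunkSize bytes and hands out `bytes` from it. */
    Chunk allocateFromNewChunk(long chunkSize, long bytes) {
        if (bytes < 0 || bytes > chunkSize) {
            throw new IllegalArgumentException("bytes must fit in one chunk");
        }
        Chunk chunk = new Chunk(chunkSize);
        chunk.pinned = bytes;
        chunks.add(chunk);
        return chunk;
    }

    /** Bytes reserved by the pool: every chunk counts in full. */
    long activeBytes() {
        return chunks.stream().mapToLong(c -> c.capacity).sum();
    }

    /** Bytes actually in use by live buffers. */
    long pinnedBytes() {
        return chunks.stream().mapToLong(c -> c.pinned).sum();
    }

    public static void main(String[] args) {
        ChunkPoolSketch pool = new ChunkPoolSketch();
        pool.allocateFromNewChunk(16L * 1024 * 1024, 8L * 1024); // 8k out of a 16M chunk
        System.out.println("active=" + pool.activeBytes() + " pinned=" + pool.pinnedBytes());
    }
}
```

In this toy model a single 8k buffer keeps 16M counted as active, which is the gap between the two counters that this PR is about.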

Here are some scenes from our production/test environment:

We configured 10 GB of off-heap memory for the worker; the other configs are as follows:

celeborn.network.memory.allocator.allowCache                         false
celeborn.worker.monitor.memory.check.interval                         100ms
celeborn.worker.monitor.memory.report.interval                        10s
celeborn.worker.directMemoryRatioToPauseReceive                       0.75
celeborn.worker.directMemoryRatioToPauseReplicate                     0.85
celeborn.worker.directMemoryRatioToResume                             0.5
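As a rough illustration of what these ratios mean in bytes, the sketch below multiplies the 10 GB limit by each ratio. The helper name `threshold` and the assumption that a threshold is simply `maxDirectMemory * ratio` are mine, not taken from the Celeborn source.

```java
/** Hypothetical helper: turns the ratios above into byte thresholds. */
final class ThresholdSketch {
    /** Assumes a threshold is simply maxDirectMemory * ratio, truncated to a long. */
    static long threshold(long maxDirectMemory, double ratio) {
        return (long) (maxDirectMemory * ratio);
    }

    public static void main(String[] args) {
        long maxDirectMemory = 10L * 1024 * 1024 * 1024; // 10 GB off-heap, read as GiB here
        long mib = 1024L * 1024;
        System.out.println("pause receive   >= " + threshold(maxDirectMemory, 0.75) / mib + " MiB");
        System.out.println("pause replicate >= " + threshold(maxDirectMemory, 0.85) / mib + " MiB");
        System.out.println("resume          <  " + threshold(maxDirectMemory, 0.5) / mib + " MiB");
    }
}
```

With these settings the worker pauses receiving at about 7.5 GB of used direct memory, pauses replication at about 8.5 GB, and can only resume once usage drops below about 5 GB.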

Under high traffic, the worker's usedDirectMemory increases. After trim and pause are triggered, usedDirectMemory still does not drop below the resume threshold, and the worker is excluded.

So we checked a heap snapshot of the abnormal worker and saw a large number of DirectByteBuffers in heap memory. These DirectByteBuffers are all 4 MB in size, which is exactly the chunk size. According to the path to GC root, each DirectByteBuffer is held by a PoolChunk, and these 4 MB chunks have only 160k pinnedBytes.
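A back-of-the-envelope calculation shows how far apart the two counters can drift. The sketch below assumes, purely for illustration, that the whole 10 GB limit is filled with 4 MB chunks that each look like the one in the snapshot (160k pinned); the real distribution across chunks is not stated here, and the names are hypothetical.

```java
/** Worked arithmetic under a stated assumption: every chunk has 160k pinned out of 4 MB. */
final class PinnedRatioSketch {
    static final long CHUNK_SIZE = 4L * 1024 * 1024;     // 4 MB chunk, as seen in the snapshot
    static final long PINNED_PER_CHUNK = 160L * 1024;    // 160k pinned bytes per chunk
    static final long MAX_DIRECT = 10L * 1024 * 1024 * 1024; // 10 GB off-heap limit

    /** Fraction of one chunk that is really in use. */
    static double pinnedFraction() {
        return PINNED_PER_CHUNK / (double) CHUNK_SIZE;
    }

    /** Number of chunks needed to fill the whole limit. */
    static long chunkCount() {
        return MAX_DIRECT / CHUNK_SIZE;
    }

    /** Total pinned bytes if every chunk looked like the one in the snapshot. */
    static long totalPinned() {
        return chunkCount() * PINNED_PER_CHUNK;
    }

    public static void main(String[] args) {
        System.out.println("pinned fraction per chunk = " + pinnedFraction());
        System.out.println("chunks = " + chunkCount());
        System.out.println("total pinned MiB = " + totalPinned() / (1024 * 1024));
    }
}
```

Under that assumption, used direct memory would sit at the full limit while only about 400 MB (roughly 4%) is pinned, so a resume check based on used memory would never pass even though almost all of the pooled memory is free to allocate.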

There are many ByteBufs that are not released

The stack shows that these ByteBufs are allocated by netty

We tried to reproduce this situation in the test environment. When the same problem occurred, we added a RESTful API to the worker to force it to resume. After the resume, the worker returned to normal, and PushDataHandler handled many delayed requests.

So I think that when pinnedMemory is not high, we should not trigger pause and congestion, because a large part of the memory can still be allocated.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

leixm · 2024-12-20T10:12:52Z

Before optimization

After optimization

FMX · 2024-12-21T03:06:07Z

I have a question. Will the worker OOM when the pinned memory is high? In our previous implementations, the direct memory counter counts all allocated direct memory, whether or not the allocator can still allocate from it.
Can you run a stress test for this scenario?

### What changes were proposed in this pull request?

This PR introduces a configuration `celeborn.network.memory.allocator.pooled` to allow users to disable `PooledByteBufAllocator` globally and always use `UnpooledByteBufAllocator`.

### Why are the changes needed?

In some extreme cases, Netty's `PooledByteBufAllocator` might hold tons of 4MiB chunks of which only a small part of the capacity is used by real data (see #3018). For scenarios where stability is more important than performance, it's desirable to allow users to disable the `PooledByteBufAllocator` globally.

### Does this PR introduce _any_ user-facing change?

Add a new feature, disabled by default.

### How was this patch tested?

Pass UT to ensure correctness. Performance and memory impact need to be verified in a production-scale cluster.

Closes #3043 from pan3793/CELEBORN-1815.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>

RexXiong · 2025-01-03T03:20:06Z

I discussed this with FMX offline, and we believe it might be better to use pinnedDirectMemory for worker resumption while keeping usedDirectMemory for trimming. @leixm @AngersZhuuuu WDYT?

leixm · 2025-01-03T03:41:45Z

> I discussed this with FMX offline, and we believe it might be better to use pinnedDirectMemory for worker resumption while keeping usedDirectMemory for trimming. @leixm @AngersZhuuuu WDYT?

It's okay for me. @AngersZhuuuu WDYT?
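The split proposed above can be sketched as two independent predicates: trimming still keys off used direct memory, while resuming may also pass when pinned memory is low. This is only an illustration of the idea, not the merged `MemoryManager` logic; the class name, method names, and the 0.3 pinned ratio used in `main` are hypothetical.

```java
/** Sketch of the proposal: trim on used memory, resume on used OR pinned memory. */
final class ResumeDecisionSketch {
    /** Trimming stays tied to used direct memory reaching the pause ratio. */
    static boolean shouldTrim(long used, long max, double pauseRatio) {
        return used / (double) max >= pauseRatio;
    }

    /** Resume if used memory dropped enough, or if only a small part is really pinned. */
    static boolean canResume(
            long used, long pinned, long max, double directResumeRatio, double pinnedResumeRatio) {
        return used / (double) max < directResumeRatio
                || pinned / (double) max < pinnedResumeRatio;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        long max = 10 * gib;
        long used = 9 * gib;   // chunks still held by the pool
        long pinned = 1 * gib; // bytes really in use
        System.out.println("trim:   " + shouldTrim(used, max, 0.75));
        System.out.println("resume: " + canResume(used, pinned, max, 0.5, 0.3));
    }
}
```

In the example, used memory alone (90%) would keep the worker paused, while the pinned check (10%) lets it resume; trimming is still requested because used memory is above the pause ratio.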

FMX · 2025-01-03T05:43:07Z

Don't forget to update the default value of this config "celeborn.worker.directMemoryRatioToResume" to 0.3.

RexXiong · 2025-01-13T06:25:06Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

+      allocatedMemory = memoryUsage;
+    }
+    // trigger resume
+    // CELEBORN-1792: resume should use pinnedDirectMemory instead of usedDirectMemory


Although we needn't change to the pause state, it would be better to call trim when Netty's used direct memory is above pausePushDataThreshold/pauseReplicateThreshold. WDYT?

RexXiong · 2025-01-13T06:26:12Z

common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala

@@ -3858,7 +3858,7 @@ object CelebornConf extends Logging {
      .doc("If direct memory usage is less than this limit, worker will resume.")
      .version("0.2.0")
      .doubleConf
-      .createWithDefault(0.7)
+      .createWithDefault(0.3)


Maybe we can add a new conf for pinnedMemoryToResume and keep the existing conf directMemoryRatioToResume.

RexXiong

BTW, can we do some testing for this?

RexXiong · 2025-01-18T04:17:40Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

+    if (pinnedMemoryCheckEnabled
+        && System.currentTimeMillis() >= pinnedMemoryNextCheckTime
+        && getAllocatedMemory() / (double) (maxDirectMemory) < pinnedMemoryResumeRatio) {
+      pinnedMemoryNextCheckTime += pinnedMemoryCheckInterval;


The computation of pinnedMemoryNextCheckTime seems incorrect. We can record System.currentTimeMillis() as the last check time, then use System.currentTimeMillis() - lastCheckTime >= pinnedMemoryCheckInterval to decide whether a resume check is needed.
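A minimal sketch of the suggested last-check-time form follows. The class name and the explicit `nowMs` parameter (used instead of calling `System.currentTimeMillis()` directly, so the behavior can be tested) are my own choices. The static helper illustrates one way the `next += interval` form can misbehave, assuming the deadline starts at 0 or has fallen far behind the clock: the time condition then keeps passing until the deadline catches up.

```java
/** Sketch of an interval gate based on the last check time; hypothetical names. */
final class IntervalCheckSketch {
    private final long intervalMs;
    private long lastCheckTimeMs = 0L;

    IntervalCheckSketch(long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        this.intervalMs = intervalMs;
    }

    /** Returns true at most once per interval and records nowMs as the last check time. */
    boolean shouldCheck(long nowMs) {
        if (nowMs - lastCheckTimeMs >= intervalMs) {
            lastCheckTimeMs = nowMs;
            return true;
        }
        return false;
    }

    /** Counts how often a "next += interval" gate passes at one fixed instant. */
    static int passesWithIncrementForm(long nowMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        long nextCheckTimeMs = 0L; // assumed initial value for this illustration
        int passes = 0;
        while (nowMs >= nextCheckTimeMs) {
            nextCheckTimeMs += intervalMs; // deadline only advances by one interval per pass
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        IntervalCheckSketch gate = new IntervalCheckSketch(10_000);
        System.out.println(gate.shouldCheck(100_000)); // first check passes
        System.out.println(gate.shouldCheck(100_005)); // too soon, gate stays closed
        System.out.println(passesWithIncrementForm(100_000, 10_000));
    }
}
```

Recording the current time on each successful check keeps the gate tied to the real clock, so a long gap between checks cannot cause a burst of back-to-back checks.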

FMX · 2025-01-21T03:02:14Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

@@ -93,6 +96,9 @@ public class MemoryManager {
  private long memoryFileStorageThreshold;
  private final LongAdder memoryFileStorageCounter = new LongAdder();
  private final StorageManager storageManager;
+  private boolean pinnedMemoryCheckEnabled;
+  private long pinnedMemoryCheckInterval;
+  private long pinnedMemoryLastCheckTime = 0L;


The default value for this is 0.

FMX · 2025-01-21T03:03:37Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

@@ -282,7 +292,7 @@ private MemoryManager(CelebornConf conf, StorageManager storageManager, Abstract
        Utils.bytesToString(readBufferThreshold),
        Utils.bytesToString(readBufferTarget),
        Utils.bytesToString(memoryFileStorageThreshold),
-        resumeRatio);
+        directMemoryResumeRatio);


You can add pinned memory resume ratio here. It is an important parameter for memory manager.

FMX · 2025-01-21T03:08:48Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

@@ -436,6 +445,16 @@ public long getMemoryUsage() {
    return getNettyUsedDirectMemory() + sortMemoryCounter.get();
  }

+  public long getAllocatedMemory() {


This method should be renamed to getPinnedMemory. The allocated memory is the netty memory counter.

FMX · 2025-01-21T03:11:37Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

@@ -93,6 +96,9 @@ public class MemoryManager {
  private long memoryFileStorageThreshold;
  private final LongAdder memoryFileStorageCounter = new LongAdder();
  private final StorageManager storageManager;
+  private boolean pinnedMemoryCheckEnabled;
+  private long pinnedMemoryCheckInterval;


To avoid calling the pinned memory counter too frequently, I think you can cache the last pinned memory value and refresh it periodically. You could also export the pinned memory value to the metrics.

Here is another PR introducing pinnedMemory metrics #3019

getPinnedMemory is not called very frequently. It is called once every pinnedMemoryCheckInterval. The default is 10 seconds.
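For reference, the caching idea raised in this thread could look roughly like the sketch below: a wrapper that re-reads an expensive counter at most once per interval. The class name `CachedGaugeSketch` and its API are hypothetical, the clock is passed in explicitly for testability, and the class is not thread-safe as written.

```java
import java.util.function.LongSupplier;

/** Sketch of a periodically refreshed cached counter; hypothetical, not thread-safe. */
final class CachedGaugeSketch {
    private final LongSupplier source;
    private final long refreshIntervalMs;
    private long lastRefreshMs;
    private long cached;
    private boolean initialized;

    CachedGaugeSketch(LongSupplier source, long refreshIntervalMs) {
        this.source = source;
        this.refreshIntervalMs = refreshIntervalMs;
    }

    /** Returns the cached value, re-reading the source only when the interval has elapsed. */
    long get(long nowMs) {
        if (!initialized || nowMs - lastRefreshMs >= refreshIntervalMs) {
            cached = source.getAsLong();
            lastRefreshMs = nowMs;
            initialized = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] reads = {0};
        // The supplier stands in for an expensive pinned-memory query.
        CachedGaugeSketch gauge = new CachedGaugeSketch(() -> ++reads[0], 10_000);
        gauge.get(0);
        gauge.get(5_000);
        gauge.get(10_000);
        System.out.println("source was read " + reads[0] + " times");
    }
}
```

As noted in the reply above, the pinned value is already read only once per pinnedMemoryCheckInterval, so such a cache matters mainly if other callers, for example a metrics gauge, start reading the value more often.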

RexXiong · 2025-01-21T07:47:34Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

        logger.debug("Trigger action: TRIM");
-        trimCounter += 1;
-        // force to append pause spent time even we are in pause state
+        trimAllListeners();
        if (trimCounter >= forceAppendPauseSpentTimeThreshold) {


`trimCounter += 1` was lost here.

RexXiong · 2025-01-21T07:47:42Z

worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java

-        memoryPressureListeners.forEach(
-            memoryPressureListener ->
-                memoryPressureListener.onPause(TransportModuleConstants.REPLICATE_MODULE));
+        logger.debug("Trigger action: TRIM");
        trimAllListeners();


RexXiong

LGTM, thanks @leixm

RexXiong · 2025-01-22T06:31:09Z

Thanks, merged to main (v0.6.0).

pan3793 mentioned this pull request Dec 31, 2024

[CELEBORN-1815] Support UnpooledByteBufAllocator #3043

Closed

leixm changed the title ~~[CELEBORN-1792] Congestion and MemoryManager should use pinnedDirectMemory instead of usedDirectMemory~~ [CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory Jan 12, 2025

[CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory

5fdc844

leixm force-pushed the CELEBORN-1792 branch from f3e1ce1 to 5fdc844 Compare January 12, 2025 14:34

leixm added 2 commits January 12, 2025 23:29

fix

e149e86

fix

4fa3612

RexXiong reviewed Jan 13, 2025

leixm added 7 commits January 13, 2025 17:30

fix

ea78183

fix

354515d

fix

eb0d635

fix

3567fb4

fix

dcd3596

fix

2d5c9b9

fix

91ea4d7

RexXiong reviewed Jan 18, 2025

fix

27f88d3

leixm force-pushed the CELEBORN-1792 branch from 838c823 to 27f88d3 Compare January 20, 2025 06:36

leixm added 5 commits January 20, 2025 20:12

fix

c1439cb

fix

135d5a8

fix

abacec6

fix

e8412f0

fix

366ff59

FMX reviewed Jan 21, 2025

RexXiong reviewed Jan 21, 2025

leixm added 2 commits January 21, 2025 20:40

fix

e7e7479

fix

e2de154

RexXiong approved these changes Jan 21, 2025

RexXiong closed this in 9131c1e Jan 22, 2025
