
[CELEBORN-894] Add checksum for shuffle data #2979

Open · wants to merge 2 commits into main

Conversation

jiang13021 (Contributor):

What changes were proposed in this pull request?

  1. Added a checksum mechanism to the shuffle data header.
  2. Implemented checksum verification while writing data in the worker.
  3. Integrated checksum validation during data reading in CelebornInputStreamImpl.
  4. Added PushDataHeaderUtils to manage the shuffle data header.

Why are the changes needed?

Transmitted data can occasionally be corrupted in flight, leading to hard-to-diagnose errors, so a checksum is needed to detect this. I only added a checksum to the header because, in a production environment, the data body is usually compressed and the compression codec already generates its own checksum. Therefore, when compression is enabled, a checksum over the header is enough.
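
For context, a rough sketch of the new header layout, with offsets inferred from the diff hunks quoted later in this thread (the real constants live in PushDataHeaderUtils and are relative to Platform.BYTE_ARRAY_OFFSET):

```java
// Illustrative layout only; not copied from PushDataHeaderUtils.
//
// old header (16 bytes): mapId | attemptId | batchId            | length
// new header (20 bytes): mapId | attemptId | batchId (flag bit) | length | checksum
public final class HeaderLayoutSketch {
  public static final int MAP_ID_OFFSET = 0;
  public static final int ATTEMPT_ID_OFFSET = 4;
  public static final int BATCH_ID_OFFSET = 8;   // highest bit flags "checksum present"
  public static final int LENGTH_OFFSET = 12;
  public static final int CHECKSUM_OFFSET = 16;  // written only by checksum-aware clients
  public static final int BATCH_HEADER_SIZE_WITHOUT_CHECKSUM = 16;
  public static final int BATCH_HEADER_SIZE = 20;
}
```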

Does this PR introduce any user-facing change?

No, this change is compatible.

How was this patch tested?

Unit test: org.apache.celeborn.service.deploy.cluster.PushDataWithChecksumSuite

@codenohup (Contributor) left a comment:

Hi, @jiang13021
Thanks for your contribution!
I have two questions, PTAL.

@@ -79,6 +76,7 @@ public class FlinkShuffleClientImpl extends ShuffleClientImpl {
private ConcurrentHashMap<String, TransportClient> currentClient =
JavaUtils.newConcurrentHashMap();
private long driverTimestamp;
private final int BATCH_HEADER_SIZE = 4 * 4;

Contributor:

Does this feature currently support only Spark? Will Flink be supported in the future?

jiang13021 (Author):

Yes. Currently, this feature does not support Flink, but I will submit another PR in the future to add support. The current changes should be compatible with Flink.

@@ -293,14 +293,15 @@ public void flush(boolean finalFlush, boolean fromEvict) throws IOException {
// read flush buffer to generate correct chunk offsets
// data header layout (mapId, attemptId, nextBatchId, length)
if (numBytes > chunkSize) {
ByteBuffer headerBuf = ByteBuffer.allocate(16);
ByteBuffer headerBuf = ByteBuffer.allocate(PushDataHeaderUtils.BATCH_HEADER_SIZE);

Contributor:

If a worker receives a data buffer from an old-version client, or from a client of an unsupported engine, will this modification yield accurate results?

jiang13021 (Author):

If a worker receives a data buffer from an old version client, we will check the checksum flag in the highest bit of batchId for compatibility. Similarly, if the data comes from an unsupported engine's client but follows the old version data header layout, it will also be compatible.
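
A minimal sketch of the compatibility check described here, assuming the mask constants quoted later in the thread (the real logic lives in PushDataHeaderUtils and the worker's PushDataHandler):

```java
public final class BatchIdFlagSketch {
  private static final int HIGHEST_1_BIT_FLAG_MASK = 0x80000000;
  private static final int POSITIVE_MASK = 0x7FFFFFFF;

  // Old clients always write a non-negative batchId, so the flag bit is never set for them.
  public static boolean hasChecksumFlag(int rawBatchId) {
    return (rawBatchId & HIGHEST_1_BIT_FLAG_MASK) != 0;
  }

  // Strip the flag bit to recover the original, non-negative batchId.
  public static int batchId(int rawBatchId) {
    return rawBatchId & POSITIVE_MASK;
  }
}
```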

@@ -79,6 +76,7 @@ public class FlinkShuffleClientImpl extends ShuffleClientImpl {
private ConcurrentHashMap<String, TransportClient> currentClient =
JavaUtils.newConcurrentHashMap();
private long driverTimestamp;
private final int BATCH_HEADER_SIZE = 4 * 4;

Contributor:

Ditto.

}

public static int getLength(byte[] data) {
return Platform.getInt(data, LENGTH_OFFSET) - 4;

Contributor:

This will cause errors if an old client reads from Celeborn workers with this feature.

Contributor:

+1, we should think about how to maintain compatibility with older versions.

jiang13021 (Author) commented Dec 9, 2024:

@FMX @zwangsheng Thank you for reviewing this. We will check the checksum flag in the highest bit of batchId for compatibility. I have updated the code here, please take another look.

public static final int LENGTH_OFFSET = Platform.BYTE_ARRAY_OFFSET + 12;
public static final int CHECKSUM_OFFSET = Platform.BYTE_ARRAY_OFFSET + 16;
public static final int POSITIVE_MASK = 0x7FFFFFFF;
public static final int HIGHEST_1_BIT_FLAG_MASK = 0x80000000;

Contributor:

I agree with the design decision that the batchId cannot be a negative number, but this class needs to be compatible with old clients. During a cluster upgrade, there will be a moment when old clients are talking to new servers.

jiang13021 (Author):

The worker will check the checksum flag in the highest bit of batchId for compatibility.
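
To illustrate the write side with the constants quoted above, here is a minimal, self-contained sketch (not the PR's code) of how a checksum-aware client could build the 20-byte header; ByteOrder.nativeOrder() is used to mimic Platform/Unsafe writes, and the exact masking in the real PushDataHeaderUtils may differ:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public final class HeaderWriteSketch {
  private static final int HIGHEST_1_BIT_FLAG_MASK = 0x80000000;

  public static byte[] buildHeader(int mapId, int attemptId, int batchId, int bodyLength) {
    ByteBuffer buf = ByteBuffer.allocate(20).order(ByteOrder.nativeOrder());
    buf.putInt(mapId);
    buf.putInt(attemptId);
    buf.putInt(batchId | HIGHEST_1_BIT_FLAG_MASK); // flag bit: "checksum present"
    buf.putInt(bodyLength);
    CRC32 crc32 = new CRC32();
    crc32.update(buf.array(), 0, 16);              // checksum covers the first 16 header bytes
    buf.putInt((int) crc32.getValue());
    return buf.array();
  }
}
```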

public static int computeHeaderChecksum32(byte[] data) {
assert data.length >= BATCH_HEADER_SIZE_WITHOUT_CHECKSUM;
CRC32 crc32 = new CRC32();
crc32.update(data, 0, BATCH_HEADER_SIZE_WITHOUT_CHECKSUM);

Contributor:

Although a CRC over 16 bytes is trivial to compute, a Celeborn worker can have an enormous number of push-data buffers to handle. I think it would be better to add a switch that lets users enable or disable this feature. Users who never hit this kind of issue can simply disable it to save CPU.

Contributor:

I have received reports from users that Celeborn workers consume too much CPU during peak hours. If this feature is on by default, it will surely make things worse.
You could make this a client-side config and pass it in the ReserveSlots request.

jiang13021 (Author):

OK, I will add a config this week.

jiang13021 (Author):

@FMX Hi, I have added a config to disable the header checksum, PTAL.

pan3793 (Member) commented Dec 18, 2024:

CRC32 is not the fastest choice. I haven't gone through the whole design, and I am not sure whether the algorithm can be made configurable like Spark's implementation:
apache/spark#47929 apache/spark#49258

Contributor:

@jiang13021 CRC32C seems to be an optimized variant; I think this can be done in a follow-up PR, WDYT?

Member:

@RexXiong The current design has no extra space to carry the algorithm type, so we should choose the most efficient and future-proof one.
BTW, I found that Kafka also uses CRC32C in its Message format v2.
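
For reference, java.util.zip has shipped CRC32C since JDK 9, and it is a drop-in replacement for CRC32 behind the Checksum interface; this is only a sketch of the standard API, not a change made in this PR:

```java
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public final class Crc32cSketch {
  // CRC32C benefits from hardware CRC instructions (SSE4.2 / ARMv8) via JIT intrinsics.
  public static int checksum32(byte[] data, int offset, int length) {
    Checksum crc = new CRC32C();
    crc.update(data, offset, length);
    return (int) crc.getValue();
  }
}
```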

mridulm (Contributor) commented Jan 6, 2025:

To answer @pan3793's query: there are a bunch of implementations that are pretty fast and still give good error detection. Adler32 is not as robust in terms of error detection IIRC; MurmurHash or xxHash are good candidates.

From my understanding, this is primarily just for the header and not the data, right? In that case, initialization overhead and per-call cost would be dimensions to consider; pure Java-based implementations might help in particular.
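
As a concrete illustration of the xxHash suggestion, lz4-java exposes both pure-Java and JNI-backed 32-bit xxHash implementations; whether it is already on the classpath is an assumption here, and the helper name and seed below are hypothetical:

```java
import net.jpountz.xxhash.XXHash32;
import net.jpountz.xxhash.XXHashFactory;

public final class XxHashSketch {
  private static final XXHash32 XXHASH32 = XXHashFactory.fastestInstance().hash32();
  private static final int SEED = 0x9747b28c; // arbitrary fixed seed, must match on both sides

  public static int headerChecksum32(byte[] header, int lengthWithoutChecksum) {
    return XXHASH32.hash(header, 0, lengthWithoutChecksum, SEED);
  }
}
```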

Member:

this is primarily just for the header and not the data, right?

The method name is misleading: it calculates the data checksum and stores the result in the header.

jiang13021 (Author):

@pan3793 @RexXiong Perhaps we could utilize the highest 4 bits of the checksum to indicate the algorithm type, reserving the remaining 28 bits for the checksum itself. This approach would give us the flexibility to select different algorithms. WDYT?
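
A sketch of that proposed encoding (hypothetical helper, not part of the PR): the top 4 bits carry an algorithm id and the low 28 bits carry the truncated checksum, at the cost of slightly weaker error detection.

```java
public final class ChecksumAlgoPacking {
  // Example ids only; real values would need to be fixed in the protocol.
  public static final int ALGO_CRC32 = 0x0;
  public static final int ALGO_CRC32C = 0x1;

  public static int pack(int algoId, int checksum32) {
    return (algoId << 28) | (checksum32 & 0x0FFFFFFF); // keep only the low 28 bits
  }

  public static int algoId(int packed) {
    return packed >>> 28;
  }

  public static int checksum28(int packed) {
    return packed & 0x0FFFFFFF;
  }
}
```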


@@ -1470,7 +1470,24 @@ class PushDataHandler(val workerSource: WorkerSource) extends BaseMessageHandler
shuffleKey: String,
index: Int): Unit = {
try {
fileWriter.write(body)
val header = new Array[Byte](PushDataHeaderUtils.BATCH_HEADER_SIZE_WITHOUT_CHECKSUM)

Contributor:

We could consider implementing a new method for verifying data checksums that throws a CelebornChecksumException when validation fails. This approach would make unit testing easier. Additionally, we can handle this exception within the catch block alongside other exceptions.
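
A minimal sketch of what such a verification helper could look like (CelebornChecksumException is the name suggested above; the method name and the 16-byte coverage are assumptions):

```java
import java.util.zip.CRC32;

public final class ChecksumVerifierSketch {
  public static final class CelebornChecksumException extends java.io.IOException {
    public CelebornChecksumException(String message) {
      super(message);
    }
  }

  // Recompute the CRC over the first 16 header bytes and compare it with the stored value.
  public static void verifyHeaderChecksum(byte[] header, int storedChecksum)
      throws CelebornChecksumException {
    CRC32 crc32 = new CRC32();
    crc32.update(header, 0, 16);
    int expected = (int) crc32.getValue();
    if (expected != storedChecksum) {
      throw new CelebornChecksumException(
          "Header checksum mismatch: expected " + expected + ", got " + storedChecksum);
    }
  }
}
```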

jiang13021 (Author):

Fixed, thank you.

jiang13021 force-pushed the celeborn_894_checksum branch from 2eef55a to f719b49 on December 18, 2024, 11:03

@FMX (Contributor) left a comment:

LGTM.

@RexXiong (Contributor) left a comment:

LGTM

@mridulm (Contributor) left a comment:

Unfortunately I have not been able to do a very close read of the PR, but trying to understand: is this not an incompatible change that will require all clients and the server side to be upgraded in lock step?

The reason for the query: when @otterc and I were initially designing TLS/authn, we discussed in depth whether we should have a 'declared features' mechanism for clients and servers to advertise and negotiate what is supported, given the inflexibility we had observed with other shuffle protocols. We did not add it at the time for Celeborn given the complexity (and lack of immediate need), but I am wondering whether we now have more use cases that need it.

@pan3793 (Member) commented Jan 6, 2025:

@mridulm The current implementation allows an old client to communicate with a newer server: it uses the highest bit of an int in the chunk header (which represents the data size, so the highest bit is always zero) to indicate whether the checksum is enabled, so old clients always behave as if the checksum were disabled. This design is backward compatible and efficient, but it also means we cannot ship checksum algorithm info, so choosing an efficient algorithm is important.

@RexXiong (Contributor) commented Jan 6, 2025:

After a deep check of Flink's implementation, I found that the header/data written by Flink is in big-endian format, while the header written by Spark using Platform (Unsafe) may be in little-endian format. This could lead to compatibility issues when the server reads the header the same way for both. Possible solutions include selecting the appropriate way to read the data based on the PartitionType, or adding a new field, byteOrder, to explicitly inform the server of the endianness used when writing. Personally, I prefer the first solution. Also cc @reswqa @codenohup
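
To make the concern concrete, here is a small sketch (with placeholder PartitionType values, not Celeborn's real enum) of how the server could pick the byte order per partition type when reading a header field:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class HeaderByteOrderSketch {
  enum PartitionType { SPARK_STYLE, FLINK_STYLE } // placeholders for the real PartitionType

  // Flink writes the header big-endian; Platform/Unsafe-based writers use the JVM's
  // native order (little-endian on typical x86/ARM servers).
  static int readLength(byte[] header, PartitionType type) {
    ByteOrder order =
        type == PartitionType.FLINK_STYLE ? ByteOrder.BIG_ENDIAN : ByteOrder.nativeOrder();
    return ByteBuffer.wrap(header).order(order).getInt(12); // length field at offset 12
  }
}
```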

@reswqa (Member) commented Jan 7, 2025:

Possible solutions include selecting the appropriate way to read the data based on the PartitionType or adding a new field, byteOrder, to explicitly inform the server of the endianness used in writing. Personally I prefer the first solution.

Reading the data based on PartitionType sounds good to me.

@mridulm (Contributor) commented Jan 7, 2025:

@pan3793 Thanks for clarifying, I will go over the PR later this week; I have been a bit swamped lately.
My general comment would be: nothing stands the test of time :-) So having the ability to evolve the checksum algorithm would help long term!
