
Result mismatch with vanilla spark in hash function with decimal input #1294

Closed · wForget opened this issue Jan 16, 2025 · 2 comments · Fixed by #1325
Labels: bug

Comments

wForget (Member) commented Jan 16, 2025

Describe the bug

Result mismatch with vanilla Spark in the hash function with decimal input.

Steps to reproduce

Case 1: precision <= 18

When the precision is less than or equal to 18, Spark uses the unscaled long value of the decimal to calculate the hash.

https://github.com/apache/spark/blob/d12bb23e49928ae36b57ac5fbeaa492dadf7d28f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L560-L567
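
For reference, the vanilla values can be computed directly against Spark's interpreted hash functions from that same file. A minimal spark-shell sketch (Murmur3HashFunction and XxHash64Function are the internal objects behind hash and xxhash64):

import org.apache.spark.sql.catalyst.expressions.{Murmur3HashFunction, XxHash64Function}
import org.apache.spark.sql.types.{Decimal, DecimalType}

val dt = DecimalType(18, 2)
val d  = Decimal(BigDecimal("1.23"), 18, 2)

d.toUnscaledLong                      // 123L: the value Spark actually hashes on this path
Murmur3HashFunction.hash(d, dt, 42L)  // low 32 bits match hash(c1)
XxHash64Function.hash(d, dt, 42L)     // matches xxhash64(c1)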

test:

sql(s"create table t1(c1 decimal(18, 2)) using parquet")
sql(s"insert into t1 values(1.23), (-1.23), (0.0), (null)")
checkSparkAnswerAndOperator(s"select c1, hash(c1), xxhash64(c1) from t1 order by c1")

result:

!== Correct Answer - 4 ==                                    == Spark Answer - 4 ==
 struct<c1:decimal(18,2),hash(c1):int,xxhash64(c1):bigint>   struct<c1:decimal(18,2),hash(c1):int,xxhash64(c1):bigint>
 [null,42,42]                                                [null,42,42]
![-1.23,1993430267,-2573642654262063070]                     [-1.23,1891444694,1620882180304245068]
![0.00,-1670924195,-5252525462095825812]                     [0.00,-300363099,-3510225485978208079]
![1.23,-46242105,-3178482946328430151]                       [1.23,93356439,7126273675055069777]

Case 2: precision > 18

When the precision is greater than 18, Spark uses the bytes of the unscaled value to calculate the hash, but the results are still inconsistent.
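
Note that BigInteger.toByteArray returns the minimal big-endian two's-complement encoding of the unscaled value, so any native implementation that hashes a different byte layout (for example a fixed-width or little-endian buffer) will diverge from Spark. A small sketch of what Spark feeds to the hash in this branch:

new java.math.BigDecimal("1.23").unscaledValue().toByteArray   // Array(123): a single byte, no padding
new java.math.BigDecimal("-1.23").unscaledValue().toByteArray  // Array(-123)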

test:

sql(s"create table t2(c1 decimal(20, 2)) using parquet")
sql(s"insert into t2 values(1.23), (-1.23), (0.0), (null)")
checkSparkAnswerAndOperator(s"select c1, hash(c1), xxhash64(c1) from t2 order by c1")

result:

!== Correct Answer - 4 ==                                    == Spark Answer - 4 ==
 struct<c1:decimal(20,2),hash(c1):int,xxhash64(c1):bigint>   struct<c1:decimal(20,2),hash(c1):int,xxhash64(c1):bigint>
 [null,42,42]                                                [null,42,42]
![-1.23,1285747285,3359051433950298639]                      [-1.23,1891444694,1620882180304245068]
![0.00,-783713497,-8959994473701255385]                      [0.00,-300363099,-3510225485978208079]
![1.23,-1536290115,1973765065063347049]                      [1.23,93356439,7126273675055069777]

Expected behavior

No response

Additional context

No response

andygrove (Member) commented:

I found more differences through fuzz testing. The query is SELECT a, xxhash64(a) for a Decimal(36,18) input:

!== Correct Answer - 100 ==                      == Spark Answer - 100 ==
 struct<c8:decimal(36,18),xxhash64(c8):bigint>   struct<c8:decimal(36,18),xxhash64(c8):bigint>
![0.083846911329507723,-7768183730488634936]     [0.083846911329507723,7253766342851001266]
![0.084855353024058666,-1313703402582021561]     [0.084855353024058666,-509567910895807844]
![0.087554197347941343,6265272281954134702]      [0.087554197347941343,-6169978753261918548]
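
For a deterministic reproduction of this case, the test pattern from the description can be extended to a wider decimal (the table name t3 and the inserted values here are taken from the fuzz output above, not from an existing test):

sql(s"create table t3(a decimal(36, 18)) using parquet")
sql(s"insert into t3 values(0.083846911329507723), (0.084855353024058666), (null)")
checkSparkAnswerAndOperator(s"select a, xxhash64(a) from t3 order by a")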

andygrove (Member) commented:

We should consider falling back to Spark for decimal types that we know cause incorrect results.
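
A guard along these lines could gate the conversion; the helper name and where it hooks into the serde are illustrative assumptions, not the actual fix that landed in #1325:

import org.apache.spark.sql.types.{DataType, DecimalType}

// Hypothetical check: only convert hash/xxhash64 when the input type is known
// to hash identically to Spark; otherwise fall back to Spark.
def hashInputSupported(dt: DataType): Boolean = dt match {
  case _: DecimalType => false // fall back for all decimal inputs until the mismatch is fixed
  case _              => true
}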
