
Result mismatch with vanilla spark in hash function with decimal input #1294

Closed · wForget opened this issue Jan 16, 2025 · 2 comments · Fixed by #1325
Labels: bug

Comments

wForget (Member) commented Jan 16, 2025

Describe the bug

Result mismatch with vanilla Spark in the hash function with decimal input.

Steps to reproduce

Case 1: precision <= 18

When the precision is less than or equal to 18, Spark uses the unscaled long value of the decimal to calculate the hash.

https://github.com/apache/spark/blob/d12bb23e49928ae36b57ac5fbeaa492dadf7d28f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L560-L567
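
For reference, the vanilla values can be computed directly against Spark's interpreted hash functions from that same file. A minimal spark-shell sketch (Murmur3HashFunction and XxHash64Function are the internal objects behind hash and xxhash64):

import org.apache.spark.sql.catalyst.expressions.{Murmur3HashFunction, XxHash64Function}
import org.apache.spark.sql.types.{Decimal, DecimalType}

val dt = DecimalType(18, 2)
val d  = Decimal(BigDecimal("1.23"), 18, 2)

d.toUnscaledLong                      // 123L: the value Spark actually hashes on this path
Murmur3HashFunction.hash(d, dt, 42L)  // low 32 bits match hash(c1)
XxHash64Function.hash(d, dt, 42L)     // matches xxhash64(c1)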

test:

sql(s"create table t1(c1 decimal(18, 2)) using parquet")
sql(s"insert into t1 values(1.23), (-1.23), (0.0), (null)")
checkSparkAnswerAndOperator(s"select c1, hash(c1), xxhash64(c1) from t1 order by c1")

result:

!== Correct Answer - 4 ==                                    == Spark Answer - 4 ==
 struct<c1:decimal(18,2),hash(c1):int,xxhash64(c1):bigint>   struct<c1:decimal(18,2),hash(c1):int,xxhash64(c1):bigint>
 [null,42,42]                                                [null,42,42]
![-1.23,1993430267,-2573642654262063070]                     [-1.23,1891444694,1620882180304245068]
![0.00,-1670924195,-5252525462095825812]                     [0.00,-300363099,-3510225485978208079]
![1.23,-46242105,-3178482946328430151]                       [1.23,93356439,7126273675055069777]

Case 2: precision > 18

When the precision is greater than 18, Spark uses the bytes of the unscaled value to calculate the hash, but the results are still inconsistent.
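
Note that BigInteger.toByteArray returns the minimal big-endian two's-complement encoding of the unscaled value, so any native implementation that hashes a different byte layout (for example a fixed-width or little-endian buffer) will diverge from Spark. A small sketch of what Spark feeds to the hash in this branch:

new java.math.BigDecimal("1.23").unscaledValue().toByteArray   // Array(123): a single byte, no padding
new java.math.BigDecimal("-1.23").unscaledValue().toByteArray  // Array(-123)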

test:

sql(s"create table t2(c1 decimal(20, 2)) using parquet")
sql(s"insert into t2 values(1.23), (-1.23), (0.0), (null)")
checkSparkAnswerAndOperator(s"select c1, hash(c1), xxhash64(c1) from t2 order by c1")

result:

!== Correct Answer - 4 ==                                    == Spark Answer - 4 ==
 struct<c1:decimal(20,2),hash(c1):int,xxhash64(c1):bigint>   struct<c1:decimal(20,2),hash(c1):int,xxhash64(c1):bigint>
 [null,42,42]                                                [null,42,42]
![-1.23,1285747285,3359051433950298639]                      [-1.23,1891444694,1620882180304245068]
![0.00,-783713497,-8959994473701255385]                      [0.00,-300363099,-3510225485978208079]
![1.23,-1536290115,1973765065063347049]                      [1.23,93356439,7126273675055069777]

Expected behavior

No response

Additional context

No response

andygrove (Member) commented:

I found more differences through fuzz testing. The query is SELECT a, xxhash64(a) for a Decimal(36,18) input:

!== Correct Answer - 100 ==                      == Spark Answer - 100 ==
 struct<c8:decimal(36,18),xxhash64(c8):bigint>   struct<c8:decimal(36,18),xxhash64(c8):bigint>
![0.083846911329507723,-7768183730488634936]     [0.083846911329507723,7253766342851001266]
![0.084855353024058666,-1313703402582021561]     [0.084855353024058666,-509567910895807844]
![0.087554197347941343,6265272281954134702]      [0.087554197347941343,-6169978753261918548]
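
For a deterministic reproduction of this case, the test pattern from the description can be extended to a wider decimal (the table name t3 and the inserted values here are taken from the fuzz output above, not from an existing test):

sql(s"create table t3(a decimal(36, 18)) using parquet")
sql(s"insert into t3 values(0.083846911329507723), (0.084855353024058666), (null)")
checkSparkAnswerAndOperator(s"select a, xxhash64(a) from t3 order by a")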

andygrove (Member) commented:

We should consider falling back to Spark for decimal types that we know cause incorrect results.
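
A guard along these lines could gate the conversion; the helper name and where it hooks into the serde are illustrative assumptions, not the actual fix that landed in #1325:

import org.apache.spark.sql.types.{DataType, DecimalType}

// Hypothetical check: only convert hash/xxhash64 when the input type is known
// to hash identically to Spark; otherwise fall back to Spark.
def hashInputSupported(dt: DataType): Boolean = dt match {
  case _: DecimalType => false // fall back for all decimal inputs until the mismatch is fixed
  case _              => true
}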
