-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not rename struct fields when coercing types in CASE
#14384
Conversation
CASE
Some(Struct(fields.into())) | ||
} | ||
_ => None, | ||
} | ||
} | ||
|
||
/// returns the result of coercing two fields to a common type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is the fix. I suspect this doesn't matter in most cases as the result of a comparison is boolean
so the names of the intermediate fields don't matter.
However, in a CASE
(and union for that matter) the names do matter, so this PR preserves them
{ 'xxx': 'e' } -- column4: Struct with field xxx (no second field) | ||
); | ||
|
||
# Note field names are in different orders |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test documents some strange behavior that I don't really know if is a bug or not
DataFusion treats the field orders as important, and when coercing two structs, it does it in field order (not by name)
So in this case { 'foo': 'a', 'xxx': 'b' }
, and { 'xxx': 'c', 'foo': 'd' }
are coerced to compare the values a
/c
and b
/ d
(not the values a
/d
and b
/c
This PR doesn't change this behavior, but I felt that was an important behavior to document in tests, so I did so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- While working on Fix regression list Type Coercion List with inner type struct which has large/view types #14385 I found that this behaior behavior (using struct position rather than field name) is inconsistent
Specifically the basic case comparison goes through "comparison_coercion" (fixed in this PR) which uses the list field order
But when coercing list elements, that goes through type_union_resolution
, (link) which matches fields by name (link)
@jayzhan211 is the eventual plan to unify thee code paths? Or maybe the case coercion should use type_union_resolution
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We follow how DuckDB works, I think both field order and name should be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear this PR doesn't change the existing behavior of comparison_coercion other than not renaming field names
@jayzhan211 as our resident coercion expert can you possibly review this bug? It is similar to #14385 |
/// returns the result of coercing two fields to a common type | ||
fn coerce_fields(common_type: DataType, lhs: &FieldRef, rhs: &FieldRef) -> FieldRef { | ||
let is_nullable = lhs.is_nullable() || rhs.is_nullable(); | ||
let name = lhs.name(); // pick the name from the left field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the name is not the same, return error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that would be better -- can I do it in a follow on PR?
In my opinion this PR makes things better (doesn't lose field names) even though it is not perfect and I would like to fix the regression since 43
What I think we should do is switch comparison_coercion
to use type_union_coercion
eventually and then implement the correct coercion there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think type union resolution is the correct on for CASE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed
to track the issue and plan to merge this one in when ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not renaming the fields looks like a good improvement to me. Thanks @alamb.
Thank you @andygrove What I plan to do is to file a follow on ticket to address @jayzhan211 's (very valid) concern that the coercion should error if the fields have different names and merge this PR in tomorrow |
* Fix field name during struct equality coercion * fix bug * Add more tests * Update tests per apache#14396
Which issue does this PR close?
Case
coercion of Structs loses field names #14383Rationale for this change
When coercing fields for a struct, we shouldn't rename the fields to
c0
,c1
, etc. See #14383What changes are included in this PR?
Preserve the Struct field names
Are these changes tested?
Yes, sqllogictests are added
Are there any user-facing changes?
coerced struct field names do not revever to
c0
, `c1, etc