Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not rename struct fields when coercing types in CASE #14384

Merged
merged 5 commits into from
Feb 1, 2025

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 31, 2025

Which issue does this PR close?

Rationale for this change

When coercing fields for a struct, we shouldn't rename the fields to c0, c1, etc. See #14383

What changes are included in this PR?

Preserve the Struct field names

Are these changes tested?

Yes, sqllogictests are added

Are there any user-facing changes?

coerced struct field names do not revever to c0, `c1, etc

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 31, 2025
@alamb alamb changed the title Alamb/fix field coercion Do not rename struct fields when coercing types in CASE Jan 31, 2025
Some(Struct(fields.into()))
}
_ => None,
}
}

/// returns the result of coercing two fields to a common type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the fix. I suspect this doesn't matter in most cases as the result of a comparison is boolean so the names of the intermediate fields don't matter.

However, in a CASE (and union for that matter) the names do matter, so this PR preserves them

{ 'xxx': 'e' } -- column4: Struct with field xxx (no second field)
);

# Note field names are in different orders
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test documents some strange behavior that I don't really know if is a bug or not

DataFusion treats the field orders as important, and when coercing two structs, it does it in field order (not by name)

So in this case { 'foo': 'a', 'xxx': 'b' }, and { 'xxx': 'c', 'foo': 'd' } are coerced to compare the values a /c and b / d (not the values a/d and b/c

This PR doesn't change this behavior, but I felt that was an important behavior to document in tests, so I did so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically the basic case comparison goes through "comparison_coercion" (fixed in this PR) which uses the list field order

But when coercing list elements, that goes through type_union_resolution, (link) which matches fields by name (link)

@jayzhan211 is the eventual plan to unify thee code paths? Or maybe the case coercion should use type_union_resolution 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We follow how DuckDB works, I think both field order and name should be the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear this PR doesn't change the existing behavior of comparison_coercion other than not renaming field names

@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2025

@jayzhan211 as our resident coercion expert can you possibly review this bug?

It is similar to #14385

/// returns the result of coercing two fields to a common type
fn coerce_fields(common_type: DataType, lhs: &FieldRef, rhs: &FieldRef) -> FieldRef {
let is_nullable = lhs.is_nullable() || rhs.is_nullable();
let name = lhs.name(); // pick the name from the left field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the name is not the same, return error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would be better -- can I do it in a follow on PR?

In my opinion this PR makes things better (doesn't lose field names) even though it is not perfect and I would like to fix the regression since 43

What I think we should do is switch comparison_coercion to use type_union_coercion eventually and then implement the correct coercion there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think type union resolution is the correct on for CASE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed

to track the issue and plan to merge this one in when ready

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not renaming the fields looks like a good improvement to me. Thanks @alamb.

@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2025

Not renaming the fields looks like a good improvement to me. Thanks @alamb.

Thank you @andygrove

What I plan to do is to file a follow on ticket to address @jayzhan211 's (very valid) concern that the coercion should error if the fields have different names and merge this PR in tomorrow

@alamb alamb merged commit c33cb85 into apache:main Feb 1, 2025
25 checks passed
@alamb alamb deleted the alamb/fix_field_coercion branch February 1, 2025 12:10
cj-zhukov pushed a commit to cj-zhukov/datafusion that referenced this pull request Feb 3, 2025
* Fix field name during struct equality coercion

* fix bug

* Add more tests

* Update tests per apache#14396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Case coercion of Structs loses field names
3 participants