chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) #1332

andygrove · 2025-01-23T17:07:14Z

Which issue does this PR close?

Part of #1304

Rationale for this change

Use latest DF in preparation for upgrading to DF 45.

What changes are included in this PR?

Bump DF version
Copy over latest FilterExec and re-apply Comet-specific changes (we could stop doing this if we just unpack all dictionaries in the scan)
Remove uses of Field::new_dict
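
For readers unfamiliar with the term, "unpacking" a dictionary means replacing each key with the value it refers to, so that downstream operators never see dictionary-encoded columns. The sketch below illustrates the idea with a toy column type. `DictColumn` and `unpack` are hypothetical names invented for this illustration; they are not Arrow's `DictionaryArray` or Comet's `ScanExec` code.

```rust
/// A toy dictionary-encoded string column: `keys[i]` indexes into `values`.
/// `None` models a null slot. This is NOT Arrow's `DictionaryArray`.
pub struct DictColumn {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
}

/// Materialize the column so that every row holds its own value.
/// Returns `None` if any key points outside the dictionary.
pub fn unpack(col: &DictColumn) -> Option<Vec<Option<String>>> {
    col.keys
        .iter()
        .map(|key| match key {
            // Null rows stay null after unpacking.
            None => Some(None),
            // Valid keys are replaced by a copy of the dictionary value;
            // an out-of-range key makes the whole result `None`.
            Some(k) => col.values.get(*k).map(|v| Some(v.clone())),
        })
        .collect()
}
```

The trade-off is the usual one: unpacked columns are simpler to handle but repeat each value once per row.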

How are these changes tested?

native/core/src/execution/operators/filter.rs

andygrove · 2025-01-23T18:25:05Z

native/core/src/execution/operators/scan.rs

@@ -304,11 +304,7 @@ fn scan_schema(input_batch: &InputBatch, data_types: &[DataType]) -> SchemaRef {
                .map(|(idx, c)| {
                    let datatype = ScanExec::unpack_dictionary_type(c.data_type());
                    // We don't use the field name. Put a placeholder.
-                    if matches!(datatype, DataType::Dictionary(_, _)) {
-                        Field::new_dict(format!("col_{}", idx), datatype, true, idx as i64, false)


It is no longer possible to re-use dictionary id across fields. I am unsure of the impact here. Perhaps @viirya will know.

What do you mean by re-using a dictionary id across fields? A dictionary id should be unique per field.

Oh, corrected it. If two fields have the same dictionary, they may use the same dictionary id.

I saw you resolved this. Is it not an issue now?

Is it possible to do some micro benchmarks on this?

There is no performance impact from this change. The original code stored a dictionary id in the metadata and the new code does not. This dictionary id is actually not used at all in FFI. It was used in Arrow IPC but is no longer used as of Arrow 54.0.0 because that feature is now removed and Arrow IPC manages its own dictionary ids. We do not use Arrow IPC now because we are using our own proprietary encoding.

I will go ahead and run another TPC-H benchmark and post results here today, just to confirm there are no regressions.

Just making sure, will this work even if the enableFastEncoding option is disabled?

Fresh benchmark results:

Using fast encoding = 332 seconds (our published time is 331, and I do see small variance on each run)
Using Arrow IPC = 334 seconds

native/core/src/execution/shuffle/row.rs

andygrove · 2025-01-23T20:38:44Z

Tests are failing:

 Cause: org.apache.comet.CometNativeException: slice index starts at 18446744072774451440 but ends at 32720
[info]         at comet::errors::init::{{closure}}(__internal__:0)
[info]         at std::panicking::rust_panic_with_hook(__internal__:0)
[info]         at std::panicking::begin_panic_handler::{{closure}}(__internal__:0)
[info]         at std::sys::backtrace::__rust_end_short_backtrace(__internal__:0)
[info]         at rust_begin_unwind(__internal__:0)
[info]         at core::panicking::panic_fmt(__internal__:0)
[info]         at core::slice::index::slice_index_order_fail(__internal__:0)
[info]         at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
[info]         at arrow_data::transform::MutableArrayData::extend(__internal__:0)
[info]         at arrow_select::concat::concat_fallback(__internal__:0)
[info]         at arrow_select::concat::concat(__internal__:0)
[info]         at arrow_select::concat::concat_batches(__internal__:0)
[info]         at datafusion_physical_plan::sorts::sort::ExternalSorter::in_mem_sort_stream(__internal__:0)
[info]         at <datafusion_physical_plan::stream::RecordBatchStreamAdapter<S> as futures_core::stream::Stream>::poll_next(__internal__:0)
[info]         at <datafusion_physical_plan::joins::sort_merge_join::SortMergeJoinStream as futures_core::stream::Stream>::poll_next(__internal__:0)
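
A note on the start index in the panic message: 18446744072774451440 is exactly 2^64 - 935100176, which is what a negative signed value looks like once it is reinterpreted as an unsigned 64-bit index. The thread does not identify the root cause, so the sketch below only demonstrates the arithmetic. The function names are hypothetical, and the idea that the value began as a signed offset is an assumption.

```rust
/// Model an unchecked cast of a signed 32-bit offset to a 64-bit index.
/// Negative inputs sign-extend into very large unsigned values.
/// (`u64` stands in for `usize` on a 64-bit target.)
pub fn offset_as_index(offset: i32) -> u64 {
    offset as i64 as u64
}

/// Recover the signed value hiding behind a suspiciously large index.
pub fn index_as_signed(index: u64) -> i64 {
    index as i64
}
```

Applied to the number in the panic message, `index_as_signed` yields -935100176, a value that fits in an `i32`.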

andygrove · 2025-01-23T21:51:48Z

Another failure:

org.apache.comet.CometNativeException: Cast error: Failed to convert 1140852704 to temporal for Date32

codecov-commenter · 2025-01-24T21:17:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 39.09%. Comparing base (f09f8af) to head (6f8d2fa).
Report is 10 commits behind head on main.

Additional details and impacted files

@@              Coverage Diff              @@
##               main    #1332       +/-   ##
=============================================
- Coverage     56.12%   39.09%   -17.04%     
- Complexity      976     2065     +1089     
=============================================
  Files           119      260      +141     
  Lines         11743    60237    +48494     
  Branches       2251    12817    +10566     
=============================================
+ Hits           6591    23548    +16957     
- Misses         4012    32205    +28193     
- Partials       1140     4484     +3344

andygrove · 2025-01-28T17:25:31Z

native/spark-expr/src/conversion_funcs/cast.rs

+        DataType::Null => {
+            matches!(to_type, DataType::List(_))
+        }


This was needed to fix test failures in CometArrayExpressionSuite.

Will that be rolled back later?

No, we will likely need to add support for more casts from Null to other types in the future, especially once we add support for more complex types.
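
As a rough illustration of the shape of that check, here is a self-contained sketch. `SimpleType` and `is_cast_supported` are hypothetical stand-ins invented for this example, not Arrow's `DataType` or the actual function in cast.rs, and the fallback rule for non-Null inputs is a placeholder.

```rust
/// Simplified stand-in for a data type enum; NOT Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleType {
    Null,
    Int32,
    Utf8,
    List(Box<SimpleType>),
}

/// Hypothetical support check mirroring the shape of the added match arm.
pub fn is_cast_supported(from: &SimpleType, to: &SimpleType) -> bool {
    match from {
        // A Null-typed input may currently only be cast to a list type.
        SimpleType::Null => matches!(to, SimpleType::List(_)),
        // Placeholder rule for this sketch only: identical types are castable.
        other => other == to,
    }
}
```

Supporting more casts from Null later would mean widening the pattern in the `Null` arm, for example with additional `|` alternatives inside `matches!`.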

comphead · 2025-01-28T17:34:28Z

native/Cargo.toml

@@ -33,21 +33,21 @@ edition = "2021"
 rust-version = "1.79"


This can probably be updated to 1.80, as in DataFusion, or to 1.81; the PR has already been created.

comphead · 2025-01-28T17:37:54Z

native/core/src/execution/operators/filter.rs

@@ -62,6 +65,8 @@ pub struct FilterExec {
    default_selectivity: u8,
    /// Properties equivalence properties, partitioning, etc.
    cache: PlanProperties,
+    /// The projection indices of the columns in the output schema of join
+    projection: Option<Vec<usize>>,


Is this projection part of the migration? If so, the migration is quite complicated.

We currently maintain a copy of DataFusion's FilterExec with one small change, so I copied over that latest to keep in sync and then re-applied the change that we need (for memory safety because of the way we re-use buffers).

comphead · 2025-01-28T17:42:21Z

I think the PR is good in general, but what concerns me is how much code was added just to do the migration. I'm wondering whether there were breaking changes in DF or Arrow, since it looks like we agreed to avoid breaking public API changes in DF.

andygrove · 2025-01-28T18:57:18Z

I think the PR is good in general, but what concerns me is how much code was added just to do the migration. I'm wondering whether there were breaking changes in DF or Arrow, since it looks like we agreed to avoid breaking public API changes in DF.

The biggest issue was apache/datafusion#14277

comphead

LGTM, thanks @andygrove. There is still a download issue for one of the test suites.

viirya · 2025-01-28T20:17:20Z

native/core/src/execution/operators/filter.rs

+                    let projected_columns = projection
+                        .iter()
+                        .map(|i| Arc::clone(batch.column(*i)))
+                        .collect();
+                    let projected_batch =
+                        RecordBatch::try_new(Arc::clone(output_schema), projected_columns)?;


Normally projection should come after predicate, no?

Oh, you already have the predicate filter result in filter_array. Then it doesn't matter.
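
The point in this exchange can be checked with a small model: once the boolean mask has been computed from the full batch, selecting columns before or after applying the mask gives the same rows. The sketch below uses plain vectors instead of Arrow arrays; `apply_mask`, `project_then_filter`, and `filter_then_project` are hypothetical helpers written for this illustration, not code from FilterExec.

```rust
/// Keep only the rows of `column` whose mask entry is true.
pub fn apply_mask(column: &[i64], mask: &[bool]) -> Vec<i64> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

/// Select the projected columns first, then drop the masked-out rows.
pub fn project_then_filter(
    columns: &[Vec<i64>],
    projection: &[usize],
    mask: &[bool],
) -> Vec<Vec<i64>> {
    projection
        .iter()
        .map(|i| apply_mask(&columns[*i], mask))
        .collect()
}

/// Drop the masked-out rows from every column, then select the projection.
pub fn filter_then_project(
    columns: &[Vec<i64>],
    projection: &[usize],
    mask: &[bool],
) -> Vec<Vec<i64>> {
    let filtered: Vec<Vec<i64>> = columns.iter().map(|c| apply_mask(c, mask)).collect();
    projection.iter().map(|i| filtered[*i].clone()).collect()
}
```

Projecting first avoids filtering columns that are about to be discarded, which is the only practical difference between the two orderings in this model.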

viirya

Looks good to me, but I have a question about reusing the dictionary id.

andygrove · 2025-01-28T20:26:14Z

Looks good to me, but I have a question about reusing the dictionary id.

For more context on this, see the discussion in apache/arrow-rs#5981

bump DataFusion to rev 5592834

7d08f7f

andygrove commented Jan 23, 2025

native/core/src/execution/operators/filter.rs (outdated)

andygrove marked this pull request as draft January 23, 2025 17:17

andygrove commented Jan 23, 2025

native/core/src/execution/shuffle/row.rs

andygrove added 3 commits January 23, 2025 11:35

update FilterExec

735ff00

fix regression

dee1b65

fmt

30270e7

revert change

b11d2e7

andygrove added 2 commits January 23, 2025 15:07

fix regression

e081fed

fix

f2c1409

andygrove mentioned this pull request Jan 24, 2025

Test DataFusion 45.0.0 with Comet apache/datafusion#14274

use temp datafusion branch

6400197

andygrove changed the title ~~chore: Bump DataFusion to rev 5592834~~ chore: Prepare for DataFusion 45 Jan 24, 2025

andygrove added 2 commits January 24, 2025 14:25

try removing Field::new_with_dict

0ee0b4c

clippy

eb7ddef

andygrove mentioned this pull request Jan 26, 2025

This Week in Comet (Jan 26) #1342

andygrove added 9 commits January 27, 2025 09:16

coerce types for CASE expressions

2f65a31

upmerge

cf6b3c7

save experiments

48c6419

test passes

3de7be1

remove debug logging

e8f222d

remove debug logging

6fff6af

remove debug logging

18a2c59

revert test change

c86dd39

clippy

453bf5b

andygrove commented Jan 28, 2025

revert whitespace change

65acbb5

andygrove changed the title ~~chore: Prepare for DataFusion 45~~ chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) Jan 28, 2025

andygrove marked this pull request as ready for review January 28, 2025 17:26

andygrove requested review from viirya, comphead and kazuyukitanimura January 28, 2025 17:27

comphead reviewed Jan 28, 2025

Set rust-version to 1.81

6f8d2fa

comphead reviewed Jan 28, 2025

comphead approved these changes Jan 28, 2025

viirya reviewed Jan 28, 2025

viirya approved these changes Jan 28, 2025
