chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) #1332
base: main
Conversation
@@ -304,11 +304,7 @@ fn scan_schema(input_batch: &InputBatch, data_types: &[DataType]) -> SchemaRef {
             .map(|(idx, c)| {
                 let datatype = ScanExec::unpack_dictionary_type(c.data_type());
                 // We don't use the field name. Put a placeholder.
-                if matches!(datatype, DataType::Dictionary(_, _)) {
-                    Field::new_dict(format!("col_{}", idx), datatype, true, idx as i64, false)
It is no longer possible to re-use dictionary id across fields. I am unsure of the impact here. Perhaps @viirya will know.
What do you mean by re-using dictionary id across fields? Dictionary ids should be unique per field.
Oh, I've corrected it. If two fields have the same dictionary, they may use the same dictionary id.
I saw you resolved this. Is it not an issue now?
Is it possible to do some micro benchmarks on this?
There is no performance impact from this change. The original code stored a dictionary id in the metadata and the new code does not. This dictionary id is actually not used at all in FFI. It was used in Arrow IPC but is no longer used as of Arrow 54.0.0 because that feature is now removed and Arrow IPC manages its own dictionary ids. We do not use Arrow IPC now because we are using our own proprietary encoding.
I will go ahead and run another TPC-H benchmark and post results here today, just to confirm there are no regressions.
Just making sure: will this work even if the enableFastEncoding option is disabled?
Fresh benchmark results:
Using fast encoding = 332 seconds (our published time is 331, and I do see small variance on each run)
Using Arrow IPC = 334 seconds
Tests are failing:
Another failure:
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##               main    #1332       +/-   ##
=============================================
- Coverage     56.12%   39.09%   -17.04%
- Complexity      976     2065     +1089
=============================================
  Files           119      260      +141
  Lines         11743    60237    +48494
  Branches       2251    12817    +10566
=============================================
+ Hits           6591    23548    +16957
- Misses         4012    32205    +28193
- Partials       1140     4484     +3344

☔ View full report in Codecov by Sentry.
DataType::Null => {
    matches!(to_type, DataType::List(_))
}
This was needed to fix test failures in CometArrayExpressionSuite.
Will that be rolled back later?
No, we will likely need to add support for more casts from Null to other types in the future, especially once we add support for more complex types.
native/Cargo.toml
Outdated
@@ -33,21 +33,21 @@ edition = "2021"
rust-version = "1.79"
This can probably be updated to 1.80, as in DataFusion, or to 1.81; the PR is already created.
@@ -62,6 +65,8 @@ pub struct FilterExec {
     default_selectivity: u8,
     /// Properties equivalence properties, partitioning, etc.
     cache: PlanProperties,
+    /// The projection indices of the columns in the output schema of join
+    projection: Option<Vec<usize>>,
Is this projection part of the migration? If so, the migration is quite complicated...
We currently maintain a copy of DataFusion's FilterExec with one small change, so I copied over the latest version to keep it in sync and then re-applied the change that we need (for memory safety, because of the way we re-use buffers).
I think the PR is good in general, but what concerns me is the large amount of code added just to do the migration. I'm wondering whether there were breaking changes in DF or Arrow, since it looks like we agreed to avoid breaking public API changes in DF.
The biggest issue was apache/datafusion#14277
LGTM, thanks @andygrove. There is still a download issue for one of the test suites.
let projected_columns = projection
    .iter()
    .map(|i| Arc::clone(batch.column(*i)))
    .collect();
let projected_batch =
    RecordBatch::try_new(Arc::clone(output_schema), projected_columns)?;
Normally projection should come after predicate, no?
Oh, you already have the predicate filter result filter_array. Then it doesn't matter.
Looks good to me, but I have a question about re-using dictionary ids.
For more context on this, see the discussion in apache/arrow-rs#5981
Which issue does this PR close?
Part of #1304
Rationale for this change
Use the latest DataFusion revision in preparation for upgrading to DF 45.
What changes are included in this PR?
Field::new_dict
How are these changes tested?