GH-44629: [C++][Acero] Use implicit_ordering for asof_join rather than require_sequenced_output
#44616
base: main
All looks good. I think all source nodes need an option to assert implicit ordering, but I am not sure whether that is in the scope of this PR.
Is there currently any use for require_sequenced_output?
@westonpace are you happy with this?
Could you open a new issue instead of reusing a closed issue?
@kou done
@lidavidm what do you think?
I don't know if it is in the scope of this PR, but implicit ordering in the other source nodes also needs some consideration. My thoughts:
- SourceNode: ordering depends on the generator and can be asserted by an option, which seems reasonable.
- TableSourceNode: currently has implicit ordering, which seems reasonable since tables do have an order. There is probably little to no benefit in dropping it.
- SchemaSourceNode, RecordBatchSourceNode: currently have implicit ordering, but the ordering depends on the generator/iterator and should not default to either. I think it should be an option like in SourceNode.
- RecordBatchReaderSourceNode: currently has no implicit ordering. Same case as SchemaSourceNode.
I think either all source nodes should have an ordering option, or the ordering should originate from arrow::AsyncGenerator?
Other than that it looks good. Thanks for fixing my mistakes.
@westonpace since you approved #44083, can you take a look at this fix?
@zanmato1984 can you have a look at this cleanup of an earlier PR, please?
Force-pushed from b436a8a to afb0783.
OK, I'll take a look.
This LGTM. Just several questions. Thank you.
bool require_sequenced_output;
bool implicit_ordering;
I see these two fields are documented in the Python counterpart. Could you document them in C++ too, so that the code is self-explanatory?
std::shared_ptr<Dataset> dataset;
std::shared_ptr<ScanOptions> scan_options;
bool require_sequenced_output;
bool implicit_ordering;
IIUC, require_sequenced_output is handled by the scanner by collapsing the underlying generator to single-threaded, whereas implicit_ordering is delegated to the generated source node?
require_sequenced_output : bool, default False
Assert implicit ordering on data.
Batches are yielded sequentially, like single-threaded
Is this still needed?
Rationale for this change
The changes in #44083 (GH-41706) unnecessarily sequence the batches retrieved from the scanner, whereas asof_join only requires the batches to carry an index that reflects the implicit input order.
What changes are included in this PR?
Setting implicit_ordering causes existing code to set the batch index, which is then available to the asof_join node to sequence the batches in input order. This replaces some of the #44083 changes. Some code introduced by #44083 turns out not to be required and has therefore been reverted.
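To illustrate the idea of sequencing by batch index, here is a minimal standalone sketch. It does not use Arrow at all: IndexedBatch, Sequencer, and EmitInOrder are hypothetical names invented for this example, and the code is not the actual asof_join implementation. It only assumes that each batch carries an index reflecting its position in the implicit input order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch: only the fields needed for the sketch.
struct IndexedBatch {
  int64_t index;        // position in the implicit input order
  std::string payload;  // placeholder for the actual column data
};

// Hypothetical reorder buffer: accepts batches in any arrival order and
// releases them strictly by ascending index, starting at 0.
class Sequencer {
 public:
  // Buffers `batch`, then returns every batch that is now ready to emit.
  std::vector<IndexedBatch> Push(IndexedBatch batch) {
    const int64_t index = batch.index;
    pending_.emplace(index, std::move(batch));
    std::vector<IndexedBatch> ready;
    // Drain the buffer for as long as the next expected index is present.
    for (auto it = pending_.find(next_); it != pending_.end();
         it = pending_.find(next_)) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_;
    }
    return ready;
  }

 private:
  int64_t next_ = 0;                         // next index to emit
  std::map<int64_t, IndexedBatch> pending_;  // batches that arrived early
};

// Feeds batches in the given arrival order and returns the payloads in the
// order in which the sequencer emitted them.
std::vector<std::string> EmitInOrder(std::vector<IndexedBatch> arrivals) {
  Sequencer sequencer;
  std::vector<std::string> out;
  for (auto& batch : arrivals) {
    for (auto& ready : sequencer.Push(std::move(batch))) {
      out.push_back(std::move(ready.payload));
    }
  }
  return out;
}
```

With arrivals indexed 2, 0, 1, the sketch emits the payloads in index order 0, 1, 2. The point is that the producer can stay parallel: only the consumer that needs ordered input pays for buffering early batches.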
Are these changes tested?
Existing unit tests still pass.
Are there any user-facing changes?
Reverts some breaking user-facing changes done by #44083.
implicit_ordering for asof_join rather than require_sequenced_output #44629