Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44629: [C++][Acero] Use implicit_ordering for asof_join rather than require_sequenced_output #44616

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Nov 1, 2024

Rationale for this change

Changes in #44083 (GH-41706) unnecessarily sequences batches retrieved from scanner where it only requires the batches to provide index according to implicit input order.

What changes are included in this PR?

Setting implicit_ordering causes existing code to set batch index, which is then available to the asof_join node to sequence the batches int input order. This replaces some of #44083 changes.

Some code introduced by #44083 turns out to not be required and has therefore been reverted.

Are these changes tested?

Existing unit tests still pass.

Are there any user-facing changes?

Reverts some breaking user-facing changes done by #44083.

Copy link

@gitmodimo gitmodimo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good. I think all source nodes need an option to assert Implicit ordering but I am not sure if it is in the scope of this PR.
Is there currently any use for require_sequenced_output?

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Nov 1, 2024

@westonpace are you happy with this?

@kou
Copy link
Member

kou commented Nov 2, 2024

Could you open a new issue instead of reusing closed issue?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 2, 2024
@EnricoMi EnricoMi changed the title GH-41706: [C++][Acero][Followup] Use implicit_ordering for asof_join rather than require_sequenced_output GH-44629: [C++][Acero] Use implicit_ordering for asof_join rather than require_sequenced_output Nov 3, 2024
Copy link

github-actions bot commented Nov 3, 2024

⚠️ GitHub issue #44629 has been automatically assigned in GitHub to PR creator.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Nov 3, 2024

@kou done

@rok rok requested a review from felipecrv November 7, 2024 12:15
@EnricoMi
Copy link
Contributor Author

@lidavidm what do you think?

@rok rok requested a review from gitmodimo November 15, 2024 11:34
Copy link

@gitmodimo gitmodimo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it in the scope of this PR, but Implicit Ordering in other source nodes also need some consideration. My thoughts:
SourceNode - ordering depends on generator and can be asserted by option - seems reasonable
TableSourceNode - currently has implicit ordering which seems reasonable since tables do have an order. Probably little to none benefit in dropping the it.
SchemaSourceNode,RecordBatchSourceNode - currently has implicit ordering, but ordering depends on generator/iterator and should not default to either. I think it should be on option like in SourceNode
RecordBatchReaderSourceNode - currently has no implicit ordering - Same case as SchemaSourceNode

I think either all source nodes should have ordering option or the ordering should originate from arrow::AsyncGenerator ?

Other than that it looks good. Thanks for fixing my mistakes.

@lidavidm
Copy link
Member

Sorry @EnricoMi I wasn't involved in most of the Acero development so I can't really speak to the workings here. @bkietz would probably know better

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Dec 4, 2024

@westonpace since you approved #44083, can you take a look at this fix?

@EnricoMi
Copy link
Contributor Author

@zanmato1984 can you have a look at this cleanup of an earlier PR, please?

@zanmato1984
Copy link
Contributor

OK, I'll take a look.

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Just several questions. Thank you.

Comment on lines 575 to +576
bool require_sequenced_output;
bool implicit_ordering;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there are documents of these two fields in the python counterpart. Could you add them in C++ too so this can be self-explaining?


std::shared_ptr<Dataset> dataset;
std::shared_ptr<ScanOptions> scan_options;
bool require_sequenced_output;
bool implicit_ordering;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, require_sequenced_output is handled by the scanner by collapsing the underlying generator to single-threaded, whereas implicit_ordering is delegated to the generated source node?

Comment on lines 4115 to +4116
require_sequenced_output : bool, default False
Assert implicit ordering on data.
Batches are yielded sequentially, like single-threaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants