Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Implement sorting #13

Merged
merged 9 commits into from
Jan 30, 2025
Merged

API: Implement sorting #13

merged 9 commits into from
Jan 30, 2025

Conversation

seberg
Copy link
Contributor

@seberg seberg commented Jan 16, 2025

This implements sorting based on the same approach as cupynumeric. There are a few smaller rough edges, but it is getting there.

Things that are still needed:

  • Support of column and sort order (internals understand it).
  • Some smaller cleanups maybe (e.g. shuffle is used from repartition_by_hash.cu).
  • We need to broadcast splits to all, this abuses shuffle (not sure this is a big deal).
  • Some simple C-tests are needed
  • Python tests are good about checking the splits (i.e. the window). But need to add randomized data and tests for the missing API (column/null order).

Ready enough for review I think. The possible bigger follow-up now is to maybe move shuffle and use a proper broadcast for the split candidate sharing.

…acked columns

This is so that we can trivially re-use it e.g. for sorts (I think it
is also just a slightly more obvious organization of the code, even if
the early cleanup is a bit awkward).

Signed-off-by: Sebastian Berg <[email protected]>
This adds basic sorting.  The only missing piece right now is supporting
column and null order arguments.

Balancing was manually checked and seems fine, the tests seem pretty good
about checking the splitting but some random datasets should be thrown
into the mix for sure.

Signed-off-by: Sebastian Berg <[email protected]>
@seberg seberg requested a review from mfoerste4 January 17, 2025 10:29
@seberg seberg marked this pull request as ready for review January 17, 2025 16:47
Copy link

@mfoerste4 mfoerste4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the split-selection part. I don't quite understand why we select one additional split point (the element at pos 0). This does not break the algo but seems non intuitive when we only select one less value after merging. Could you explain the reasoning behing this?

cpp/src/sort.cpp Outdated Show resolved Hide resolved
std::vector<cudf::size_type> split_values;
cudf::size_type split_offset = 0;

if (include_start) { split_values.push_back(0); }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used for? This essentially adds another split point leading to nsplis+1 segments?

cpp/src/sort.cpp Show resolved Hide resolved
cpp/src/sort.cpp Show resolved Hide resolved
@seberg
Copy link
Contributor Author

seberg commented Jan 29, 2025

@madsbk if you have time at some point, would be nice if you can have a look. I should duplicate the shuffling to have a proper broadcast-shuffle, but I think that is the only issue.

I discussed correctness with @mfoerste4, so I think review is just about code and not logic (like how we split exactly).

Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I only have some minor suggestions

@@ -71,7 +71,7 @@ include(cmake/Modules/ConfigureCUDA.cmake) # set other CUDA compilation flags
# * dependencies ----------------------------------------------------------------------------------

rapids_find_package(
legate REQUIRED Legion LegionRuntime
legate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
legate
legate REQUIRED

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't Legate still required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert this. I am honestly not sure whether it is, but this was an (unsuccessful) try to fix the linking issue with current legate.

Comment on lines +26 to +30
LogicalTable sort(const LogicalTable& tbl,
const std::vector<std::string>& keys,
const std::vector<cudf::order>& column_order,
const std::vector<cudf::null_order>& null_precedence,
bool stable = false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good with a docstring of sort

@seberg
Copy link
Contributor Author

seberg commented Jan 30, 2025

I opened gh-19 for the task of cleaning up the shuffling a bit. Maybe easier as a follow-up (but don't hesitate to kick me to do it or any other follow-ups here!).
I think we might re-organize the whole shuffle code soon anyway, though.

@seberg seberg merged commit 277fc58 into rapidsai:main Jan 30, 2025
10 checks passed
@seberg seberg deleted the impl-sort branch January 30, 2025 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants