API: Implement sorting #13

seberg · 2025-01-16T22:37:47Z

This implements sorting based on the same approach as cupynumeric. There are a few smaller rough edges, but it is getting there.

Things that are still needed:

Support for column order and null order (the internals already understand them).
Possibly some smaller cleanups (e.g. shuffle is used from repartition_by_hash.cu).
We need to broadcast the splits to all; this currently abuses shuffle (not sure this is a big deal).
Some simple C-tests are needed.
The Python tests are good at checking the splits (i.e. the window), but randomized data and tests for the missing API (column/null order) still need to be added.

Ready enough for review, I think. The larger possible follow-up is to move shuffle and to use a proper broadcast for sharing the split candidates.
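For readers unfamiliar with the approach, the splitting and shuffling mentioned above can be modelled in a few lines of plain Python. This is only a sketch, assuming the sort is a sample sort (local sort, shared split candidates, shuffle, merge). The name `sample_sort`, the candidate spacing, and the list-of-lists "ranks" are illustrative assumptions and are not taken from this PR or from cupynumeric.

```python
import bisect
import heapq


def sample_sort(partitions):
    """Toy model: globally sort a list of per-rank partitions (lists of values)."""
    nranks = len(partitions)

    # 1. Every rank sorts its local data.
    local = [sorted(part) for part in partitions]

    # 2. Every rank proposes evenly spaced split candidates from its sorted data.
    candidates = []
    for part in local:
        if part:
            step = max(len(part) // nranks, 1)
            candidates.extend(part[::step])
    if not candidates:
        return local  # nothing to sort anywhere

    # 3. The candidates are shared with all ranks, so every rank can pick the
    #    same nranks - 1 split values from the sorted candidate list.
    candidates.sort()
    splits = [candidates[(i * len(candidates)) // nranks] for i in range(1, nranks)]

    # 4. Shuffle: every rank cuts its sorted data at the split values and
    #    sends piece number `dest` to rank `dest`.
    received = [[] for _ in range(nranks)]
    for part in local:
        bounds = [0] + [bisect.bisect_left(part, s) for s in splits] + [len(part)]
        for dest in range(nranks):
            received[dest].append(part[bounds[dest]:bounds[dest + 1]])

    # 5. Every rank merges the already sorted pieces it received.
    return [list(heapq.merge(*pieces)) for pieces in received]
```

Concatenating the returned partitions in rank order gives the globally sorted data. How evenly the rows are balanced across ranks depends on how well the candidates represent the data, which is why tests with randomized input are useful.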

…acked columns This is so that we can trivially re-use it e.g. for sorts (I think it is also just a slightly more obvious organization of the code, even if the early cleanup is a bit awkward). Signed-off-by: Sebastian Berg <[email protected]>

This adds basic sorting. The only missing piece right now is support for the column and null order arguments. Balancing was checked manually and seems fine; the tests seem pretty good at checking the splitting, but some random datasets should definitely be added to the mix. Signed-off-by: Sebastian Berg <[email protected]>

mfoerste4

I only reviewed the split-selection part. I don't quite understand why we select one additional split point (the element at position 0). This does not break the algorithm, but it seems unintuitive given that we select one value fewer after merging. Could you explain the reasoning behind this?

mfoerste4 · 2025-01-20T12:05:14Z

cpp/src/sort.cpp

+  std::vector<cudf::size_type> split_values;
+  cudf::size_type split_offset = 0;
+
+  if (include_start) { split_values.push_back(0); }


What is this used for? It essentially adds another split point, leading to nsplits+1 segments?
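To make the question concrete, here is a small Python sketch of how evenly spaced split positions could be computed with and without the start position. `split_positions` and its `nparts` parameter are hypothetical names chosen for illustration. Only the `include_start` flag and the initial `push_back(0)` come from the diff above, and the actual spacing logic in sort.cpp may differ.

```python
def split_positions(nrows, nparts, include_start=False):
    """Row offsets that cut `nrows` rows into `nparts` near-equal parts.

    Hypothetical helper: when `include_start` is set, offset 0 is added
    in front of the interior cut positions.
    """
    base, extra = divmod(nrows, nparts)
    positions = [0] if include_start else []
    offset = 0
    for i in range(nparts - 1):
        # The first `extra` parts get one additional row each.
        offset += base + (1 if i < extra else 0)
        positions.append(offset)
    return positions
```

For 10 rows and 4 parts this yields the interior positions 3, 6 and 8. With `include_start=True` the list also starts with 0, so exactly one more position is selected, which is the extra point the question refers to.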

seberg · 2025-01-29T13:03:59Z

@madsbk if you have time at some point, it would be nice if you could have a look. I should duplicate the shuffling code to get a proper broadcast-shuffle, but I think that is the only issue.

I discussed correctness with @mfoerste4, so I think review is just about code and not logic (like how we split exactly).
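The idea of broadcasting by way of a shuffle can be pictured with a toy in-memory model. This is a sketch and not the legate or legate-dataframe API: `shuffle` and `broadcast_via_shuffle` are hypothetical helpers, and ranks are modelled as list indices.

```python
def shuffle(outgoing):
    """All-to-all exchange: outgoing[src][dest] is what `src` sends to `dest`.

    Returns, for each destination rank, everything it received in source order.
    """
    nranks = len(outgoing)
    return [
        [item for src in range(nranks) for item in outgoing[src][dest]]
        for dest in range(nranks)
    ]


def broadcast_via_shuffle(per_rank_values):
    """Emulate a broadcast by addressing a full copy to every destination."""
    nranks = len(per_rank_values)
    outgoing = [[list(values) for _ in range(nranks)] for values in per_rank_values]
    return shuffle(outgoing)
```

In this model every rank ends up with the same concatenated list of candidates. A dedicated broadcast would express the same intent directly instead of routing it through the general all-to-all path.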

madsbk

Nice work! I only have some minor suggestions.

madsbk · 2025-01-29T13:30:03Z

cpp/CMakeLists.txt

@@ -71,7 +71,7 @@ include(cmake/Modules/ConfigureCUDA.cmake) # set other CUDA compilation flags
 # * dependencies ----------------------------------------------------------------------------------

 rapids_find_package(
-  legate REQUIRED Legion LegionRuntime
+  legate


Suggested change

-  legate
+  legate REQUIRED

Isn't Legate still required?

seberg · 2025-01-29

I'll revert this. I am honestly not sure whether it is, but this was an (unsuccessful) attempt to fix the linking issue with current legate.

madsbk · 2025-01-29T13:31:00Z

cpp/include/legate_dataframe/sort.hpp

+LogicalTable sort(const LogicalTable& tbl,
+                  const std::vector<std::string>& keys,
+                  const std::vector<cudf::order>& column_order,
+                  const std::vector<cudf::null_order>& null_precedence,
+                  bool stable = false);


It would be good to have a docstring for sort.
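As a rough illustration of what such a docstring would need to describe, the following plain-Python model sorts rows by several keys with a per-key direction and null placement. It is a sketch under stated assumptions, not the legate-dataframe or cudf API: `sort_rows`, `ascending` and `nulls_first` are hypothetical names. Here `nulls_first` means the position in the output regardless of direction, which is a choice made for this sketch, so the exact meaning of `cudf::null_order` should be checked against the cudf documentation.

```python
def sort_rows(rows, keys, ascending, nulls_first):
    """Sort a list of dict rows by `keys`, with one flag per key.

    Toy model: None stands in for a null value. Sorting from the last key
    to the first with stable passes yields a lexicographic multi-key sort.
    """
    out = list(rows)
    for key, asc, first in reversed(list(zip(keys, ascending, nulls_first))):
        # Partitioning preserves the current order, so each pass stays stable.
        nulls = [r for r in out if r[key] is None]
        non_null = [r for r in out if r[key] is not None]
        # list.sort is stable, also with reverse=True.
        non_null.sort(key=lambda r: r[key], reverse=not asc)
        out = nulls + non_null if first else non_null + nulls
    return out
```

Because every pass is stable, rows that compare equal on all keys keep their input order, which corresponds to what a `stable` flag would be expected to guarantee.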

seberg · 2025-01-30T11:46:00Z

I opened gh-19 for the task of cleaning up the shuffling a bit. Maybe easier as a follow-up (but don't hesitate to kick me to do it or any other follow-ups here!).
I think we might re-organize the whole shuffle code soon anyway, though.

seberg added 2 commits January 16, 2025 23:32

seberg requested a review from mfoerste4 January 17, 2025 10:29

seberg added 2 commits January 17, 2025 16:24

ENH: Forward column order and null precedence and add more python tests

123afb2

Signed-off-by: Sebastian Berg <[email protected]>

TST: Add basic C-side tests for sorting

23145ae

Signed-off-by: Sebastian Berg <[email protected]>

seberg marked this pull request as ready for review January 17, 2025 16:47

DOC: Add sort to the documentation

64ab9c3

Signed-off-by: Sebastian Berg <[email protected]>

seberg force-pushed the impl-sort branch from 3637845 to 64ab9c3 Compare January 17, 2025 16:57

mfoerste4 approved these changes Jan 20, 2025

TST: Cython doesn't raise TypeError, so just comment out for now

dd1570a

Signed-off-by: Sebastian Berg <[email protected]>

seberg force-pushed the impl-sort branch from e1a1dd3 to c3b32ed Compare January 20, 2025 17:55

MAINT: Remove explicit legion dependency (and small fix based on review)

4b78fe3

Signed-off-by: Sebastian Berg <[email protected]>

seberg force-pushed the impl-sort branch from c3b32ed to 4b78fe3 Compare January 20, 2025 17:56

Merge branch 'main' into impl-sort

c0d2442

madsbk requested changes Jan 29, 2025

Address review comments (and make local pre-commit happy)

00e8e9d

Signed-off-by: Sebastian Berg <[email protected]>

madsbk approved these changes Jan 30, 2025

seberg merged commit 277fc58 into rapidsai:main Jan 30, 2025
10 checks passed

seberg deleted the impl-sort branch January 30, 2025 11:46
