-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Implement sorting #13
Conversation
…acked columns This is so that we can trivially re-use it e.g. for sorts (I think it is also just a slightly more obvious organization of the code, even if the early cleanup is a bit awkward). Signed-off-by: Sebastian Berg <[email protected]>
This adds basic sorting. The only missing piece right now is supporting column and null order arguments. Balancing was manually checked and seems fine, the tests seem pretty good about checking the splitting but some random datasets should be thrown into the mix for sure. Signed-off-by: Sebastian Berg <[email protected]>
Signed-off-by: Sebastian Berg <[email protected]>
Signed-off-by: Sebastian Berg <[email protected]>
Signed-off-by: Sebastian Berg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed the split-selection part. I don't quite understand why we select one additional split point (the element at pos 0). This does not break the algo but seems non intuitive when we only select one less value after merging. Could you explain the reasoning behing this?
std::vector<cudf::size_type> split_values; | ||
cudf::size_type split_offset = 0; | ||
|
||
if (include_start) { split_values.push_back(0); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this used for? This essentially adds another split point leading to nsplis+1
segments?
Signed-off-by: Sebastian Berg <[email protected]>
Signed-off-by: Sebastian Berg <[email protected]>
@madsbk if you have time at some point, would be nice if you can have a look. I should duplicate the shuffling to have a proper broadcast-shuffle, but I think that is the only issue. I discussed correctness with @mfoerste4, so I think review is just about code and not logic (like how we split exactly). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I only have some minor suggestions
cpp/CMakeLists.txt
Outdated
@@ -71,7 +71,7 @@ include(cmake/Modules/ConfigureCUDA.cmake) # set other CUDA compilation flags | |||
# * dependencies ---------------------------------------------------------------------------------- | |||
|
|||
rapids_find_package( | |||
legate REQUIRED Legion LegionRuntime | |||
legate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
legate | |
legate REQUIRED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't Legate still required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert this. I am honestly not sure whether it is, but this was an (unsuccessful) try to fix the linking issue with current legate.
LogicalTable sort(const LogicalTable& tbl, | ||
const std::vector<std::string>& keys, | ||
const std::vector<cudf::order>& column_order, | ||
const std::vector<cudf::null_order>& null_precedence, | ||
bool stable = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good with a docstring of sort
Signed-off-by: Sebastian Berg <[email protected]>
I opened gh-19 for the task of cleaning up the shuffling a bit. Maybe easier as a follow-up (but don't hesitate to kick me to do it or any other follow-ups here!). |
This implements sorting based on the same approach as cupynumeric. There are a few smaller rough edges, but it is getting there.
Things that are still needed:
shuffle
is used fromrepartition_by_hash.cu
).shuffle
(not sure this is a big deal).Ready enough for review I think. The possible bigger follow-up now is to maybe move
shuffle
and use a properbroadcast
for the split candidate sharing.