Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All except simplification: Option 1 #1036

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

Jolanrensen
Copy link
Collaborator

@Jolanrensen Jolanrensen commented Jan 28, 2025

Fixes #761

This PR rewrites except to disallow pointing to nested columns. The columns selection DSL should only be allowed to "select" columns, not restructure it; we have move, remove, etc. for that.

When a user now calls, say

df.select { all() except userData.age }

they will receive an IllegalArgumentException:
Cannot exclude the nested columns '[userData/age]' from the column set '[userData/{age, name}, otherCol1, otherCol2]' with the except-functions. Use the `DataFrame<*>.remove { }` operation instead.
because userData.age is not directly selected by all(), only its parent, userdata, is.

TODO:

  • Documentation website

Some internal struggle: It seems like except was used internally in a couple places and it relied on the old behavior.

Rewriting this now pushes me to this horrible notation:

Before:

df.groupBy { allExcept(columnSetWithNestedCols) }...

After:

df.groupBy {
    (this as DataFrame<*>)
        .remove(columnSetWithNestedCols)
        .columns()
        .toColumnSet()
}...

(something like df.remove {}.groupBy {} wouldn't not work, because you still want the columns inside the group)

Having to write something like this makes me rethink what we should do about this. Clearly there's a need for selecting the top-level columns of a version of the current dataframe where a (set of) column(s) and their parents are removed. Though, it seems out of the scope for except (and the 0.15 version of except even breaks in some cases). We could:

  • Fix the bugs in except and keep it as is
  • introduce a new function that has the old behavior with a good name
  • Create a way to access the original DF in the selection dsl and promote the idea that you can use functions like remove on it. Probably also needs a df.toColumnSet() function.
  • Merge this as is and just block except with nested columns
    • (We only use the concept 3 times after all)
  • Something else

@zaleslaw @koperagen @nikitinas @belovrv WDYT?

…only exclude columns present in the column set (or the column set created by calling all(Cols)() in the all(Cols)Except() cases). If we detect a user is trying to except a nested column, we throw a helpful exception.
…age pointing users to use String or remove {}
@Jolanrensen Jolanrensen added bug Something isn't working enhancement New feature or request labels Jan 28, 2025
@Jolanrensen Jolanrensen added this to the 0.16.0 milestone Jan 28, 2025
@zaleslaw
Copy link
Collaborator

I believe that we should split our internallyUsedExcept and newOneExcept in our publicAPI

@Jolanrensen
Copy link
Collaborator Author

As discussed on Slack, we might actually need the ability to select column groups with some column removed, especially in groupBy, so it may be best to revert this, fix the NPE, and revert/stabilize #1030. Let's revisit it later

@Jolanrensen Jolanrensen changed the title All except simplification All except simplification: : Option 1 Jan 29, 2025
@Jolanrensen Jolanrensen changed the title All except simplification: : Option 1 All except simplification: Option 1 Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NullPointerException using except to exclude double nested column
2 participants