Grammar for plan constraints #20
base: main
Changes from 3 commits
cac050a
5ea27d8
3fcc611
ff71b40
I would call this feature Plan Hints rather than Plan Constraints. The optimizer may not follow the hints (though we should issue a warning if we don't)
Sure, will rename
Agreed on providing a warning if some/all of the plan hint is not applied
why is this needed?
As mentioned in an earlier comment, we plan to support join hints for outer joins as well, and this provides support for that. I am OK removing it for now, since v1 will not support using this hint.
but the type of join is determined by the query, so even if we support join hints, we shouldn't need to specify if it's an inner or outer join.
If/when we support outer join reordering, we would need these to specify hints for outer-join reordering. Of course, if the planner never evaluates these hint choices, these hints are ignored and there is zero chance of a correctness issue w.r.t what type of join to choose
just call this PARTITIONED. JOIN_DIST_PARTITIONED is clunky to write and to remember
I think there's some confusion: the user will use `[P]` and `[R]` to specify the constraint (see L53-54 for the lexer rules), not `JOIN_DIST_PARTITIONED` or `JOIN_DIST_REPLICATED`.
ah, thanks. missed this. Still would recommend using broadcast/ B vs. R because that's more consistent with the user facing language we use elsewhere (e.g. session property join_distribution_type=BROADCAST).
Why can't we just use replicated or partitioned syntax? IMO the brackets and `join ((a c [R]) b [P])` might get complicated quickly.
Can you elaborate on what this would look like for the cited example ?
Like Dremio or Databricks.
The current way is also fine, but the suggestion is to use explicit names for broadcast/partition etc., since the goal of this is to allow users to explicitly set the types.
I'm not opposed to this; my only gripe is that the plan constraint string can get quite verbose, e.g. for a 4-table join order -
join (d ((a c [REPLICATED]) b [PARTITIONED]) [PARTITIONED] )
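To make the nesting in these examples concrete, the following is a minimal Python sketch of a recursive-descent parser for strings shaped like the ones quoted in this thread. It is not the grammar from this PR: the `Table` and `Join` classes, the `parse_hint` function, and the rule that a group is `( node node [DIST]? )` are assumptions inferred only from the two examples above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


# Hypothetical tree types; these names do not come from the PR.
@dataclass
class Table:
    name: str


@dataclass
class Join:
    left: "Node"
    right: "Node"
    distribution: Optional[str]  # e.g. "P", "R", "PARTITIONED", or None


Node = Union[Table, Join]

# A token is a bracket/parenthesis or an identifier, with optional leading whitespace.
TOKEN = re.compile(r"\s*([()\[\]]|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(text: str) -> List[str]:
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_node(tokens: List[str], pos: int) -> Tuple[Node, int]:
    if pos >= len(tokens):
        raise ValueError("unexpected end of hint")
    tok = tokens[pos]
    if tok != "(":
        if tok in (")", "[", "]"):
            raise ValueError(f"unexpected token {tok!r}")
        return Table(tok), pos + 1
    # Assumed rule: '(' node node ('[' DIST ']')? ')'
    left, pos = _parse_node(tokens, pos + 1)
    right, pos = _parse_node(tokens, pos)
    distribution = None
    if pos < len(tokens) and tokens[pos] == "[":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "]":
            raise ValueError("malformed distribution marker")
        distribution = tokens[pos + 1]
        pos += 3
    if pos >= len(tokens) or tokens[pos] != ")":
        raise ValueError("expected ')'")
    return Join(left, right, distribution), pos + 1


def parse_hint(text: str) -> Node:
    tokens = tokenize(text)
    if not tokens or tokens[0].lower() != "join":
        raise ValueError("hint must start with 'join'")
    node, pos = _parse_node(tokens, 1)
    if pos != len(tokens):
        raise ValueError("trailing tokens after join hint")
    return node
```

Under these assumptions, `parse_hint("join ((a c [R]) b [P])")` yields a `Join` whose left child is itself a `Join` of `a` and `c` tagged `R`, with `b` on the right and `P` as the outer distribution. The longer spelling (`[REPLICATED]`, `[PARTITIONED]`) parses to the same tree shape, which suggests the verbosity concern is about what users type rather than what the planner sees.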
Do we need this fine-grained kind of control? I like the simplicity of just specifying the partitioning of the table, and every other example I found seems to take that approach:
- `BROADCAST(T1)`
- `PQ_DISTRIBUTE(s BROADCAST, NONE)` (which I also find very clunky)
- `BROADCAST` inline

You can specify to use syntactic join order if you want to control the join ordering in a more complex way.
I find the existing syntax complex and hard to use.
DB2 supports 'join requests' similar to these join hints. It has variants that specify the join method too.
Using a join hint allows us to:
Got it. I see the appeal of specifying a partial join order for auto generated queries, but also would like it to be easy to specify join hints for the common case where people just want to mark some table for broadcast or similar. I wonder if there's an alternative approach we can use to achieve both of these goals
We could add Spark-style independent hints -
- `BROADCAST(T1)` - Always broadcast T1 if chosen as the build side of a join
- `BROADCAST(a b)` - If the optimizer chooses a logical join of `a` and `b`, and this is a sub-tree of another join graph, use BROADCAST for the join distribution

These will be complementary to the join-order syntax. So the below join hints are equivalent but provide flexibility to the user -
Will think of more examples and incorporate them
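One possible reading of these two hint forms can be sketched in Python as follows. The function name `choose_distribution`, the representation of hints as sets of table names, and the `AUTOMATIC` fallback are illustrative assumptions, not part of the proposal. In particular, treating `BROADCAST(a b)` as "the join whose two sides together cover exactly `a` and `b`" is only one interpretation of the description.

```python
from typing import FrozenSet, Iterable


def choose_distribution(
    probe_tables: FrozenSet[str],
    build_tables: FrozenSet[str],
    broadcast_hints: Iterable[FrozenSet[str]],
    default: str = "AUTOMATIC",  # assumed fallback: let the optimizer decide
) -> str:
    """Pick a join distribution for one candidate join under the assumed semantics."""
    for hint in broadcast_hints:
        # BROADCAST(T1): applies only when T1 alone is the build side.
        if len(hint) == 1 and hint == build_tables:
            return "BROADCAST"
        # BROADCAST(a b): applies when this join is exactly the join of a and b.
        if len(hint) > 1 and hint == probe_tables | build_tables:
            return "BROADCAST"
    return default
```

With the hint `BROADCAST(T1)`, a candidate join that has `T1` on the build side gets `BROADCAST`, while a candidate with `T1` on the probe side falls back to the default. That keeps the hint a statement about the table rather than about the whole join order.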
I would rather not expose this kind of control. Better to specify what you want to have happen with this table, and otherwise leave it be. Seems pretty risky to encode cardinality estimates in the query text.
Can you elaborate on the risk? I see this `card` estimate as a simple way to fix up estimation errors (see the `WITH CTE` example at L144 below).
The risk I see is that if you are specifying a cardinality of x, you are probably doing it because you want the optimizer to do some particular thing about it. But you don't really know what the optimizer will do with that information, and it could do one thing for a while, and then in a new release there's an optimizer change and it does something else. Because you aren't directly controlling what happens when you specify the cardinality, it's hard to say how it might affect the query, and could be hard to debug if the performance degrades (vs. if you e.g. specify broadcast join and your data gets bigger, it's very clear what happens)
It can also get out of sync with the data (there is always a risk with hand tuning that the optimization will no longer be relevant or will perform worse as the data changes, but specifying a specific cardinality estimate can have more varied and unknown effects).
I agree that this is a powerful knob to give to the users, one that we can document as such (with caveats).
To aid debugging, we can add `CardEstimateBasedSourceInfo`, similar to how we have a `HistoryBasedSourceInfo`, that will make it clear to the users how the cardinality was arrived at in EXPLAIN/EXPLAIN ANALYZE (and event listener etc.).
Additional safeguards like warnings & metrics can be incorporated too if the stats estimate differs widely from the actual runtime observed cardinality.
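The safeguard mentioned here could look roughly like the Python sketch below. Everything in it is hypothetical: the function name `cardinality_warning`, the ratio-based comparison, and the default threshold of 10x are assumptions chosen for illustration, not something specified in this PR.

```python
from typing import Optional


def cardinality_warning(
    hinted_rows: int,
    observed_rows: int,
    max_ratio: float = 10.0,  # assumed threshold; would need tuning in practice
) -> Optional[str]:
    """Return a warning message if the hinted and observed row counts diverge widely."""
    # Clamp to 1 so that empty inputs do not cause a division by zero.
    larger = max(hinted_rows, observed_rows, 1)
    smaller = max(min(hinted_rows, observed_rows), 1)
    ratio = larger / smaller
    if ratio > max_ratio:
        return (
            f"cardinality hint of {hinted_rows} rows differs from the observed "
            f"{observed_rows} rows by a factor of {ratio:.1f}"
        )
    return None
```

A check of this kind addresses the "out of sync with the data" concern only after the fact: it reports that a hint has gone stale, but it does not say what the optimizer did differently because of it.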
We will have an option for constraint within cte also I presume. I won't push on it, but it would be good to add that example as well.