GiST support #42
Force-pushed: e2c18e1 to 83a6571
There is still a logical error in the current implementation, as this simple test shows:

```sql
\set hexagon '\'831c02fffffffff\'::h3index'
CREATE TABLE h3_test_gist (hex h3index);
CREATE INDEX GIST_IDX ON h3_test_gist USING gist(hex);
-- insert immediate children
INSERT INTO h3_test_gist (hex) SELECT h3_to_children(:hexagon);
-- num children is 7
SELECT COUNT(*) FROM h3_test_gist WHERE hex <@ :hexagon;
7
-- insert deeper children
INSERT INTO h3_test_gist (hex) SELECT h3_to_children(:hexagon, 8);
-- num children is ... ZERO?
SELECT COUNT(*) FROM h3_test_gist WHERE hex <@ :hexagon;
0
```

Any help figuring it out is much appreciated. ❤️ |
Using my SQL function, with the very same logic I ported to C:

```sql
WITH a AS (SELECT h3_to_children('831c02fffffffff'::h3index, 8) AS h)
SELECT
count(*)
FROM a
WHERE h3_contains('831c02fffffffff'::h3index, a.h);
```

Returns Weird ¯\(ツ)/¯ |
That's the problem. The regular operations work fine, but they break when using the GiST index as it is currently implemented in this branch:

```sql
CREATE TABLE t (hex h3index);
INSERT INTO t (hex) SELECT h3_to_children('831c02fffffffff', 8);
SELECT COUNT(*) FROM t WHERE hex <@ '831c02fffffffff';
16807
-- after adding gist index
CREATE INDEX idx ON t USING gist(hex);
SELECT COUNT(*) FROM t WHERE hex <@ '831c02fffffffff';
0
```
It is the algorithm implemented in |
The logic used in Maybe something is lost in translation 😞. Does the C function work as expected when run standalone? I will give it some more love this afternoon. |
Ah sorry for the confusion -- as far as I can tell, your I think the problem lies in the functions

```c
// opclass_gist.c
h3index_gist_consistent
h3index_gist_union
h3index_gist_penalty
h3index_gist_picksplit
h3index_gist_same
```

which together define the behavior of our GiST implementation. |
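For anyone checking those methods, here is a minimal standalone sketch of the two properties the index keys have to satisfy for `<@` to work: a containment test, and a "union" key that covers both of its inputs. It is not the code from this branch. All helper names (`cell_resolution`, `cell_base`, `cell_digit`, `shared_prefix_len`, `cell_contains`, `common_ancestor`) are hypothetical, and it assumes the documented H3 bit layout (resolution in bits 52-55, base cell in bits 45-51, fifteen 3-bit digits below that). It also assumes a cell contains itself, and it uses 0 as a stand-in key when two cells share no base cell.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed H3 bit layout: resolution in bits 52-55, base cell in bits 45-51,
   then 15 three-bit digits (digit 1 at bits 42-44, digit 15 at bits 0-2). */
#define RES_OFFSET 52
#define RES_MASK   UINT64_C(0xF)
#define BC_OFFSET  45
#define BC_MASK    UINT64_C(0x7F)
#define DIGIT_BITS 3
#define DIGIT_MASK UINT64_C(0x7)
#define MAX_RES    15

int cell_resolution(uint64_t h) { return (int)((h >> RES_OFFSET) & RES_MASK); }
int cell_base(uint64_t h)       { return (int)((h >> BC_OFFSET) & BC_MASK); }

/* Digit r (1..15) selects the child taken at resolution r. */
int cell_digit(uint64_t h, int r) {
    return (int)((h >> ((MAX_RES - r) * DIGIT_BITS)) & DIGIT_MASK);
}

/* Number of leading digits a and b share, or -1 if the base cells differ. */
int shared_prefix_len(uint64_t a, uint64_t b) {
    if (cell_base(a) != cell_base(b)) return -1;
    int ra = cell_resolution(a), rb = cell_resolution(b);
    int limit = ra < rb ? ra : rb;
    int n = 0;
    while (n < limit && cell_digit(a, n + 1) == cell_digit(b, n + 1)) n++;
    return n;
}

/* True if b equals a or is a descendant of a. */
bool cell_contains(uint64_t a, uint64_t b) {
    return cell_resolution(a) <= cell_resolution(b)
        && shared_prefix_len(a, b) == cell_resolution(a);
}

/* Finest cell containing both a and b; 0 if they share no base cell
   (sentinel chosen for this sketch only). */
uint64_t common_ancestor(uint64_t a, uint64_t b) {
    int p = shared_prefix_len(a, b);
    if (p < 0) return 0;
    uint64_t h = (a & ~(RES_MASK << RES_OFFSET)) | ((uint64_t)p << RES_OFFSET);
    for (int r = p + 1; r <= MAX_RES; r++)   /* unused digits are all ones */
        h |= DIGIT_MASK << ((MAX_RES - r) * DIGIT_BITS);
    return h;
}
```

With keys built this way, every entry below an internal key `k` satisfies `cell_contains(k, entry)`. A consistent check for `hex <@ q` on an internal page could then descend whenever `k` and `q` are related in either direction, i.e. `cell_contains(k, q) || cell_contains(q, k)`. Treat that as an assumption to verify against the GiST extensibility docs, not as a description of what the branch does.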
Is work still being done on this pull request or is it abandoned? Having GIST indices would be great. |
I haven't touched the branch in a year, but I don't wish to abandon it. We need to correctly define the GiST operator class methods for the H3 data type. Our efforts so far (opclass_gist.c in this branch) result in failures in our basic test suite. Any help would be appreciated (see #42 (comment))! |
Updated reference: https://www.postgresql.org/docs/13/gist-extensibility.html |
Force-pushed: 6c410a6 to 4a532cc
Hey guys, software engineer here from a company that competes in the broad realm of "location-based services". First off, much appreciation for all the work on this project!

We are using H3-indexed data in various applications, some of which might greatly benefit from index-accelerated queries. Especially for checking containment of a cell within a set of compacted cells (so I had a quick GPT session to understand how GiST indices are designed and I think I'm pretty clear on the topic now. I'll leave the chat here for others to check out. The example I chose in that chat features concepts that seem very close to how a GiST implementation for H3 could look, and from what I saw in opclass_gist.c, this seems to be exactly the path that you are currently on.

Have there been any new insights in the meantime that are not reflected in the branch or this discussion?

I'm planning to set up an environment via Docker to get going. Is it possible to debug the extension live, e.g. using GDB? Any advice on setting up a dev environment is appreciated. I'm pretty seasoned in C(++) but have never touched a native PSQL extension. |
I'm interested in GiST and GIN support as well. Also looking at porting over |
Hi @mattiZed! Any help would be appreciated, no progress has been made outside this discussion. :-) I haven't managed to get GDB to work. At one point I had a I basically do this:
|
Thanks @zachasme Edit: Alright, managed to compile it just fine and got to the point where the tests are failing. I will try to dedicate some time to this next week. |
Hm, I couldn't help it and fumbled a bit already. I changed my test to perform just this:
Output:
I quickly confirmed via an SQL instance running h3-pg 4.1.2 that this seems indeed correct:
Am I missing something? Edit: Also some timings:
So, while the index seems to perform correctly, at least in this little test (which is not really a meaningful query you would use in an application, after all), it's not of much use. I guess this is not using any binary optimizations to leverage the h3index bit layout? However, if I do something more meaningful:
|
Hm, you are right, I've rebased the branch and it does seem to return the correct results. I actually don't recall what the issue was. I would like to come up with a good test with a large amount of inserts and deletions over a range of resolutions, in order to verify my picksplit/consistent algorithm. Thanks for picking this up @mattiZed. If we feel somewhat comfortable about the correctness, I could release a new version and you/others could start testing on real data. |
Hey @zachasme thanks for the reply. I'll pull your changes and do some more digging. I'd also like to spark some discussion here. While it seems to me that the current implementation yields correct results (bear in mind that all I have done so far is run some quick checks in a psql terminal), I don't yet understand whether it's particularly efficient or "exploitative" of the H3Index properties.

Primer: No offense. This is not meant to come off as condescending. I understand that you might have invested a lot more thought in this topic than I currently have. I just want to share my thoughts :)

I'm still working out how the GiST platform is really designed and I would like to take a deeper look at what

Debugging
so VSC Editor operations inside the container are not done by the root user (avoids having to chmod/chown all the time)
Thoughts

Penalty

For some reason, I was initially under the assumption that GiST is a binary tree behind the scenes, so that each internal node can have exactly two children. By now I think this is not correct and an internal node can probably have an arbitrary number of children - ultimately this behaviour seems to be governed by the

I tried to think of what the index tree really should look like, if we forget about GiST for a moment. Considering an example where we have a "highly populated" index, the index tree could look like this: up to 122 nodes in the first generation (the base cells), and then each node in any following generation could have up to 7 child nodes. We could also think of the index as up to 122 "hepternary" trees, since each tree has a base cell as its root node and each child node could in turn have up to 7 child nodes.

I hope I made sense so far. I plan to look closely under the debugger at what happens if we do e.g.
I would expect the index to contain

Picksplit

I initially thought that this somehow splits what is connected to an internal node. However, the documentation clearly states "index page". I'm not clear whether the terms "internal node" and "index page" can be used interchangeably here, or whether they are totally different concepts. For now, I decided to treat this function exactly as the documentation states: it should split a vector of elements.

I was creating some console output with the DEBUG macro and I was confused that the split seemed very unbalanced all the time, I think (the left side and the right side contained e.g. 4 vs 400 elements).

What makes the H3 picksplit different from other approaches I have seen (e.g. when building a GiST for bounding-box data types): we cannot easily create an "enclosing" feature that represents one side of the split and contains exactly the members of that side, as we could in the bounding-box example. We are bound to the H3 grid, and many times the representatives of both sides of the split are exactly the same index. I think splitting could be improved if we
So that, for example, we could end up with a split that contains 2 children in left and 4 children in right. But the 2 children in left might cover 10 entries each, where the 4 children in right cover 6 entries each. I hope this made sense as well. Would love to hear your thoughts. |
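One possible reading of the weighting idea above, as a standalone sketch: group the page's entries by child of their common ancestor, count the entries per group, and assign whole groups to the left or right side so the entry totals (not the group counts) stay close. The function name `split_groups_by_weight` and the greedy largest-first strategy are assumptions made for this illustration. They are not taken from the branch or from the PostgreSQL docs.

```c
#include <stddef.h>

/* counts[i] = number of index entries that fall under child group i.
   On return, side[i] is 0 (left) or 1 (right) for every group.
   Greedy: repeatedly take the largest unassigned group and put it on
   whichever side currently holds fewer entries. */
void split_groups_by_weight(const size_t *counts, size_t n, int *side,
                            size_t *left_total, size_t *right_total) {
    size_t left = 0, right = 0;
    for (size_t i = 0; i < n; i++) side[i] = -1;   /* -1 = unassigned */

    for (size_t placed = 0; placed < n; placed++) {
        size_t best = n;
        for (size_t i = 0; i < n; i++) {
            if (side[i] < 0 && (best == n || counts[i] > counts[best]))
                best = i;
        }
        if (left <= right) { side[best] = 0; left  += counts[best]; }
        else               { side[best] = 1; right += counts[best]; }
    }
    *left_total = left;
    *right_total = right;
}
```

For group sizes {10, 10, 6, 6, 6, 6} this produces 22 entries on each side. For {400, 4} it can only produce 400 vs 4, because a single group cannot be divided at this level. That may be related to the lopsided splits seen in the DEBUG output, although this has not been verified against the branch.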
Hi @mattiZed, sorry for the late reply. And thank you for looking into this! I got stuck on picking a
I have no idea if the right path forward is to optimize on that initial algorithm, or go with something completely different. |
Replaces #7. Part of #5.
References:
I have closed the old PR which contained code for both GiST and SP-GiST, splitting it into two separate PRs. This PR also includes the work by @AbelVM in #40.
The PR builds and runs, but basic tests fail (see test/sql/opclass_gist.sql), so there is some logical error. Any help figuring it out is appreciated. ❤️

We should also consider either 1) removing the unused parts of h3Index.h, 2) simply writing out the few macros we need, or 3) getting the macros pulled into the public API upstream.