Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impossible to create indices on expressions involving some h3 functions in PosrgreSQL 17 #165

Open
spaghettiguru opened this issue Jan 1, 2025 · 9 comments
Assignees
Labels
help wanted ⛏️ Extra attention is needed needs more info 🖐️ More info is required

Comments

@spaghettiguru
Copy link

spaghettiguru commented Jan 1, 2025

First of all, thanks for maintaining this, to everyone involved!

The issue

Given the table defined like this:

CREATE TABLE my_table (
    h3cell h3index
)

An attempt to create the following index on it CREATE INDEX ON my_table USING GIST (h3_cell_to_geometry(h3cell)) fails in Postgres >= 17 .

This seems to happen because of the changes in Postgres 17 around index creation which make accessing non-standard schema-unqualified object references during index creation to fail, since PostgreSQL now temporarily modifies the search_path when running some operations and these objects cannot be found.
Here is the quote from the PostgreSQL 17 release notes:

Change functions to use a safe search_path during maintenance operations

This prevents maintenance operations (ANALYZE, CLUSTER, CREATE INDEX, CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, REINDEX, or VACUUM) from performing unsafe access. Functions used by expression indexes and materialized views that need to reference non-default schemas must specify a search path during function creation.

They also updated the docs for CREATE INDEX command to contain this:

While CREATE INDEX is running, the search_path is temporarily changed to pg_catalog, pg_temp.

So I guess the solution would be to qualify the references to extension objects inside h3 functions with the right schema - where needed.

@spaghettiguru spaghettiguru changed the title Impossible to create indices on expressions involving h3 functions in PosrgreSQL 17 Impossible to create indices on expressions involving some h3 functions in PosrgreSQL 17 Jan 1, 2025
@jmealo
Copy link
Contributor

jmealo commented Jan 2, 2025

@spaghettiguru: Hey! It's my understanding that GiST support is pending:
#42

I'm having a hard time understanding the use case for converting a cell back into geometry and then doing a geospatial index on that... I appreciate that you've slimmed down to a minimally reproducible example, however, it's a little unclear to me why you'd create an index on that particular expression.

Could you explain your use case? (or was it just to illustrate this issue 🤔 )

@zachasme
Copy link
Owner

zachasme commented Jan 2, 2025

Hi @spaghettiguru

Like @jmealo, this is not something I have any experience with. Can you elaborate on how to "qualify the references to extension objects inside h3 functions with the right schema"?

@zachasme zachasme added help wanted ⛏️ Extra attention is needed needs more info 🖐️ More info is required labels Jan 2, 2025
@zachasme zachasme self-assigned this Jan 2, 2025
@spaghettiguru
Copy link
Author

spaghettiguru commented Jan 5, 2025

Hi @jmealo, the index creation command from my original post is the real thing we are doing and not merely an example.
The reason for creating a geometry index from h3index column is that our queries often do bounding box or polygon intersection with h3index column values to find the rows of interest, e.g. ST_Intersects(h3_cell_to_geometry(h3cell), ST_MakeEnvelope(1,2,3,4,5, 4326)). From what I understand, GiST index will not help much here since it will only accelerate testing inclusion of h3 cells in a larger h3 cell or something like that.
I am new to PostGIS/H3 stuff so I will not be surprised if there is a better way to achieve what we are doing, so your (or anybody else's) advice, in case there is one, is very welcome!
In any case, even if there is a workaround for this specific case that does not involve creating such an index, it still seems that this issue should be fixed since a. somebody else my need to create some kind of index on expression using one of the h3 extension functions and b. the issue is not with index creation only but with other maintenance operations as well, for example PostgreSQL docs say that this breaking change also affects creation of materialized views, among other things.

@zachasme, hi man. I do not have experience with writing or maintaining PostgreSQL extensions, so I am not sure I am able to provide the complete solution. However, I did a little digging and had a look at how PostGIS extension dealt with this breaking PostgreSQL change and what they did is they qualified the references to extension objects in functions using a reserved symbol @extschema@ that PostgreSQL seems to provide for the purpose of referencing the current extension's schema. PostgreSQL docs mention this in this section. This paragraph I found when looking for a link to docs to post here may bring us even closer to the final solution:

If an extension references objects belonging to another extension, it is recommended to schema-qualify those references. To do that, write @extschema:name@ in the extension's script file, where name is the name of the other extension (which must be listed in this extension's requires list). This string will be replaced by the name (double-quoted if necessary) of that extension's target schema.

Hope it will help.

@jmealo
Copy link
Contributor

jmealo commented Jan 5, 2025

This is super helpful.

What are you trying to do at a higher level and how many records are you dealing with?

That would help see if there's a better way.

For context: Using a bounding box to do a coarse filter to reduce the number of distance calculations is something we'd do to speed up queries in PostGIS sans-H3.

To help you find the best way to do something we'll need to start before applying the bounding box and focus on your end goal(s).

There are often h3 native ways to do things as well depending on your use case.

If you're new to geospatial or H3: Understanding the different zoom levels and how h3 works can be very helpful. I'll admit, the examples of how to do some of the more advanced things in h3 (particularly using h3-pg) is lacking. The bright side is that the h3 documentation is pretty excellent and you'll want to skim that.

What's really missing is a cook book/set of recipes/patterns. I'd be interested in helping to compile that.

@spaghettiguru
Copy link
Author

@jmealo , thanks man,
here is more detailed description of what we do:
we have a table with a column of type h3index and a few additional columns containing some associated data. The table is pretty big - on average 11 million new rows a day. Most of our queries return rows that have their h3index column value (transformed to geometry) intersect with a polygon or bbox, in addition to some other filters. The results of these queries are drawn on the map as H3 cells filled with colors derived from the data associated with a cell.

What I would expect ideally is that H3 support a function that accepts a polygon and an H3 resolution and returns the list of cell indexes in that resolution which intersect with that polygon. From what I understand, there is an ongoing effort to support the new H3 function that returns the list of cells which intersect with a given polygon but IMO that is not ideal when dealing with a large polygon and a high H3 resolution since the number of returned cells can be huge in such case.

@jmealo
Copy link
Contributor

jmealo commented Jan 7, 2025

@spaghettiguru The good news is, for your use case, you shouldn't need to convert between H3 and geometry at all. This completely sidesteps/avoids the issue reported in this ticket. It was a little unclear for me how to use h3-pg properly, and, nobody has written extensive usage examples on it yet. Let me know if the following is helpful, perhaps we'll adapt it to the official documentation, or I'll make a blog post.

Here's a contrived example demonstrates how to efficiently handle spatial + temporal queries in PostgreSQL using H3 indexes. It shows a pattern that works well for complex use cases (multiple geometry types, high volume, temporal data) while noting where simplifications can be made for simpler scenarios.

The example uses a parking lot management system to illustrate the concepts, but the patterns can be applied to any spatial + temporal data.

Helper functions

These convert geography/geometry to a set of H3 cells at a given resolution.

@zachasme and I are discussing adding these (or something similar) to h3-pg, let us know if they handle your use case well.

create function public.h3_geography_to_cells(geog geography, resolution integer) returns SETOF h3index
    immutable
    strict
    parallel safe
    language plpgsql
as
$$
DECLARE
    geom_type text;
BEGIN
    -- Get the geometry type
    geom_type := ST_GeometryType(geog::geometry);

    -- Handle different geometry types
    CASE
        WHEN geom_type = 'ST_Point' THEN
            -- For points, use h3_geo_to_h3
            RETURN QUERY SELECT h3_lat_lng_to_cell(geog, resolution);

        WHEN geom_type = 'ST_LineString' THEN
            -- For linestrings, convert to polygon and use h3_polygon_to_cells
            RETURN QUERY SELECT h3_polygon_to_cells(ST_Buffer(geog, 0.00001), resolution);

        WHEN geom_type IN ('ST_Polygon', 'ST_MultiPolygon') THEN
            -- For polygons and multipolygons, use h3_polygon_to_cells directly
            RETURN QUERY SELECT h3_polygon_to_cells(geog, resolution);

        ELSE
            RAISE EXCEPTION 'Unsupported geometry type: %', geom_type;
    END CASE;
END;
$$;

create function public.h3_geography_to_cells_buffered(geog geography, resolution integer, buffer_meters double precision DEFAULT 0) returns SETOF h3index
    immutable
    strict
    parallel safe
    language plpgsql
as
$$
DECLARE
    buffered_geog geography;
BEGIN
    -- Apply buffer if specified
    IF buffer_meters > 0 THEN
        buffered_geog := ST_Buffer(geog, buffer_meters)::geography;
    ELSE
        buffered_geog := geog;
    END IF;

    -- call existing function
    RETURN QUERY SELECT * FROM h3_geography_to_cells(buffered_geog, resolution);

END;
$$;

create function public.h3_geometry_to_cells(geom geometry, resolution integer) returns SETOF h3index
    immutable
    strict
    parallel safe
    language plpgsql
as
$$
DECLARE
    geom_type text;
BEGIN
    -- Get the geometry type
    geom_type := ST_GeometryType(geom);

    -- Handle different geometry types
    CASE
        WHEN geom_type = 'ST_Point' THEN
            -- For points, use h3_geo_to_h3
            RETURN QUERY SELECT h3_lat_lng_to_cell(geom, resolution);

        WHEN geom_type = 'ST_LineString' THEN
            -- For linestrings, convert to polygon and use h3_polygon_to_cells
            RETURN QUERY SELECT h3_polygon_to_cells(ST_Buffer(geom, 0.00001), resolution);

        WHEN geom_type IN ('ST_Polygon', 'ST_MultiPolygon') THEN
            -- For polygons and multipolygons, use h3_polygon_to_cells directly
            RETURN QUERY SELECT h3_polygon_to_cells(geom, resolution);

        ELSE
            RAISE EXCEPTION 'Unsupported geometry type: %', geom_type;
    END CASE;
END;
$$;

Example schema

I didn't test these queries exactly, but. they're adapted from a working schema.

-- Core tables
CREATE TABLE lots (
    lot_id uuid PRIMARY KEY,
    lot_name text NOT NULL,
    location geography(POINT) NOT NULL
);

CREATE TABLE spots (
    spot_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    lot_id uuid REFERENCES lots,
    spot_number text NOT NULL,
    created timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL,
    deleted timestamptz
);

-- H3 spatial index table
CREATE TABLE spot_hexagons (
    spot_id bigint NOT NULL REFERENCES spots,
    hex_id h3index NOT NULL,
    PRIMARY KEY (spot_id, hex_id)
);

CREATE INDEX ON spot_hexagons(hex_id);

-- Example occupancy history partitioned by hash
CREATE TABLE occupancy_history (
    spot_id bigint NOT NULL REFERENCES spots,
    status text NOT NULL, -- 'occupied', 'available', 'reserved', etc
    created timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL,
    expires_at timestamptz,
    PRIMARY KEY (spot_id, created)
) PARTITION BY HASH (spot_id);

Example query

This query finds parking spots and their current occupancy.

WITH input_point AS (
    -- Convert input point to geography 
    SELECT ST_GeomFromGeoJSON(%L)::geography AS geog
),
nearby_hexes AS (
    -- Get H3 cells within buffer of point
    SELECT h3_geography_to_cells_buffered(
        (SELECT geog FROM input_point),
        %(resolution)s,  -- e.g. 9 
        %(radius_m)s    -- e.g. 1000
    ) AS hex_id
),
matching_spots AS (
    -- Find spots in those H3 cells
    SELECT DISTINCT s.spot_id,
           s.lot_id,
           s.spot_number,
           l.lot_name,
           s.created,
           s.deleted
    FROM spots s
    JOIN spot_hexagons h ON s.spot_id = h.spot_id
    JOIN lots l ON s.lot_id = l.lot_id
    WHERE h.hex_id IN (SELECT hex_id FROM nearby_hexes)
      AND (s.deleted IS NULL OR s.deleted > NOW())
)
-- Get latest occupancy for each matching spot
SELECT s.*,
       o.status AS occupancy_status,
       o.created AS status_updated,
       o.expires_at
FROM matching_spots s
LEFT JOIN LATERAL (
    SELECT spot_id, status, created, expires_at 
    FROM occupancy_history
    WHERE spot_id = s.spot_id
      AND created <= NOW()
    ORDER BY created DESC
    LIMIT 1
) o ON true;

Explanation

This example demonstrates an efficient pattern for spatial + temporal queries using H3 indexes in PostgreSQL using h3-pg and PostGIS.

Schema Design Pattern

The schema follows these principles:

  • Core entities (lots, spots) contain business data
  • H3 spatial indexing (spot_hexagons) maintained separately
  • Historical data (occupancy_history) partitioned for write performance
  • Clear separation between current state and history

Why This Pattern Works Well

Efficient Spatial Queries:

  • Uses H3 as a pre-filter before expensive PostGIS operations
  • Hexagon lookups are extremely fast with proper indexing
  • Reduces the dataset before any exact distance calculations

Complex Geometries Support:

  • Helper functions handle any geometry type (points, lines, polygons)
  • One-to-many relationship between spots and hexagons handles objects spanning multiple cells
  • Buffering support for radius searches

Temporal Data Management:

  • Partitioned history table scales well for high-volume writes
  • LATERAL joins efficiently fetch latest status
  • Soft deletes preserve history while maintaining performance

Partitioning matters

  • Historical queries for a given entity can be optimized depending on how you partition:
  • Having all the history for an entity in one table/index allows for an index-only scan
  • Having the history for entity spread across partitions (likely by time) allows for query parallelization
    Given your data volume, you'll probably end up partitioning, and may want to test both strategies. If your DB instance isn't going to have a ton of CPU cores, you're likely better off with a partitioning strategy that optimizes for index-only scans.

When to Use This Pattern

Best suited for:

  • Large datasets (hundreds of millions of records)
  • Mixed geometry types
  • High write volume for status changes
  • Frequent spatial radius/polygon searches
  • Need for historical state tracking
  • Distributed systems (UUID) -- otherwise use BIGINT for identity columns

Consider simplifying if:

  • All geometries are points (can use single H3 column)
  • Dataset is small (standard PostGIS might be sufficient)
  • No need for temporal history
  • Single-resolution queries only

⚠️ If you can represent everything by using a particular resolution or your data is point data, you should remove the hexagons table entirely.

Implementation Notes

Resolution Choice:

  • Higher resolution = more precise but more storage
  • Choose based on typical query radius and geometry size
  • Single resolution per table (add tables if multiple needed)
  • Reference this table: https://h3geo.org/docs/core-library/restable/

Helper Functions:

  • h3_geography_to_cells: Converts any geography to H3 cells
  • h3_geography_to_cells_buffered: Adds radius support
  • Makes complex geometry handling transparent
  • You can create geometry variants using casts if necessary.

Query Structure:

  • CTE-based for clarity and potential materialization

  • Filters early on H3 cells

  • Uses LATERAL joins for efficient latest-state lookup

  • If precision is critical, after all filtering is done, you can add a filter:
    AND ST_DWithin(l.location, input_point.geog, %(radius_m)s) This should be one of the last operations since it's more computationally expensive than H3 lookups. By running it on the filtered dataset, we minimize the number of distance calculations.

  • You need to make use of EXPLAIN (ANALYZE, VERBOSE, BUFFERS) to make sure the query is making good use of your indexes and not doing unnecessary calculations.

  • Depending on how complex your CTE is and whether any portion of the CTE is referenced multiple times, you'll want to familiarize yourself with materialize, see: https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-CTE-MATERIALIZATION

Performance Considerations:

  • Index on hex_id crucial for performance
  • Partitioning helps manage large history tables
  • Consider maintenance scripts for orphaned hexagon records if foreign key constraints impact performance negatively

Maintaining the Hexagon Table

When storing both raw geometries and H3 indexes, you'll want to keep them in sync. Here's a pattern using triggers:

-- Add geometry column to spots table
ALTER TABLE spots 
    ADD COLUMN location geography(POINT);

-- Function to calculate hexagons for a spot
CREATE OR REPLACE FUNCTION update_spot_hexagons()
    RETURNS TRIGGER
    LANGUAGE plpgsql
AS $$
BEGIN
    -- Delete existing hexagons for this spot
    DELETE FROM spot_hexagons 
    WHERE spot_id = NEW.spot_id;

    -- Insert new hexagons
    -- Note: Resolution (9) should be configured based on your needs
    INSERT INTO spot_hexagons (spot_id, hex_id)
    SELECT NEW.spot_id, h3_index
    FROM h3_geography_to_cells(NEW.location, 9) AS h3_index;

    RETURN NEW;
END;
$$;

-- Trigger to maintain hexagons on insert/update
CREATE TRIGGER maintain_spot_hexagons
    AFTER INSERT OR UPDATE OF location
    ON spots
    FOR EACH ROW
    WHEN (NEW.location IS NOT NULL)
    EXECUTE FUNCTION update_spot_hexagons();

-- Optional: Trigger to clean up hexagons on delete
CREATE TRIGGER cleanup_spot_hexagons
    AFTER DELETE
    ON spots
    FOR EACH ROW
    EXECUTE FUNCTION 
        (DELETE FROM spot_hexagons WHERE spot_id = OLD.spot_id);

Benefits of the trigger approach

  • Keeps original geometry for exact calculations when needed
  • Automatically maintains H3 indexes
  • Handles updates/deletes cleanly
  • Can be extended to handle multiple resolutions

Note that for bulk operations, you might want to disable triggers temporarily and rebuild the hexagon table using a batch process for better performance.

If your geometries are complex or you're indexing at multiple resolutions, consider adding appropriate indexes and potentially partitioning the hexagon table as well. The exact strategy will depend on your query patterns and data volume.

Further Reading

H3 Resources

PostgreSQL Resources

PostGIS Resources

@spaghettiguru
Copy link
Author

@jmealo thanks for such an elaborate response, man, this is very nice of you. While I was familiar with some of this stuff, there is a lot of useful info here and some great tips!
While I see how your suggestion of using h3index column indexing + h3_polygon_to_cells + filtering the rows by checking for inclusion in the resulting cells list works well in some scenarios, I am not sure it can be a substitute for doing ST_Intersect in others. I have 2 main concerns with this approach:

  1. h3_polygon_to_cells may return millions of items at some H3 resolutions which may make it less effective when the returned cells are compared against in WHERE clause.
  2. h3_polygon_to_cells only returns the cells that are fully contained inside the given polygon which makes it less precise in cases when partial containment is also of interest (although you suggested how it may be addressed, it still doesn't seem very precise)

In any case, we will give it a try and see if it solves the original issue.
I suggest not to close this issue yet.

Thanks again for your help!

@jmealo
Copy link
Contributor

jmealo commented Jan 9, 2025

@spaghettiguru: No problem! Glad I could help.

You raise good points about the trade-offs, but let me clarify something important about operation costs. While a million H3 comparisons might sound expensive, they're actually much cheaper than even a few hundred geospatial operations, especially with complex geometries. This is because H3 comparisons are basically integer operations with good memory locality, while geospatial operations involve complex floating-point math and potentially scattered memory access.

A few key points to consider:

Query Optimization:

  • You can use JOIN with set-returning functions instead of WHERE clauses (though this is more of an implementation detail)
  • The key is picking the right H3 resolution for your use case - we use resolution 7 because it gives us the right balance of precision and performance

Mental Model:

  • Think of H3 as your first-pass filter
  • Geospatial queries are expensive (even with indexes)
  • H3-indexed queries behave more like regular B-tree indexed queries - they're just integer comparisons

Best Practices:

  • Structure your CTEs to do H3 filtering first
  • Only perform PostGIS operations on the pre-filtered result set
  • Always verify your query plan with EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
  • Use Index Only Scans where possible - if your H3 index can answer the query, it'll be dramatically faster than any geospatial operation

The key insight is that H3 trades exact calculations for fast approximations. Use it to quickly narrow down your dataset, then apply precise geospatial operations only on that smaller set. When properly indexed, these queries will perform more like standard SQL queries rather than expensive geospatial ones.

This strategy works best when you can standardize on a single resolution per table/column, as it simplifies both indexing and query patterns.

@spaghettiguru
Copy link
Author

Thanks a lot, @jmealo , these are important insights that will surely be useful not just to our team but to others as well. Really grateful for your help! Have a great week!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted ⛏️ Extra attention is needed needs more info 🖐️ More info is required
Projects
None yet
Development

No branches or pull requests

3 participants