-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Temp files not deleted on save for spatial addon #439
Comments
Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it. Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:
If this is a false positive, feel free to disregard this comment. |
Hi! DuckDB spatial (like almost all geospatial libraries) uses GDAL to write FGB files. We don't know where, how and when GDAL creates temporary files so it's not really feasible for us to clean them up after a crash. With that said I also don't think you will see any meaningful differences between DuckDBs FGB output performance compared to other systems, as we all use GDAL under the hood. In essence, you'd only be benchmarking how fast DuckDB can convert geometries into GDAL features, but thats not making the most out of duckdb as it is a single-threaded row-by-row process. I'd actually expect e.g. Geopandas to be a lot faster at doing that now because they can pass geoarrow arrays directly to GDAL while we still construct the geometries one at a time. |
I want to make clear, that these temporary files are neither deleted upon crash nor on happy path of hyperfine. But in the end it should not matter, as when hyperfine crashes on the x-th run, than the python code closed for each previous run successfully. To the off-top of profiling, fgb. As opening and saving files are an integral part of the analysis, I need to be profiled two. I only used fgb, as I could not write to gpkg. I also wrote to geoparquet and that was really fast, I think about 30 % even faster than GeoPandas. |
What happens?
I tried to benchmark the spatial Plugin.
With hyperfine I did 13 repetitions.
Each creating a fgb-file.
The open space on drive C: was reduced heavily. Benchmark crashed, when no save left.
Found out that on the Temp dir a copy of each output run was stored und never freed.
See Code:
https://github.com/sehHeiden/geospeed/blob/master/geospeed/duckdb_speed.py
To Reproduce
"""Test the execution speed for an equivalent of Geopandas Overlay with intersection using a spatial index."""
import time
from pathlib import Path
import duckdb
save_name = None
con = duckdb.connect(save_name if save_name else ":memory:", config={"threads": 15, "memory_limit": "20GB"})
con.install_extension("spatial")
con.load_extension("spatial")
con.sql("DROP TABLE IF EXISTS buildings")
con.sql("DROP TABLE IF EXISTS parcels")
con.sql("DROP TABLE IF EXISTS buildings_intersection")
con.sql("DROP VIEW IF EXISTS intersections")
Check supported formats for writing
formats = con.sql("SELECT short_name FROM ST_Drivers() WHERE can_create;").fetchall()
print(f"Writable formats: {formats}")
start = time.time()
shapefile_dir = Path("./ALKIS") # Base directory
building_files = list(shapefile_dir.glob("/GebauedeBauwerk.shp")) # Glob pattern for the subdirectories
parcel_files = list(shapefile_dir.glob("/NutzungFlurstueck.shp")) # Glob pattern for the subdirectories
Error handling if no files are found
if not building_files or not parcel_files:
load_error_txt = "No shapefiles found in the provided directory."
raise FileNotFoundError(load_error_txt)
Create then insert
con.sql(f"CREATE TABLE buildings AS SELECT * FROM ST_Read('{building_files[0].resolve()!s}');") # noqa: S608
con.sql(f"CREATE TABLE parcels AS SELECT * FROM ST_Read('{parcel_files[0].resolve()!s}');") # noqa: S608
con.execute(
"""PREPARE insert_buildings_stmt AS
INSERT INTO buildings SELECT *
FROM ST_Read($1)
WHERE oid NOT IN (SELECT oid FROM parcels);"""
)
con.execute(
"""PREPARE insert_parcels_stmt AS
INSERT INTO parcels SELECT *
FROM ST_Read($1)
WHERE oid NOT IN(SELECT oid FROM parcels);"""
)
Iterate over the found shapefiles and load them into DuckDB
for building_file, parcel_file in zip(building_files[1:], parcel_files[1:], strict=True):
# Insert into the existing table for subsequent shapefiles
con.execute(f"EXECUTE insert_buildings_stmt('{building_file.resolve()!s}')")
con.execute(f"EXECUTE insert_parcels_stmt('{parcel_file.resolve()!s}')")
Make the data valid
Make geometries valid
con.sql("""
UPDATE buildings SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);
UPDATE parcels SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);
""")
create indexes
con.sql("CREATE INDEX buildings_idx ON buildings USING RTREE (geom);")
con.sql("CREATE INDEX parcels_idx ON parcels USING RTREE (geom);")
print(f"DuckDB: Loading data takes: {(time.time() - start):.0f} s.")
Perform intersection
time_intersection = time.time()
con.sql("""
CREATE TABLE buildings_intersection AS
SELECT ST_Intersection(buildings.geom, parcels.geom) as geom,
buildings.oid AS building_oid,
parcels.oid AS parcel_oid
FROM buildings, parcels
WHERE ST_Intersects(buildings.geom, parcels.geom);
""")
Drop the indexes and unnecessary columns
con.sql("""
DROP INDEX buildings_idx;
DROP INDEX parcels_idx;
ALTER TABLE buildings DROP COLUMN geom;
ALTER TABLE parcels DROP COLUMN geom;
""")
Create final intersections table
con.sql("""
CREATE VIEW intersections AS
SELECT *
FROM buildings_intersection AS bi
JOIN buildings AS bs ON bi.building_oid = bs.oid
JOIN parcels AS ps ON bi.parcel_oid = ps.oid;
""")
con.sql("""UPDATE buildings_intersection SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);""")
print(f"DuckDB: Intersection takes: {(time.time() - time_intersection):.0f} s.")
if not save_name:
# Save the result to a file
time_writing = time.time()
con.sql("""
COPY(SELECT * EXCLUDE geom, ST_AsWKB(geom) AS geometry
FROM intersections
WHERE ST_IsValid(geom) AND NOT ST_IsEmpty(geom))
TO 'buildings_with_parcels.fgb' WITH(FORMAT GDAL, DRIVER 'FlatGeobuf', SRS 'EPSG:25833')""")
print(f"DuckDB: Saving takes: {(time.time() - time_writing):.0f} s.")
print(f"DuckDB: Total duration: {(time.time() - start):.0f} s.")
OS:
Windows
DuckDB Version:
1.1
DuckDB Client:
Python
Hardware:
30 GB free on C:, Ryzen 5800x 40 GB RAM
Full Name:
Sebastian Heiden
Affiliation:
Delphi-IMM, But this usage of DuckDB is private
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
The text was updated successfully, but these errors were encountered: