2gb parquet file takes 100s to process, even on second attempt (on main) #13785
Comments
Hi @TheBuilderJR thanks for opening the issue. Is there a way we could reproduce your results? Let me try to address some of it:
This is an expensive query because it has to:
DataFusion is a stateless query engine by default and does not cache anything between queries, so the second run is often not much faster than the first.
Recent versions mostly improve aggregation |
Thanks @Dandandan! Alas, the data is proprietary, but I think if you just inserted random data with a timestamp, the performance on an M3 MacBook Air would probably be similar.
Is there a way to declare a default ordering on certain fields? I know ClickHouse uses this to skip a lot of unnecessary processing. Can we do the same? |
You can use the WITH ORDER clause of CREATE EXTERNAL TABLE: https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
For example:
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
-- this line tells DataFusion the data in the file is already ordered by (c2 ASC)
WITH ORDER (c2 ASC)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true'); |
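To illustrate why declaring an ordering can help, here is a minimal Python sketch of the idea. It is not DataFusion code: the function `order_by` and its `declared_order` parameter are hypothetical names invented for this illustration. It only shows the general principle that an engine can skip a sort when the input is declared to be ordered on the requested key.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]


def order_by(rows: List[Row], key: str,
             declared_order: Optional[str] = None) -> Tuple[List[Row], bool]:
    """Return (rows ordered by `key`, whether a sort was actually performed).

    `declared_order` plays the role of WITH ORDER in this sketch: it is a
    promise about how the input is already laid out, trusted without checks.
    """
    if declared_order == key:
        # The input is declared sorted on the requested key: skip the sort.
        return rows, False
    # No usable declaration: fall back to a full O(n log n) sort.
    return sorted(rows, key=lambda r: r[key]), True


rows = [{"c2": 1}, {"c2": 3}, {"c2": 7}]  # already ordered by c2 ascending
_, sorted_without_hint = order_by(rows, "c2")
_, sorted_with_hint = order_by(rows, "c2", declared_order="c2")
print(sorted_without_hint, sorted_with_hint)
```

Because the sketch trusts the declaration, a wrong declaration silently yields wrongly ordered output. It is worth checking the DataFusion documentation for how WITH ORDER behaves when the file is not actually sorted.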
Thanks @alamb. Is there a way to do this via code? This is currently how I write my parquet files via datafusion:
|
I think you can do it programmatically by creating a And set |
@alamb oh I meant on the write path. I don't see anything in the dataframe.write_parquet API https://docs.rs/datafusion/latest/datafusion/config/struct.TableParquetOptions.html |
Perhaps https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.sort_by would work? |
Hmm... it looks like |
@TheBuilderJR if you are willing to file a ticket for that feature, I suspect someone would implement it pretty quickly |
I created an issue for the improvement and will open a PR soon, thanks. |
Thank you for the quick turnaround. I've rebased on top of your changes, but query times still seem to grow as the data size grows for a relatively simple ordered query.
Here is the code for my read path
Here is the code for my write path
Is this expected? I would have imagined the cost should be constant since you can use the sort constraint to always scan a constant number of rows. |
Nvm this seems to work. It plateaus at a certain point. Thanks everyone! |
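The constant-cost expectation discussed above can be sketched in plain Python. This is an illustration only, not DataFusion's implementation, and it assumes the ordered query carries a LIMIT, which the thread does not show. The helpers `counting`, `top_k_unsorted`, and `top_k_presorted` are hypothetical names. The sketch counts how many rows each strategy pulls from the source.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def counting(rows: Iterable[int], counter: List[int]) -> Iterator[int]:
    """Yield rows unchanged while counting how many were pulled from the source."""
    for row in rows:
        counter[0] += 1
        yield row


def top_k_unsorted(rows: Iterable[int], k: int) -> Tuple[List[int], int]:
    """Smallest k values from input of unknown order: every row must be read."""
    counter = [0]
    result = heapq.nsmallest(k, counting(rows, counter))
    return result, counter[0]


def top_k_presorted(rows: Iterable[int], k: int) -> Tuple[List[int], int]:
    """Smallest k values from input known to be ascending: only k rows are read."""
    counter = [0]
    result = list(islice(counting(rows, counter), k))
    return result, counter[0]


for n in (1_000, 100_000):
    data = range(n)  # stand-in for ascending timestamps
    print(n, top_k_unsorted(data, 10)[1], top_k_presorted(data, 10)[1])
```

In this sketch the presorted path reads the same number of rows regardless of input size, while the unsorted path reads all of them. A real engine still has fixed per-file costs such as opening files and reading metadata.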
Thank you @TheBuilderJR for checking the merged PR. A further improvement will be included in: After that, we won't need to add the read option with the ordering info manually. When a query orders by a column that was written in sorted order, the sort order info will be loaded automatically from the Parquet metadata, so the optimizer can use it. |
Describe the bug
Based on the published benchmarks, I expected to see improvements, but I haven't seen any. Statistics are turned on in my Parquet files, and in theory the optimizations from the last few releases should be kicking in, but they don't seem to be. Is there any guide on how to debug this? Are the main optimizations used in the benchmarks still hidden behind feature flags? If so, is there a guide on how to turn on these flags to optimize for performance?
To Reproduce
Create a 2 GB file (15M rows) of random data, run SELECT * FROM table ORDER BY timestamp twice, and see that both runs take over 100s.
Expected behavior
The first run may be slow, but I expected the second run to at least be faster. Ideally the first run would also use the file statistics to run faster.
Additional context
This is consistent with past versions, but I upgraded across 3 major versions in one go and expected some sort of noticeable improvement.