cache_compiler return value #47

mdavis-xyz · 2025-01-03T19:16:40Z

I want to use nemosis to download files and convert them to parquet, then use polars to query the data. (Polars doesn't support feather.)

The docs say:

cache_compiler ... may be useful if you're using NEMOSIS to build a data cache, but then process the cache using other packages or applications

which sounds like my use case.

I've called:

cache_compiler(start_time, end_time, table, raw_data_cache, keep_csv=True, fformat='parquet')

I can see some parquet files when I look in the cache folder. However, how am I supposed to get the list of relevant files?

I can try to look in the folder myself. However that's a bit fiddly. Due to the recent filename change on nemweb, the filenames won't be in a consistent format. Additionally, if I remember correctly, the tables/files BIDPEROFFER and BIDOFFERPERIOD were previously named with filenames that didn't match the contents, but that was changed recently. I expect there are similar stories with other tables. I'm sure nemosis has already handled that messiness when downloading the files. So I'd like it if nemosis can handle it once more when I want to read files from disk.

What I would like is if cache_compiler returned the list of parquet file paths, as a list of strings. (Currently it returns nothing.)

Alternatively, if the cache was structured into one subdirectory per table, that would make listing files trivial.


nick-gorman · 2025-01-04T01:50:09Z

Hi Matt,

What you're saying makes sense, particularly as naming conventions have changed over time.

However, a simple workaround may be to use glob patterns to query the cache.

Does using something like the code below solve the problem?

df = pl.scan_parquet("path/to/files/*BIDOFFERPERIOD*.parquet")

Or you might be able to use a glob pattern to get all the file names with BIDOFFERPERIOD in them, and then iterate through them.
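A minimal sketch of that second suggestion, using only the standard library. The function name `list_cached_files` and its parameters are made up for illustration and are not part of nemosis. It assumes all cached files sit directly in one folder and that the table name appears somewhere in each filename.

```python
from pathlib import Path


def list_cached_files(cache_dir, table, suffix=".parquet"):
    """Return the sorted paths of cached files whose names contain `table`.

    Hypothetical helper, not a nemosis function. It relies on a plain
    substring match, so a table whose name is contained in another
    table's name would also pick up that other table's files.
    """
    pattern = f"*{table}*{suffix}"
    # Path.glob only searches cache_dir itself here, not subdirectories.
    return sorted(str(path) for path in Path(cache_dir).glob(pattern))
```

For example, `list_cached_files("path/to/files", "BIDOFFERPERIOD")` would return the matching parquet paths as a list of strings, which can then be iterated over.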

mdavis-xyz · 2025-01-05T07:23:04Z

Yes that's a good workaround! I forgot about glob patterns.

I'll maybe write the code for this new feature anyway, some time in the next few weeks.

mdavis-xyz · 2025-01-05T08:08:22Z

Just FYI for anyone else using polars + nemosis, apparently polars does support feather. The function is scan_ipc. (I was expecting scan_feather.)

nick-gorman · 2025-01-06T02:46:40Z

> Yes that's a good workaround! I forgot about glob patterns.
>
> I'll maybe write the code for this new feature anyway, some time in the next few weeks.

Sounds good, definitely would be a handy feature to have!

mdavis-xyz · 2025-01-09T08:51:07Z

Just FYI, the glob pattern workaround doesn't work. Polars' scan_parquet function can't read in multiple files at once if the schemas are different (which is the case for most AEMO tables over a long time span). Either you get a panic or some columns are dropped. Instead you need to scan each one separately and pl.concat(how='diagonal_relaxed') them together.
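The per-file approach described above might look roughly like the sketch below. The function name `scan_files_with_differing_schemas` is invented for illustration. The only polars calls used are `pl.scan_parquet` and `pl.concat(..., how="diagonal_relaxed")`, and it assumes a polars version recent enough to support that `how` value.

```python
def scan_files_with_differing_schemas(paths):
    """Lazily scan each parquet file and combine them into one LazyFrame.

    Hypothetical helper, not part of nemosis or polars.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("no parquet files to scan")

    # Imported inside the function so this module can still be imported
    # where polars is not installed.
    import polars as pl

    # One LazyFrame per file, so each file keeps its own schema.
    lazy_frames = [pl.scan_parquet(path) for path in paths]

    # "diagonal" fills columns missing from a file with nulls;
    # "relaxed" casts mismatched column types to a common supertype.
    return pl.concat(lazy_frames, how="diagonal_relaxed")
```

The result is still lazy, so filters and column selections can be added before calling `.collect()`.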

I'll get around to writing this PR soon.

mdavis-xyz mentioned this issue Jan 6, 2025

Add developer instructions #49

Open
