cache_compiler return value #47

Open
mdavis-xyz opened this issue Jan 3, 2025 · 5 comments

Comments

@mdavis-xyz
Contributor

mdavis-xyz commented Jan 3, 2025

I want to use nemosis to download files and convert them to parquet, then use polars to query the data. (Polars doesn't support feather.)

The docs say:

cache_compiler ... may be useful if you're using NEMOSIS to build a data cache, but then process the cache using other packages or applications

which sounds like my use case.

I've called:

cache_compiler(start_time, end_time, table, raw_data_cache, keep_csv=True, fformat='parquet')

I can see some parquet files when I look in the cache folder. However, how am I supposed to get the list of files relevant to a given table?

I could look in the folder myself, but that's a bit fiddly. Due to the recent filename change on nemweb, the filenames won't be in a consistent format. Additionally, if I remember correctly, the tables/files BIDPEROFFER and BIDOFFERPERIOD were previously named with filenames that didn't match the contents, but that was changed recently. I expect there are similar stories with other tables. I'm sure nemosis has already handled that messiness when downloading the files, so I'd like nemosis to handle it again when I want to read the files from disk.

What I would like is if cache_compiler returned the list of parquet file paths, as a list of strings. (Currently it returns nothing.)

Alternatively, if the cache was structured into one subdirectory per table, that would make listing files trivial.

@nick-gorman
Member

nick-gorman commented Jan 4, 2025

Hi Matt,

What you're saying makes sense, particularly as naming conventions have changed over time.

However, a simple workaround may be to use glob patterns to query the cache.

Does using something like the code below solve the problem?

df = pl.scan_parquet("path/to/files/*BIDOFFERPERIOD*.parquet")

Or you might be able to use a glob pattern to get all the file names with BIDOFFERPERIOD in them, and then iterate through them.
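The second suggestion (collect the matching file names, then iterate over them) could be sketched with the standard library alone. The helper name `list_cache_files` is hypothetical and not part of NEMOSIS, and the sketch assumes that every cached file for a table has the table name somewhere in its file name.

```python
from pathlib import Path


def list_cache_files(cache_dir: str, table: str, suffix: str = ".parquet") -> list[str]:
    """Return the sorted paths of cache files whose name contains `table`.

    Hypothetical helper, not a NEMOSIS function. Assumes the table name
    appears somewhere in the file name of every cached file for that table.
    """
    pattern = f"*{table}*{suffix}"
    # Path.glob only searches `cache_dir` itself for this pattern (no recursion).
    return sorted(str(p) for p in Path(cache_dir).glob(pattern))
```

Usage would look like `list_cache_files("path/to/files", "BIDOFFERPERIOD")`. Note that plain substring matching can over-match when one table name is contained in another, so the pattern may need tightening for some tables.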

@mdavis-xyz
Contributor Author

Yes that's a good workaround! I forgot about glob patterns.

I'll maybe write the code for this new feature anyway, some time in the next few weeks.

@mdavis-xyz
Contributor Author

Just FYI for anyone else using polars + nemosis, apparently polars does support feather. The function is scan_ipc. (I was expecting scan_feather.)

@nick-gorman
Member

> Yes that's a good workaround! I forgot about glob patterns.
>
> I'll maybe write the code for this new feature anyway, some time in the next few weeks.

Sounds good, definitely would be a handy feature to have!

@mdavis-xyz
Contributor Author

Just FYI, the glob pattern workaround doesn't work. Polars' scan_parquet function can't read in multiple files at once if the schemas are different (which is the case for most AEMO tables over a long time span). Either you get a panic or some columns are dropped. Instead you need to scan each one separately and pl.concat(how='diagonal_relaxed') them together.

I'll get around to writing this PR soon.
