Minor update to documentation #97

Open
D3SL opened this issue Oct 14, 2024 · 1 comment

Comments


D3SL commented Oct 14, 2024

For "other tools": Polars, roughly Python's counterpart to data.table but with a more tidyverse-like syntax, supports Parquet directly, including "lazy" operations.

For limitations: This may apply to Parquet in general, but loading a Parquet file with Arrow seems to keep the file locked. Attempting to write to that same file from Python while the data frame still exists in R causes problems. Sometimes the lock is so persistent that I need to rm() the object in R, force a garbage collection, and then delete the Parquet file before new data will load into R again.

If nanoparquet doesn't have this limitation, that would be a very significant quality-of-life improvement; if it does, it's worth mentioning.
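
To illustrate the Python side of the scenario above, here is a minimal sketch of a writer that tolerates a temporarily locked file: it writes to a temporary file in the same directory and then tries to swap it in with `os.replace`, retrying when `PermissionError` is raised. The helper name `replace_when_unlocked` and its parameters are hypothetical, and the assumption that the lock surfaces as `PermissionError` reflects typical Windows behaviour, not anything confirmed for Arrow or nanoparquet. If the reader process keeps the file open for as long as the data frame exists, the retries will simply run out.

```python
import os
import tempfile
import time


def replace_when_unlocked(data: bytes, target: str,
                          retries: int = 5, delay: float = 0.5) -> bool:
    """Hypothetical helper: write `data` next to `target`, then swap it in.

    Returns True once the swap succeeds, False if `target` stayed locked
    for all attempts.
    """
    directory = os.path.dirname(os.path.abspath(target))
    # Keep the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        for _ in range(retries):
            try:
                os.replace(tmp_path, target)
                return True
            except PermissionError:
                # Assumption: this is how a lock held by another process
                # (e.g. an R session) shows up on Windows.
                time.sleep(delay)
        return False
    finally:
        # Clean up the temp file if the swap never happened.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Here `data` would be the already-serialized Parquet bytes produced by whichever writer is in use; the sketch itself does not depend on any Parquet library.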

@gaborcsardi
Member

Thanks! It seems to me that Polars internally uses Arrow for at least some of its Parquet functionality. Am I missing something here?
https://github.com/pola-rs/polars/blob/ca21bd7f06c88954e9c1d647c35413fec6121d22/crates/polars-parquet/Cargo.toml#L17
We can still mention Polars of course.

As for the locking, we don't explicitly lock the file. Windows might lock it implicitly while we are reading from it, but after that it should be unlocked. Maybe Arrow only locks it if you are using ALTREP, which is the default now? In any case, this seems too technical to me to mention up front in a README.
