Allow custom parquet schema #330

Merged
merged 3 commits into from
Jan 30, 2025

BramVanroy
Contributor

Some of my data fields can be either None or a string, e.g. doc.metadata["license"] = None. If the first document written to a file has None for such a field, the field's type in the parquet schema is set to an explicit null, so the writer hits a conflict as soon as it tries to add a document with a non-null value.

Explicitly: if the first item is null, the field's type is automatically set to pa.null(), but if the next document has a string, the writer will try to use pa.string(), which conflicts with the earlier pa.null(). To avoid this, we can explicitly pass a schema that sets the type to pa.string() from the start (which still allows nulls), so the strict pa.null() type is avoided altogether.

So this PR simply adds the option to pass a Schema to the ParquetWriter (and the HuggingfaceWriter). Default behavior remains the same as before.

@guipenedo
Collaborator

Since pyarrow is an optional dependency, the imports need to be local (see the existing pyarrow imports).

@BramVanroy
Contributor Author

Fixed, though the typing is now very generic (just Any). I'm not sure if or how that can be improved.
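
One possible way to keep pyarrow optional while still giving the parameter a precise type is to import it only under `typing.TYPE_CHECKING` and use string annotations. This is a sketch under that assumption, not what the PR merged; `SketchParquetWriter` and `to_batch` are hypothetical names used only for illustration.

```python
from typing import TYPE_CHECKING, Any, Dict, List

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so pyarrow is not needed
    # when this module is imported at runtime.
    import pyarrow as pa


class SketchParquetWriter:
    """Hypothetical stand-in for a writer class, not the actual implementation."""

    def __init__(self, schema: "pa.Schema | None" = None):
        # String annotation: never evaluated at runtime, still visible to checkers.
        self.schema = schema

    def to_batch(self, rows: List[Dict[str, Any]]) -> "pa.RecordBatch":
        # Local import keeps pyarrow optional until a batch is actually built.
        import pyarrow as pa

        return pa.RecordBatch.from_pylist(rows, schema=self.schema)
```

The class can be defined and instantiated without pyarrow installed; the import only happens when `to_batch` is called.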

@guipenedo guipenedo merged commit b105dcd into huggingface:main Jan 30, 2025
3 of 4 checks passed