Allow custom parquet schema #330

BramVanroy · 2025-01-26T13:51:04Z

In my data fields I have items that can be either None or a string, e.g. doc.metadata["license"] = None. If the first document written to the file has a None value, the inferred schema sets this field's type to an explicit null, so when the writer later tries to add a document with a non-null value, the types conflict.

Explicitly: if the first item is null, the schema will automatically set the field to pa.null(), but if the next document has a string, the writer will try to use pa.string(), which conflicts with the earlier pa.null(). To avoid this, we can explicitly pass a schema that sets the type to pa.string() from the start (a string field still allows nulls), so that the strict pa.null() is avoided altogether.

So this PR simply adds the option to pass a Schema to the ParquetWriter (and the HuggingfaceWriter). Default behavior remains the same as before.

guipenedo · 2025-01-26T15:35:59Z

Since pyarrow is an optional dependency, the imports need to be local (see the existing pyarrow imports).

BramVanroy · 2025-01-26T15:44:31Z

Fixed - though now the typing is very generic (just Any). Not sure if/how that can be improved though.

allow custom schema

f95d532

fix pa import

17fdfe4

add schema to batch writing

4f6d584

guipenedo merged commit b105dcd into huggingface:main Jan 30, 2025
3 of 4 checks passed
