In my data fields I have items that can be either `None` or a string, e.g. `doc.metadata["license"] = None`. But if the first document written to the file has a `None` value, parquet will set this field's type to an explicit `null`, so when the writer tries to add a new document that has a non-null value, there will be a conflict.

Explicitly: if the first item is null, the schema will automatically be set to `pa.null()`, but if the next document has a string, it will try using `pa.string()`, which then conflicts with the previous `null()`. To avoid this, we can explicitly pass a schema that sets the type to `pa.string()` to begin with (which allows null), so that the strict `null()` is avoided altogether.

So this PR simply adds the option to pass a Schema to the ParquetWriter (and the HuggingfaceWriter). Default behavior remains the same as before.