[FEA] Options to validate JSON fields #15222
This was referenced Mar 13, 2024
I should add that I have a bit more information about backslash escaping. Specifically, if we enable allowing single quotes, escaping single quotes also works.
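The comment above can be sketched as a rule over the escape set: when single-quoted strings are enabled, `\'` joins the set of valid escapes. This is a hypothetical illustration of the described behavior, not a cuDF API:

```python
# Standard JSON escape characters (plus \uXXXX, handled separately).
STANDARD_ESCAPES = set('"\\/bfnrt')

def allowed_escapes(allow_single_quotes: bool) -> set:
    """Sketch: with single quotes enabled, \\' is also a valid escape."""
    return STANDARD_ESCAPES | ({"'"} if allow_single_quotes else set())
```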
rapids-bot bot pushed a commit that referenced this issue on Sep 11, 2024:
Addresses part of #15222. This change adds a validation stage in the JSON reader at the token level. If any validation fails in a row, the entire row is made null.

- [x] validation functor - implement Spark validation rules (@revans2 implemented all validation rules)
- [x] move output iterator to thrust (already merged by NVIDIA/cccl#2282)
- [x] fix failing tests and infer data type for Float

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Bradley Dice (https://github.com/bdice)
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)

URL: #15968
Is your feature request related to a problem? Please describe.
Apache Spark optionally validates several things according to the JSON spec that the CUDF parser does not currently validate.
https://www.json.org/json-en.html
The reason this is a problem is that Spark will return a null for any JSON input that violates the spec. So if there is a row where part of it is not valid, we need a way to make sure that we return a null for that row.
Ideally we want to do something like #14996, which is a huge performance win for us, or ask for CUDF to return nested types as strings for us. If CUDF does not do the validation in those cases, we will not even see that data and will end up returning an incorrect value. But even without this there is some string validation that involves escape sequences, and we cannot validate it ourselves because CUDF has already processed the escape sequences in many cases.
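The row-level null semantics described above can be sketched in a few lines. This is a rough illustration, not Spark's or cuDF's actual implementation (and note that Python's `json` module is itself lenient about `NaN`/`Infinity`, which is part of why a dedicated validation stage is needed):

```python
import json

def parse_rows(lines):
    """Sketch of Spark-style behavior: a row that fails JSON
    validation yields null for the entire row."""
    out = []
    for line in lines:
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            out.append(None)  # whole row becomes null on any violation
    return out
```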
There are a few places where CUDF is not validating the JSON, and it appears to mostly be in values. According to the spec, a value that is not a string, object, or array must be `true`, `false`, `null`, or a number. It appears that CUDF accepts most unquoted values as valid; a space in the middle of an entry appears to make it invalid. Spark does not have any configs to enable/disable this type of validation.
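A minimal sketch of the unquoted-value rule described above (helper name is hypothetical): a token is valid only if it is exactly one of the three literals or matches the JSON number grammar, so typos like `truee` or a space mid-token are rejected:

```python
import re

# JSON number grammar from the spec (json.org)
_NUMBER = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?')

def is_valid_unquoted_value(token: str) -> bool:
    """Sketch: an unquoted JSON value must be true, false, null,
    or a spec-conforming number."""
    if token in ("true", "false", "null"):
        return True
    return _NUMBER.fullmatch(token) is not None
```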
Again according to the spec, a number should match the regular expression `^-?(?:(?:[1-9][0-9]*)|0)(?:\.[0-9]+)?(?:[eE][\-\+]?[0-9]+)?$`.
Spark does have a few options related to numbers, such as accepting the non-numeric values `NaN`, `+INF`, `-INF`, `+Infinity`, `Infinity`, and `-Infinity`. Sadly this is on by default. Not sure if we could just include an allow list similar to how CSV handles boolean values.
"
,\
,/
,b
,f
,n
,r
,t
, andu
followed by 4 hex digits. Spark has a config to disable this and allows escaping of any character, including\u
without the hex digits. As this is enabled by default in Spark we are fine if this check is not implemented, but I wanted to document it.The JSON Spec also says that a quoted string cannot have "control character" in it. Here a control character appears to be anything between
\u0000
and\u001f
inclusive. Spark does enforce this by default, but it varies by the JSON command used, likeget_json_object
has the check disabled. This is something that we eventually will need support for.
Describe the solution you'd like
I would like a few configs for the JSON reader that would let us pass in options to enable/disable validation based on things similar to what Spark does today.
We already support this, more or less, for single-quoted strings, and it would be great to extend it to include validation of numbers with/without leading zeros and with/without an allow list of special cases, and validation of unescaped control characters.
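To make the requested string-level rules concrete, here is a rough sketch of a validator for the body of a quoted string, with toggles loosely modeled on the Spark behaviors described above. The function and option names are hypothetical, not a proposed cuDF API:

```python
_VALID_ESCAPES = set('"\\/bfnrt')  # plus \uXXXX, handled below
_HEX = set('0123456789abcdefABCDEF')

def is_valid_json_string_body(s: str,
                              allow_unescaped_control_chars: bool = False,
                              allow_any_escape: bool = False) -> bool:
    """Sketch: validate a quoted-string body against the JSON spec,
    with lenient modes mirroring the Spark options described above."""
    i = 0
    while i < len(s):
        c = s[i]
        if c == '\\':
            if i + 1 >= len(s):
                return False          # dangling backslash
            e = s[i + 1]
            if e == 'u':
                hexpart = s[i + 2:i + 6]
                if len(hexpart) == 4 and all(h in _HEX for h in hexpart):
                    i += 6
                    continue
                if allow_any_escape:  # lenient mode: \u without 4 hex digits
                    i += 2
                    continue
                return False
            if e in _VALID_ESCAPES or allow_any_escape:
                i += 2
                continue
            return False              # escape outside the spec's allow list
        if '\x00' <= c <= '\x1f' and not allow_unescaped_control_chars:
            return False              # unescaped control character
        i += 1
    return True
```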