
[FEA] Options to validate JSON fields #15222

Open
Tracked by #9
revans2 opened this issue Mar 4, 2024 · 2 comments
Labels
cuIO cuIO issue feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Mar 4, 2024

Is your feature request related to a problem? Please describe.
Apache Spark optionally validates several things according to the JSON spec that the CUDF parser does not currently validate.

https://www.json.org/json-en.html

The reason this is a problem is that Spark will return a null for any JSON input that violates the spec. So if any part of a row is not valid, we need a way to make sure that we return a null for that row.

Ideally we want to do something like #14996, which is a huge performance win for us, or to ask CUDF to return nested types as strings for us. If CUDF does not do the validation in those cases, we will not even see that data and will end up returning an incorrect value. But even without this there is some string validation that involves escape sequences, and we cannot validate it ourselves because CUDF has already processed the escape sequences in many cases.

There are a few places where CUDF is not validating the JSON, and the gaps appear to be mostly in values.

According to the spec, a value that is not a string, object, or array must be true, false, null, or a number. It appears that CUDF accepts most unquoted values as valid, although spaces in the middle of an entry appear to make it invalid.
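The spec rule above can be sketched as a small CPU-side check in Python, assuming the unquoted value has already been tokenized out of the document (the real validation happens on-GPU at the token level):

```python
# Sketch: spec-level check for an unquoted JSON value token.
# A non-string/object/array value must be true, false, null, or a number.
import re

# Number grammar from json.org (strict, no leading zeros).
_NUMBER_RE = re.compile(r'^-?(?:[1-9][0-9]*|0)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')

def is_valid_unquoted_value(token: str) -> bool:
    return token in ('true', 'false', 'null') or bool(_NUMBER_RE.match(token))
```

A token like `tru e` (a space in the middle) fails both the keyword and number checks and is therefore rejected.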

Spark does not have any configs to enable/disable this type of validation.

Again according to the spec a number should match the regular expression "^-?(?:(?:[1-9][0-9]*)|0)(?:\.[0-9]+)?(?:[eE][\-\+]?[0-9]+)?$"
Spark does have a few options related to numbers.

  1. They have an option to enable leading zeros, which changes the regular expression to look more like "^-?[0-9]+(?:\.[0-9]+)?(?:[eE][\-\+]?[0-9]+)?$". This is not on by default so it is okay if CUDF does not try to support this, but I did want to call it out.
  2. Spark also has an option to include NaN, "+INF", "-INF", "+Infinity", "Infinity", and "-Infinity". Sadly this is on by default. Not sure if we could just include an allow list similar to how CSV handles boolean values.
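The two number regexes above, plus an allow list for the special values in option 2, can be combined into one hedged Python sketch. The flag names here are invented for illustration (they loosely mirror Spark's allowNumericLeadingZeros and allowNonNumericNumbers options):

```python
import re

# Strict grammar from the JSON spec (no leading zeros).
STRICT = re.compile(r'^-?(?:[1-9][0-9]*|0)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')
# Relaxed grammar when leading zeros are allowed.
LEADING_ZEROS = re.compile(r'^-?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')
# Special values Spark accepts by default.
NON_NUMERIC = {'NaN', '+INF', '-INF', '+Infinity', 'Infinity', '-Infinity'}

def is_valid_number(tok, allow_leading_zeros=False, allow_non_numeric=True):
    if allow_non_numeric and tok in NON_NUMERIC:
        return True
    pattern = LEADING_ZEROS if allow_leading_zeros else STRICT
    return bool(pattern.match(tok))
```

With the defaults this matches Spark's out-of-the-box behavior: `007` is rejected but `NaN` is accepted.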

According to the JSON spec, a quoted string allows only a very small set of things to be escaped with a backslash: ", \, /, b, f, n, r, t, and u followed by 4 hex digits. Spark has a config to disable this check and allow escaping of any character, including \u without the hex digits. As this is enabled by default in Spark, we are fine if this check is not implemented, but I wanted to document it.
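The strict escape rule can be sketched as a scan over the raw (unprocessed) string body; this is only a CPU illustration of the spec, not how CUDF implements it:

```python
import re

# Spec-legal escapes inside a quoted string: \" \\ \/ \b \f \n \r \t \uXXXX
_ESCAPE_RE = re.compile(r'\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})')

def escapes_are_valid(raw: str) -> bool:
    """raw is the string body with escape sequences still unprocessed."""
    i = 0
    while i < len(raw):
        if raw[i] == '\\':
            m = _ESCAPE_RE.match(raw, i)
            if not m:       # backslash not starting a legal escape
                return False
            i = m.end()
        else:
            i += 1
    return True
```

Note this only works on the raw input; as mentioned above, once escape sequences have been processed the information needed for this check is gone.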

The JSON spec also says that a quoted string cannot contain "control characters". Here a control character appears to be anything between \u0000 and \u001f inclusive. Spark does enforce this by default, but it varies by the JSON command used; for example, get_json_object has the check disabled. This is something we will eventually need support for.
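The control-character rule is simple to state in code; again, this is just a sketch of the spec applied to the raw string body:

```python
def has_unescaped_control_chars(raw: str) -> bool:
    """Per the spec, literal chars in U+0000..U+001F may not appear
    unescaped inside a quoted string (the escaped forms like \\t are fine)."""
    return any(ord(c) <= 0x1F for c in raw)
```

A literal tab in the body is invalid, while the two-character escape sequence `\t` is not a control character and passes.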

Describe the solution you'd like
I would like a few configs for the JSON reader that would let us pass in options to enable/disable validation based on things similar to what Spark does today.

We already support this, more or less, for single-quoted strings, and it would be great to extend it to include validation of numbers with/without leading zeros and with/without an allow list of special cases, as well as validation of unescaped control characters.
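Spark exposes the behaviors discussed above as boolean reader options (allowNumericLeadingZeros, allowNonNumericNumbers, allowUnquotedControlChars, allowBackslashEscapingAnyCharacter, allowSingleQuotes). A cuDF-side options bundle mirroring them might look like the sketch below; the class and field names are hypothetical, invented for illustration, and not an actual cuDF API:

```python
from dataclasses import dataclass

@dataclass
class JsonValidationOptions:
    # Hypothetical flag names, loosely mirroring Spark's JSON reader options.
    validate_values: bool = True                 # enforce true/false/null/number rule
    allow_numeric_leading_zeros: bool = False    # cf. Spark allowNumericLeadingZeros
    allow_non_numeric_numbers: bool = True       # cf. Spark allowNonNumericNumbers
    allow_unquoted_control_chars: bool = False   # cf. Spark allowUnquotedControlChars
    allow_escaping_any_character: bool = True    # cf. Spark allowBackslashEscapingAnyCharacter
```

The defaults shown follow Spark's defaults as described in this issue, so that a plugin could pass the Spark session config through unchanged.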

@revans2
Contributor Author

revans2 commented Mar 14, 2024

I should add that I have a bit more information about backslash escaping.

#15303

Specifically, if we enable allowing single quotes, then escaping a single quote (\') also works in addition to \", as it does if we allow escaping any character. If both of these are disabled, then \' is invalid no matter what string it appears in.
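The interplay of the two flags reduces to a one-line rule; this is just a compact restatement of the observation above, with illustrative parameter names:

```python
def backslash_single_quote_is_valid(allow_single_quotes: bool,
                                    allow_escaping_any_char: bool) -> bool:
    """\\' is legal if single quotes are allowed OR any character may be
    escaped; with both flags off it is invalid in every string."""
    return allow_single_quotes or allow_escaping_any_char
```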

rapids-bot bot pushed a commit that referenced this issue Sep 11, 2024
Addresses part of #15222
This change adds a validation stage in the JSON reader at the token level. If any validation fails in a row, the entire row is set to null.

- [x] validation functor - implement spark validation rules. (@revans2 implemented all validation rules)
- [x] move output iterator to thrust. (already merged by NVIDIA/cccl#2282)
- [x] Fix failing tests and infer data type for Float.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15968
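The row-nulling semantics the PR describes can be illustrated on the CPU with Python's json module standing in for the GPU tokenizer; this is only an analogy for the behavior, not the cuDF implementation:

```python
import json

def parse_json_lines(lines):
    """If any validation fails in a row, the whole row becomes null,
    mirroring the token-level validation stage added by #15968."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:   # any spec violation nulls the entire row
            rows.append(None)
    return rows
```

For example, a row containing the leading-zero number `01` violates the spec, so the entire row comes back as null rather than a partial record.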
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Nov 12, 2024
@shrshi
Contributor

shrshi commented Nov 29, 2024

Progress summary:
