
[FEA] Options to validate JSON fields #15222

Open
Tracked by #9
revans2 opened this issue Mar 4, 2024 · 2 comments
Labels
cuIO cuIO issue feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Mar 4, 2024

Is your feature request related to a problem? Please describe.
Apache Spark optionally validates several things according to the JSON spec that the CUDF parser does not currently validate.

https://www.json.org/json-en.html

The reason this is a problem is that Spark will return a null for any JSON input that violates the spec. So if any part of a row is not valid, we need a way to make sure that we return a null for that row.

Ideally we want to do something like #14996, which is a huge performance win for us, or to ask CUDF to return nested types as strings for us. If CUDF does not do the validation in those cases, we will not even see that data and will end up returning an incorrect value. But even without this there is some string validation that involves escape sequences, and we cannot validate it ourselves because CUDF has already processed the escape sequences in many cases.

There are a few places where CUDF is not validating the JSON, and the gaps appear to be mostly in values.

According to the spec, a value that is not a string, object, or array must be true, false, null, or a number. It appears that CUDF accepts most unquoted values as valid, although spaces in the middle of an entry appear to make it invalid.
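The spec rule above can be sketched as a small CPU-side check in Python, assuming the unquoted value has already been tokenized out of the document (the real validation happens on-GPU at the token level):

```python
# Sketch: spec-level check for an unquoted JSON value token.
# A non-string/object/array value must be true, false, null, or a number.
import re

# Number grammar from json.org (strict, no leading zeros).
_NUMBER_RE = re.compile(r'^-?(?:[1-9][0-9]*|0)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')

def is_valid_unquoted_value(token: str) -> bool:
    return token in ('true', 'false', 'null') or bool(_NUMBER_RE.match(token))
```

A token like `tru e` (a space in the middle) fails both the keyword and number checks and is therefore rejected.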

Spark does not have any configs to enable/disable this type of validation.

Again according to the spec a number should match the regular expression "^-?(?:(?:[1-9][0-9]*)|0)(?:\.[0-9]+)?(?:[eE][\-\+]?[0-9]+)?$"
Spark does have a few options related to numbers.

  1. They have an option to enable leading zeros, which changes the regular expression to look more like "^-?[0-9]+(?:\.[0-9]+)?(?:[eE][\-\+]?[0-9]+)?$". This is not on by default so it is okay if CUDF does not try to support this, but I did want to call it out.
  2. Spark also has an option to include NaN, "+INF", "-INF", "+Infinity", "Infinity", and "-Infinity". Sadly this is on by default. Not sure if we could just include an allow list similar to how CSV handles boolean values.
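The two number regexes above, plus an allow list for the special values in option 2, can be combined into one hedged Python sketch. The flag names here are invented for illustration (they loosely mirror Spark's allowNumericLeadingZeros and allowNonNumericNumbers options):

```python
import re

# Strict grammar from the JSON spec (no leading zeros).
STRICT = re.compile(r'^-?(?:[1-9][0-9]*|0)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')
# Relaxed grammar when leading zeros are allowed.
LEADING_ZEROS = re.compile(r'^-?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?$')
# Special values Spark accepts by default.
NON_NUMERIC = {'NaN', '+INF', '-INF', '+Infinity', 'Infinity', '-Infinity'}

def is_valid_number(tok, allow_leading_zeros=False, allow_non_numeric=True):
    if allow_non_numeric and tok in NON_NUMERIC:
        return True
    pattern = LEADING_ZEROS if allow_leading_zeros else STRICT
    return bool(pattern.match(tok))
```

With the defaults this matches Spark's out-of-the-box behavior: `007` is rejected but `NaN` is accepted.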

According to the JSON spec, a quoted string allows only a very small set of things to be escaped with a backslash: ", \, /, b, f, n, r, t, and u followed by 4 hex digits. Spark has a config to disable this check and allow escaping of any character, including \u without the hex digits. As this is enabled by default in Spark, we are fine if this check is not implemented, but I wanted to document it.
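The strict escape rule can be sketched as a scan over the raw (unprocessed) string body; this is only a CPU illustration of the spec, not how CUDF implements it:

```python
import re

# Spec-legal escapes inside a quoted string: \" \\ \/ \b \f \n \r \t \uXXXX
_ESCAPE_RE = re.compile(r'\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})')

def escapes_are_valid(raw: str) -> bool:
    """raw is the string body with escape sequences still unprocessed."""
    i = 0
    while i < len(raw):
        if raw[i] == '\\':
            m = _ESCAPE_RE.match(raw, i)
            if not m:       # backslash not starting a legal escape
                return False
            i = m.end()
        else:
            i += 1
    return True
```

Note this only works on the raw input; as mentioned above, once escape sequences have been processed the information needed for this check is gone.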

The JSON spec also says that a quoted string cannot contain "control characters". Here a control character appears to be anything between \u0000 and \u001f inclusive. Spark does enforce this by default, but it varies by the JSON command used; for example, get_json_object has the check disabled. This is something we will eventually need support for.
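The control-character rule is simple to state in code; again, this is just a sketch of the spec applied to the raw string body:

```python
def has_unescaped_control_chars(raw: str) -> bool:
    """Per the spec, literal chars in U+0000..U+001F may not appear
    unescaped inside a quoted string (the escaped forms like \\t are fine)."""
    return any(ord(c) <= 0x1F for c in raw)
```

A literal tab in the body is invalid, while the two-character escape sequence `\t` is not a control character and passes.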

Describe the solution you'd like
I would like a few configs for the JSON reader that would let us pass in options to enable/disable validation based on things similar to what Spark does today.

We already support this, more or less, for single-quoted strings, and it would be great to extend it to include validation of numbers with/without leading zeros and with/without an allow list of special cases, as well as validation of unescaped control characters.
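Spark exposes the behaviors discussed above as boolean reader options (allowNumericLeadingZeros, allowNonNumericNumbers, allowUnquotedControlChars, allowBackslashEscapingAnyCharacter, allowSingleQuotes). A cuDF-side options bundle mirroring them might look like the sketch below; the class and field names are hypothetical, invented for illustration, and not an actual cuDF API:

```python
from dataclasses import dataclass

@dataclass
class JsonValidationOptions:
    # Hypothetical flag names, loosely mirroring Spark's JSON reader options.
    validate_values: bool = True                 # enforce true/false/null/number rule
    allow_numeric_leading_zeros: bool = False    # cf. Spark allowNumericLeadingZeros
    allow_non_numeric_numbers: bool = True       # cf. Spark allowNonNumericNumbers
    allow_unquoted_control_chars: bool = False   # cf. Spark allowUnquotedControlChars
    allow_escaping_any_character: bool = True    # cf. Spark allowBackslashEscapingAnyCharacter
```

The defaults shown follow Spark's defaults as described in this issue, so that a plugin could pass the Spark session config through unchanged.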

@revans2
Contributor Author

revans2 commented Mar 14, 2024

I should add that I have a bit more information about backslash escaping.

#15303

Specifically, if we enable allowing single quotes, then escaping a single quote (\') also works in addition to \", as it does if we allow escaping any character. If both of these are disabled, then \' is invalid no matter what string it appears in.
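The interplay of the two flags reduces to a one-line rule; this is just a compact restatement of the observation above, with illustrative parameter names:

```python
def backslash_single_quote_is_valid(allow_single_quotes: bool,
                                    allow_escaping_any_char: bool) -> bool:
    """\\' is legal if single quotes are allowed OR any character may be
    escaped; with both flags off it is invalid in every string."""
    return allow_single_quotes or allow_escaping_any_char
```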

rapids-bot bot pushed a commit that referenced this issue Sep 11, 2024
Addresses part of #15222
This change adds a validation stage in the JSON reader at the token level. If any validation fails in a row, the entire row is set to null.

- [x] validation functor - implement spark validation rules. (@revans2 implemented all validation rules)
- [x] move output iterator to thrust. (already merged by NVIDIA/cccl#2282)
- [x] Fix failing tests and infer data type for Float.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15968
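The row-nulling semantics the PR describes can be illustrated on the CPU with Python's json module standing in for the GPU tokenizer; this is only an analogy for the behavior, not the cuDF implementation:

```python
import json

def parse_json_lines(lines):
    """If any validation fails in a row, the whole row becomes null,
    mirroring the token-level validation stage added by #15968."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:   # any spec violation nulls the entire row
            rows.append(None)
    return rows
```

For example, a row containing the leading-zero number `01` violates the spec, so the entire row comes back as null rather than a partial record.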
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Nov 12, 2024
@shrshi
Contributor

shrshi commented Nov 29, 2024

Progress summary:
