Adds Threshold to RowBased checks #13

dougb · 2019-06-14T19:06:09Z

This PR adds thresholds to RowBased checks (negativeCheck, nullCheck, rangeCheck).
Thresholds let the user define the number of errors they are willing to tolerate before a check fails.
Currently, if you have a single error in a billion-row table, the check will fail.
A threshold can be represented as a number of rows, or as a percentage of the total number of rows processed.

type: negativeCheck
column: price
threshold: 2.3%

jhahnn21 · 2019-06-20T22:42:56Z

@dougb Why is this PR so big?

colindean · 2019-06-21T00:34:42Z

@jhahnn21 much of the added SLOC is test code.

phpisciuneri

just some minor things we can quickly address

phpisciuneri · 2019-06-28T14:29:16Z

README.md

+  The third section are the validators. To specify a validator, you
+first specify the type as one of the validators, then specify the
+arguments for that validator. Some of the validators support an error
+threshold. This options allows the user specify the number of errors


This option allows the user to specify...

phpisciuneri · 2019-06-28T14:30:42Z

README.md

+
+If the threshold is `< 1` it is considered a fraction of the row count. For example `0.25` would fail the check if more then `rowCount * 0.25` of the rows fail the check.
+If the threshold ends in a `%` its considered a percentage of the row count. For eample `33%` would fail the check if more then `rowCount * 0.33` of the rows fail the check.
+


👍 nice explanation
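
The rules in the README excerpt above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the object `ThresholdSketch` and the functions `maxErrors` and `checkFailed` are hypothetical names, the input is assumed to be already validated, and treating values `>= 1` as an absolute row count is an assumption based on the PR description. `BigDecimal` is used here so that a value such as `2.3%` is not disturbed by binary floating-point rounding.

```scala
object ThresholdSketch {

  /** Hypothetical helper: the largest number of failing rows that is still tolerated.
    * Assumes `threshold` is a non-negative number, optionally ending in '%'.
    */
  def maxErrors(threshold: String, rowCount: Long): Long = {
    val t = threshold.trim
    if (t.endsWith("%")) {
      // "33%" is read as rowCount * 0.33
      (BigDecimal(t.dropRight(1)) / BigDecimal(100) * BigDecimal(rowCount)).toLong
    } else {
      val v = BigDecimal(t)
      if (v < BigDecimal(1)) (v * BigDecimal(rowCount)).toLong // "0.25" is read as rowCount * 0.25
      else v.toLong                                            // "10" is read as 10 rows (assumption)
    }
  }

  /** The check fails when more rows fail than the threshold tolerates. */
  def checkFailed(errorCount: Long, threshold: String, rowCount: Long): Boolean =
    errorCount > maxErrors(threshold, rowCount)
}
```

Truncating to a whole number of rows does not change the outcome here: for a whole-number error count, "more than `rowCount * 0.25`" and "more than the truncated value" select the same cases.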

phpisciuneri · 2019-06-28T14:43:22Z

src/main/scala/com/target/data_validator/validator/NegativeCheck.scala

+    ("column", Json.fromString(column)),
+    ("threshold", Json.fromString(threshold.getOrElse("0"))),
+    ("failed", Json.fromBoolean(failed)),
+


can we remove this space :)

phpisciuneri · 2019-06-28T14:49:28Z

src/main/scala/com/target/data_validator/validator/RowBased.scala

-import io.circe.Json
-import io.circe.syntax._
+import com.target.data_validator.{ValidatorCheckEvent, ValidatorCounter, ValidatorError, ValidatorQuickCheckError}
+import com.target.data_validator.VarSubstitution


replace these two imports with import com.target.data_validator._

phpisciuneri · 2019-06-28T15:00:55Z

src/main/scala/com/target/data_validator/validator/RowBased.scala

+    * Calculates the max acceptable number of errors from threshold and rowCount.
+    * @param rowCount of table.
+    * @return max number of errors we can tolerate.
+    * if threshold < 0, then its a percentage of rowCount.


threshold < 1

phpisciuneri · 2019-06-28T15:22:15Z

src/test/scala/com/target/data_validator/validator/RangeCheckSpec.scala

+        val maxJson = Json.fromDouble(3.0)
+        val sut = RangeCheck("max",
+          None, maxJson, None, Some("30%")) // scalastyle:ignore
+        assert(true) // FIXME!!!!!


do we need this?

Nope, we have other tests that check the calculation of the threshold count.

phpisciuneri · 2019-06-28T15:30:03Z

src/test/scala/com/target/data_validator/validator/TestHelpersSpec.scala

+  }
+
+  describe("mkDict") {
+    it("simple case") {


Is this block needed?

yes, I fixed the test.

phpisciuneri · 2019-06-28T15:30:53Z

src/test/scala/com/target/data_validator/validator/TestHelpersSpec.scala

+package com.target.data_validator.validator
+
+import com.target.TestingSparkSession
+import com.target.data_validator.TestHelpers._


import com.target.TestHelpers._

I get an error with import com.target.TestHelpers._

There are some more comments above that are being collapsed... Maybe you didn't see them? I suggested renaming the package of TestHelpers to match the location of the file in the test directory.

package com.target to correspond to location in directory

GitHub was hiding them. I will address them and update the PR.

phpisciuneri · 2019-06-28T15:31:15Z

src/test/scala/com/target/data_validator/validator/TestHelpersSpec.scala

+import com.target.TestingSparkSession
+import com.target.data_validator.TestHelpers._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{BooleanType, DoubleType, IntegerType, StringType, StructField, StructType}


import org.apache.spark.sql.types._

phpisciuneri · 2019-06-28T15:41:17Z

src/test/scala/com/target/data_validator/validator/RowBasedSpec.scala

+        val sut = NullCheck("col", Some("peanuts"))
+        assert(sut.configCheckThreshold)
+        assert(sut.failed)
+      }


I think we need to handle negative cases. The question is how? Do we error out, or just treat them as a threshold of 0?

it ("is negative") {
  val sut = NullCheck("col", Some("-10"))
  ???
}
it ("is negative fraction") {
  val sut = NullCheck("col", Some("-0.1"))
  ???
}
it ("is negative percent") {
  val sut = NullCheck("col", Some("-10%"))
  ???
}

I agree, added these and a couple more.
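
One way to settle the question above is to reject such values when the configuration is checked, which is the direction the later commit "Made regex stricter to reject negative values." points to. The sketch below is an assumption about what such a pattern could look like, not the regex used in this PR; `ThresholdFormatSketch` and `isValidThreshold` are hypothetical names.

```scala
object ThresholdFormatSketch {

  // Hypothetical pattern: digits, an optional fractional part, an optional trailing '%'.
  // No sign character is allowed, so "-10", "-0.1" and "-10%" are all rejected.
  private val ThresholdPattern = """^\d+(\.\d+)?%?$""".r

  /** True when `s` looks like a non-negative count, fraction, or percentage. */
  def isValidThreshold(s: String): Boolean =
    ThresholdPattern.pattern.matcher(s).matches()
}
```

With this pattern, a non-numeric value such as `peanuts` is rejected in the same way as a negative one. It also rejects a bare leading dot (`.5`), which may or may not be desirable.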

phpisciuneri · 2019-06-28T15:44:03Z

@jhahnn21 are you planning to review?

phpisciuneri · 2019-07-01T16:58:37Z

@dougb 👍

Doug Balog added 10 commits June 13, 2019 14:32

Adds some new functions to help with tests.

27bc47a

Upgrades sbt.version

8a159b5

Moves NegativeCheck and NullCheck out into its own modules.

78b6fc4

Fixes bug with Counter event.

b622ba0

Adds counter event to minNumRows report.

00bd79a

Switches to custom json decoders

a8b257b

Adds support to RowBased checks for thresholds.

633d1c4

Merge branch 'master' of github.com:target/data-validator into threshold

2ef9712

Adds missing section for parquetFile

3866e40

Adds description of threshold

0fadc3a

dougb changed the title ~~WIP: Adds Threshold to RowBased checks~~ Adds Threshold to RowBased checks Jun 19, 2019

dougb requested review from phpisciuneri and jhahnn21 June 19, 2019 02:29

Merge branch 'master' of github.com:target/data-validator into threshold

3caf2ef

dougb mentioned this pull request Jun 27, 2019

Refactor tests using traits #16

Open

Doug Balog added 3 commits June 27, 2019 10:27

Adds threshold feature to StringLengthCheck

00cb191

Eliminate IntelliJ warnings.

4979b99

Updates stringLengthCheck docs to include threshold.

efa28ab

phpisciuneri mentioned this pull request Jun 28, 2019

Adds ability to publish .jar to artifactory via env variables. #18

Merged

phpisciuneri suggested changes Jun 28, 2019

Doug Balog added 2 commits July 1, 2019 09:47

Made regex stricter to reject negative values.

430f738

Updates from code review.

9f6b242

dougb requested a review from phpisciuneri July 1, 2019 13:54

Doug Balog added 3 commits July 1, 2019 10:22

Fixes minor issue with int division.

56762a9

Reformatted code.

d649bb6

Updates from cr comments.

3096d64

phpisciuneri approved these changes Jul 1, 2019

dougb merged commit 1c4a626 into master Jul 1, 2019

dougb deleted the threshold branch July 1, 2019 18:32
