Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Threshold to RowBased checks #13

Merged
merged 19 commits into from
Jul 1, 2019
Merged

Adds Threshold to RowBased checks #13

merged 19 commits into from
Jul 1, 2019

Conversation

dougb
Copy link
Collaborator

@dougb dougb commented Jun 14, 2019

This pr adds thresholds to RowBased checks (negativeCheck, nullCheck, rangeCheck)
Thresholds lets the user define an acceptable error threshold that they can tolerate.
Currently if you have a single error in a Billion row table, the check will fail.
A threshold can be represented as a number of rows, or a percentage of the total number of rows processed.

type: negativeCheck
column: price
threshold: 2.3%

@dougb dougb changed the title WIP: Adds Threshold to RowBased checks Adds Threshold to RowBased checks Jun 19, 2019
@dougb dougb requested review from phpisciuneri and jhahnn21 June 19, 2019 02:29
@jhahnn21
Copy link

@dougb Why is this PR so big?

@colindean
Copy link
Collaborator

@jhahnn21 much of the added SLOC is test code.

Copy link
Contributor

@phpisciuneri phpisciuneri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor things we can quickly address

README.md Outdated
The third section are the validators. To specify a validator, you
first specify the type as one of the validators, then specify the
arguments for that validator. Some of the validators support an error
threshold. This options allows the user specify the number of errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option allows the user to specify...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


If the threshold is `< 1` it is considered a fraction of the row count. For example `0.25` would fail the check if more then `rowCount * 0.25` of the rows fail the check.
If the threshold ends in a `%` its considered a percentage of the row count. For eample `33%` would fail the check if more then `rowCount * 0.33` of the rows fail the check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice explanation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

("column", Json.fromString(column)),
("threshold", Json.fromString(threshold.getOrElse("0"))),
("failed", Json.fromBoolean(failed)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this space :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

import io.circe.Json
import io.circe.syntax._
import com.target.data_validator.{ValidatorCheckEvent, ValidatorCounter, ValidatorError, ValidatorQuickCheckError}
import com.target.data_validator.VarSubstitution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace these two imports with import com.target.data_validator._

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

* Calculates the max acceptable number of errors from threshold and rowCount.
* @param rowCount of table.
* @return max number of errors we can tolerate.
* if threshold < 0, then its a percentage of rowCount.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threshold < 1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

val maxJson = Json.fromDouble(3.0)
val sut = RangeCheck("max",
None, maxJson, None, Some("30%")) // scalastyle:ignore
assert(true) // FIXME!!!!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, We have other tests that check the calculation of threshold count.

}

describe("mkDict") {
it("simple case") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this block needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I fixed the test.

package com.target.data_validator.validator

import com.target.TestingSparkSession
import com.target.data_validator.TestHelpers._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import com.target.TestHelpers._

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get error with import com.target.TestHelpers._

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some more comments above that are being collapsed... Maybe you didn't see them? I suggested renaming the package of TestHelpers to match the location of the file in the test directory.

package com.target to correspond to location in directory

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github was hiding them. I will address them and update pr.

import com.target.TestingSparkSession
import com.target.data_validator.TestHelpers._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, DoubleType, IntegerType, StringType, StructField, StructType}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import org.apache.spark.sql.types._

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

val sut = NullCheck("col", Some("peanuts"))
assert(sut.configCheckThreshold)
assert(sut.failed)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to handle negative cases. The question is how? Do we error out, or just treat them as a threshold of 0?

      it ("is negative") {
        val sut = NullCheck("col", Some("-10"))
        ???
      }
      it ("is negative fraction") {
        val sut = NullCheck("col", Some("-0.1"))
        ???
      }
      it ("is negative percent") {
        val sut = NullCheck("col", Some("-10%"))
        ???
      }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, added these and a couple more.

@phpisciuneri
Copy link
Contributor

@jhahnn21 are you planning to review?

@dougb dougb requested a review from phpisciuneri July 1, 2019 13:54
@phpisciuneri
Copy link
Contributor

@dougb 👍

@dougb dougb merged commit 1c4a626 into master Jul 1, 2019
@dougb dougb deleted the threshold branch July 1, 2019 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants