Adds Threshold to RowBased checks #13
Conversation
@dougb Why is this PR so big?
@jhahnn21 much of the added SLOC is test code.
just some minor things we can quickly address
README.md (Outdated)
The third section are the validators. To specify a validator, you
first specify the type as one of the validators, then specify the
arguments for that validator. Some of the validators support an error
threshold. This options allows the user specify the number of errors
This option allows the user to specify...
fixed.
If the threshold is `< 1` it is considered a fraction of the row count. For example, `0.25` would fail the check if more than `rowCount * 0.25` of the rows fail the check.
If the threshold ends in a `%` it is considered a percentage of the row count. For example, `33%` would fail the check if more than `rowCount * 0.33` of the rows fail the check.
👍 nice explanation
Thanks!
("column", Json.fromString(column)), | ||
("threshold", Json.fromString(threshold.getOrElse("0"))), | ||
("failed", Json.fromBoolean(failed)), | ||
|
can we remove this space :)
yes.
import io.circe.Json
import io.circe.syntax._
import com.target.data_validator.{ValidatorCheckEvent, ValidatorCounter, ValidatorError, ValidatorQuickCheckError}
import com.target.data_validator.VarSubstitution
replace these two imports with import com.target.data_validator._
done.
* Calculates the max acceptable number of errors from threshold and rowCount.
* @param rowCount of table.
* @return max number of errors we can tolerate.
* if threshold < 0, then its a percentage of rowCount.
threshold < 1
fixed.
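For reference, the calculation discussed in this thread boils down to a small mapping from the threshold string to a row count. A minimal, self-contained sketch (hypothetical helper, simplified parsing; not the PR's actual code):

```scala
object ThresholdSketch {
  /** Maps a threshold string to the maximum tolerable number of failing rows.
    *  - "33%"  -> a percentage of rowCount
    *  - "0.25" -> a fraction (< 1) of rowCount
    *  - "100"  -> an absolute row count
    */
  def maxErrors(threshold: String, rowCount: Long): Long = {
    val t = threshold.trim
    if (t.endsWith("%")) {
      (t.dropRight(1).toDouble / 100.0 * rowCount).toLong
    } else {
      val d = t.toDouble
      if (d < 1.0) (d * rowCount).toLong else d.toLong
    }
  }

  def main(args: Array[String]): Unit = {
    println(maxErrors("0.25", 1000000L)) // 250000
    println(maxErrors("33%", 1000000L))  // 330000
    println(maxErrors("100", 1000000L))  // 100
  }
}
```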
val maxJson = Json.fromDouble(3.0)
val sut = RangeCheck("max",
  None, maxJson, None, Some("30%")) // scalastyle:ignore
assert(true) // FIXME!!!!!
do we need this?
Nope, we have other tests that check the calculation of the threshold count.
}

describe("mkDict") {
  it("simple case") {
Is this block needed?
yes, I fixed the test.
package com.target.data_validator.validator

import com.target.TestingSparkSession
import com.target.data_validator.TestHelpers._
import com.target.TestHelpers._
I get an error with import com.target.TestHelpers._
There are some more comments above that are being collapsed... Maybe you didn't see them? I suggested renaming the package of TestHelpers to
package com.target
to match the location of the file in the test directory.
GitHub was hiding them. I will address them and update the PR.
import com.target.TestingSparkSession
import com.target.data_validator.TestHelpers._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.types._
fixed.
val sut = NullCheck("col", Some("peanuts"))
assert(sut.configCheckThreshold)
assert(sut.failed)
}
I think we need to handle negative cases. The question is how? Do we error out, or just treat them as a threshold of 0?
it ("is negative") {
val sut = NullCheck("col", Some("-10"))
???
}
it ("is negative fraction") {
val sut = NullCheck("col", Some("-0.1"))
???
}
it ("is negative percent") {
val sut = NullCheck("col", Some("-10%"))
???
}
I agree, added these and a couple more.
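Following up on the negative cases above, here is a sketch of how those specs might look if negative thresholds were treated as configuration errors, mirroring the invalid-string ("peanuts") test earlier in the diff. This is only one possible resolution; the behaviour actually chosen in the PR may differ, and NullCheck/configCheckThreshold are the project's own classes:

```scala
import org.scalatest.FunSpec
import com.target.data_validator.validator.NullCheck

class NegativeThresholdSketchSpec extends FunSpec {
  describe("NullCheck with a negative threshold") {
    // Assumption: configCheckThreshold returns true (and failed is set) when the
    // threshold cannot be used, as in the Some("peanuts") test above.
    it("rejects a negative count") {
      val sut = NullCheck("col", Some("-10"))
      assert(sut.configCheckThreshold)
      assert(sut.failed)
    }
    it("rejects a negative fraction") {
      val sut = NullCheck("col", Some("-0.1"))
      assert(sut.configCheckThreshold)
      assert(sut.failed)
    }
    it("rejects a negative percent") {
      val sut = NullCheck("col", Some("-10%"))
      assert(sut.configCheckThreshold)
      assert(sut.failed)
    }
  }
}
```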
@jhahnn21 are you planning to review?
@dougb 👍
This PR adds thresholds to RowBased checks (negativeCheck, nullCheck, rangeCheck).
Thresholds let the user define an acceptable number of errors that they can tolerate.
Currently, a single error in a billion-row table will fail the check.
A threshold can be represented as an absolute number of rows or as a percentage of the total number of rows processed.
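To make the feature concrete, a rough usage sketch based on the constructor calls that appear in the test diffs above (the import path and exact signatures are assumptions; check the project sources):

```scala
import com.target.data_validator.validator.NullCheck

// Absolute count: tolerate up to 1000 rows with a null "event_id".
val byCount = NullCheck("event_id", Some("1000"))

// Fraction of the row count: tolerate up to 25% of rows failing.
val byFraction = NullCheck("event_id", Some("0.25"))

// Percentage form, equivalent to the fraction above.
val byPercent = NullCheck("event_id", Some("25%"))
```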