Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for percentiles #543

Open
tklinchik opened this issue Dec 18, 2023 · 10 comments
Open

Add support for percentiles #543

tklinchik opened this issue Dec 18, 2023 · 10 comments
Labels
enhancement New feature or request research This requires a deeper dive to gather a better understanding
Milestone

Comments

@tklinchik
Copy link

It appears that API already supports median() and medianFor() but not arbitrary percentiles.
To make it on par with other DataFrame APIs it would desirable to have support for percentile(percentile = 0.95), etc.

@Jolanrensen Jolanrensen added the enhancement New feature or request label Jan 2, 2024
@Jolanrensen Jolanrensen added this to the Backlog milestone Jan 2, 2024
@Jolanrensen
Copy link
Collaborator

Would that be comparable to quantile() in Pandas? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html

@tklinchik
Copy link
Author

Yes, that would be ideal

@Jolanrensen Jolanrensen added the research This requires a deeper dive to gather a better understanding label Jan 2, 2024
@cmelchior
Copy link
Contributor

cmelchior commented Jan 17, 2024

I was looking a bit into this because it sounded like something that should be easy to add, but I ran into a few peculiarities.

This is the implementation for median:

@PublishedApi
internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? {
    val list = if (type.isMarkedNullable) filterNotNull() else (this as Iterable<T>).asList()
    val size = list.size
    if (size == 0) return null
    val index = size / 2
    if (index == 0 || size % 2 == 1) return list.quickSelect(index)
    return when (type.classifier) {
        Double::class -> ((list.quickSelect(index - 1) as Double + list.quickSelect(index) as Double) / 2.0) as T
        Int::class -> ((list.quickSelect(index - 1) as Int + list.quickSelect(index) as Int) / 2) as T
        Long::class -> ((list.quickSelect(index - 1) as Long + list.quickSelect(index) as Long) / 2L) as T
        Byte::class -> ((list.quickSelect(index - 1) as Byte + list.quickSelect(index) as Byte) / 2).toByte() as T
        BigDecimal::class -> ((list.quickSelect(index - 1) as BigDecimal + list.quickSelect(index) as BigDecimal) / BigDecimal(2)) as T
        else -> list.quickSelect(index - 1)
    }
}

For some reason, it doesn't handle Short, Float, and BigInteger, which leads to funny examples like this one, where the same data will return different values depending on the type:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1.0, 4.0,
            2.0, 6.0,
            7.0, 7.0,
            7.0, 7.0,
        )
        df.median("a", "b") shouldBe 5
    }

    @Test
    fun `median of Short values`() {
        val df = dataFrameOf("a", "b")(
            1.toShort(), 4.toShort(),
            2.toShort(), 6.toShort(),
            7.toShort(), 7.toShort()
        )
        df.median("a", "b") shouldBe 4
    }

Also, the code for Int will round data in perhaps slightly surprising ways, at least it doesn't match any of the Interpolation modes available in Panda. The reason for this seems to be that the result of the median calculation is cast back to Int even though Double would probably be a more accurate answer, i.e. this example:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1, 4,
            2, 6,
            9, 10,
            11, 12,
        )
        df.median("a", "b") shouldBe 7 // For pandas, lower = 6, higher = 9, midpoint = 7.5  
    }

This begs the question of what to do with a quantile method (which is just a generalization of median).

Would it make sense to be able to specify the return value as well or should people convert their List<T> before calling median?

I.e. something like this would be the most flexible:

public enum class Interpolation {
    LINEAR, LOWER, HIGHER, NEAREST, MIDPOINT
}
internal inline fun <reified T : Comparable<T>, reified R: Any> Iterable<T?>.quantile(type: KType, q: Float, interpolation: Interpolation = Interpolation.LINEAR): R? { TODO() }

// Usage
val median: Double = dataframe.quantile<Double>(q = 0.5) 

@tklinchik
Copy link
Author

I believe the right way to approach this is to make quantile result to be always Double type since hard to imagine a limited use case for quantiles being of the same type as input.
In a perfect world you'd even specify an interpolation method to use for determining a quantile as Pandas does it but midpoint or linear would be a good starting point.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

Also, worth mentioning that for large data frames it would be beneficial to request multiple quantiles in one pass such as:

fun quantile(quantiles: List<Double>): Map<Double, Double> since you'd limit the number of times you sort the values to being done only once. However, not sure how that impacts the DataFrame DSL though as it would likely make it more complicated.

@cmelchior
Copy link
Contributor

cmelchior commented Jan 17, 2024

For what it is worth, mean() also always returns Double, but for some reason, it doesn't support BigDecimal/BigInteger either (calculating the mean on BigDecimal would also need to return a BigDecimal)

So it looks like the API is going to be inconsistent no matter the solution. At least if we want to avoid a breaking change.

@tklinchik
Copy link
Author

I agree, seems that if BigDecimal support is needed it may require a special treatment but everything else should be covered by Double. Another option is to make quantiles always return BigDecimal to make it consistent but that might be overkill.

@Jolanrensen
Copy link
Collaborator

@cmelchior I indeed also found we're missing some types in (some of) the statistics functions #558.

I think returning Double's everywhere is fine as that's used by many libraries. BigDecimal/BigInteger are java-specific, so I don't mind if they get their own special treatment. That might also make the switch to multiplatform easier if we ever attempt that.

@cmelchior
Copy link
Contributor

@Jolanrensen What is your stance on breaking changes? Would it be okay to change median() to return Double for Byte/Short/Int/Long/Float/Double?

Since median() also works on all Comparables (including String). These non-number types would probably just keep the existing behaviour?

@Jolanrensen
Copy link
Collaborator

@cmelchior
I think the expected behaviour for users would be:

Double, Int, Long, Byte, Short, Float -> median as Double
BigDecimal, BigInteger -> median as BigDecimal
Number -> attempt .toDouble() and run median as Double
Other comparable -> median as original type, middle of the list preferring the first of 2.

Might be best to implement this in a separate PR first though, as it solves a specific part of #558 and it's indeed a breaking change, but for the better :) (hopefully, haha)

@tklinchik
Copy link
Author

Here is my workaround at the moment that could be helpful:

import org.apache.commons.math3.stat.StatUtils

fun DataColumn<*>.quantile(quantile: Double): Double? {
    check(quantile >= 0.0 && quantile <= 1.0) { quantile }
    val values = when (typeClass) {
        Int::class    -> (values as Iterable<Int?>).filterNotNull().toList().map { it.toDouble() }
        Long::class   -> (values as Iterable<Long?>).filterNotNull().toList().map { it.toDouble() }
        Double::class -> (values as Iterable<Double?>).filterNotNull().toList()
        else          -> throw IllegalArgumentException("Not supported: $typeClass")
    }.toTypedArray().run { toDoubleArray() } 
    return StatUtils.percentile(values, quantile * 100.0)
}

fun DataFrame<*>.quantile(name: String, quantile: Double): Double? = this.get(name).quantile(quantile)

That allows me to use it as standalone and within aggregate {}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request research This requires a deeper dive to gather a better understanding
Projects
None yet
Development

No branches or pull requests

3 participants