Add support for percentiles #543

tklinchik · 2023-12-18T20:35:06Z

It appears that API already supports median() and medianFor() but not arbitrary percentiles.
To make it on par with other DataFrame APIs it would desirable to have support for percentile(percentile = 0.95), etc.

The text was updated successfully, but these errors were encountered:

Jolanrensen · 2024-01-02T12:18:48Z

Would that be comparable to quantile() in Pandas? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html

tklinchik · 2024-01-02T13:47:20Z

Yes, that would be ideal

cmelchior · 2024-01-17T18:48:00Z

I was looking a bit into this because it sounded like something that should be easy to add, but I ran into a few peculiarities.

This is the implementation for median:

@PublishedApi
internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? {
    val list = if (type.isMarkedNullable) filterNotNull() else (this as Iterable<T>).asList()
    val size = list.size
    if (size == 0) return null
    val index = size / 2
    if (index == 0 || size % 2 == 1) return list.quickSelect(index)
    return when (type.classifier) {
        Double::class -> ((list.quickSelect(index - 1) as Double + list.quickSelect(index) as Double) / 2.0) as T
        Int::class -> ((list.quickSelect(index - 1) as Int + list.quickSelect(index) as Int) / 2) as T
        Long::class -> ((list.quickSelect(index - 1) as Long + list.quickSelect(index) as Long) / 2L) as T
        Byte::class -> ((list.quickSelect(index - 1) as Byte + list.quickSelect(index) as Byte) / 2).toByte() as T
        BigDecimal::class -> ((list.quickSelect(index - 1) as BigDecimal + list.quickSelect(index) as BigDecimal) / BigDecimal(2)) as T
        else -> list.quickSelect(index - 1)
    }
}

For some reason, it doesn't handle Short, Float, and BigInteger, which leads to funny examples like this one, where the same data will return different values depending on the type:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1.0, 4.0,
            2.0, 6.0,
            7.0, 7.0,
            7.0, 7.0,
        )
        df.median("a", "b") shouldBe 5
    }

    @Test
    fun `median of Short values`() {
        val df = dataFrameOf("a", "b")(
            1.toShort(), 4.toShort(),
            2.toShort(), 6.toShort(),
            7.toShort(), 7.toShort()
        )
        df.median("a", "b") shouldBe 4
    }

Also, the code for Int will round data in perhaps slightly surprising ways, at least it doesn't match any of the Interpolation modes available in Panda. The reason for this seems to be that the result of the median calculation is cast back to Int even though Double would probably be a more accurate answer, i.e. this example:

    @Test
    fun `median of two columns`() {
        val df = dataFrameOf("a", "b")(
            1, 4,
            2, 6,
            9, 10,
            11, 12,
        )
        df.median("a", "b") shouldBe 7 // For pandas, lower = 6, higher = 9, midpoint = 7.5  
    }

This begs the question of what to do with a quantile method (which is just a generalization of median).

Would it make sense to be able to specify the return value as well or should people convert their List<T> before calling median?

I.e. something like this would be the most flexible:

public enum class Interpolation {
    LINEAR, LOWER, HIGHER, NEAREST, MIDPOINT
}
internal inline fun <reified T : Comparable<T>, reified R: Any> Iterable<T?>.quantile(type: KType, q: Float, interpolation: Interpolation = Interpolation.LINEAR): R? { TODO() }

// Usage
val median: Double = dataframe.quantile<Double>(q = 0.5)

tklinchik · 2024-01-17T19:48:50Z

I believe the right way to approach this is to make quantile result to be always Double type since hard to imagine a limited use case for quantiles being of the same type as input.
In a perfect world you'd even specify an interpolation method to use for determining a quantile as Pandas does it but midpoint or linear would be a good starting point.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

Also, worth mentioning that for large data frames it would be beneficial to request multiple quantiles in one pass such as:

fun quantile(quantiles: List<Double>): Map<Double, Double> since you'd limit the number of times you sort the values to being done only once. However, not sure how that impacts the DataFrame DSL though as it would likely make it more complicated.

cmelchior · 2024-01-17T20:05:17Z

For what it is worth, mean() also always returns Double, but for some reason, it doesn't support BigDecimal/BigInteger either (calculating the mean on BigDecimal would also need to return a BigDecimal)

So it looks like the API is going to be inconsistent no matter the solution. At least if we want to avoid a breaking change.

tklinchik · 2024-01-17T20:24:22Z

I agree, seems that if BigDecimal support is needed it may require a special treatment but everything else should be covered by Double. Another option is to make quantiles always return BigDecimal to make it consistent but that might be overkill.

Jolanrensen · 2024-01-18T11:28:30Z

@cmelchior I indeed also found we're missing some types in (some of) the statistics functions #558.

I think returning Double's everywhere is fine as that's used by many libraries. BigDecimal/BigInteger are java-specific, so I don't mind if they get their own special treatment. That might also make the switch to multiplatform easier if we ever attempt that.

cmelchior · 2024-01-18T13:35:39Z

@Jolanrensen What is your stance on breaking changes? Would it be okay to change median() to return Double for Byte/Short/Int/Long/Float/Double?

Since median() also works on all Comparables (including String). These non-number types would probably just keep the existing behaviour?

Jolanrensen · 2024-01-18T14:55:31Z

@cmelchior
I think the expected behaviour for users would be:

Double, Int, Long, Byte, Short, Float -> median as Double
BigDecimal, BigInteger -> median as BigDecimal
Number -> attempt .toDouble() and run median as Double
Other comparable -> median as original type, middle of the list preferring the first of 2.

Might be best to implement this in a separate PR first though, as it solves a specific part of #558 and it's indeed a breaking change, but for the better :) (hopefully, haha)

tklinchik · 2024-01-23T02:29:56Z

Here is my workaround at the moment that could be helpful:

import org.apache.commons.math3.stat.StatUtils

fun DataColumn<*>.quantile(quantile: Double): Double? {
    check(quantile >= 0.0 && quantile <= 1.0) { quantile }
    val values = when (typeClass) {
        Int::class    -> (values as Iterable<Int?>).filterNotNull().toList().map { it.toDouble() }
        Long::class   -> (values as Iterable<Long?>).filterNotNull().toList().map { it.toDouble() }
        Double::class -> (values as Iterable<Double?>).filterNotNull().toList()
        else          -> throw IllegalArgumentException("Not supported: $typeClass")
    }.toTypedArray().run { toDoubleArray() } 
    return StatUtils.percentile(values, quantile * 100.0)
}

fun DataFrame<*>.quantile(name: String, quantile: Double): Double? = this.get(name).quantile(quantile)

That allows me to use it as standalone and within aggregate {}

Jolanrensen added the enhancement New feature or request label Jan 2, 2024

Jolanrensen added this to the Backlog milestone Jan 2, 2024

Jolanrensen added the research This requires a deeper dive to gather a better understanding label Jan 2, 2024

Jolanrensen mentioned this issue Jan 18, 2024

Describe breaks on Number column #558

Closed

11 tasks

Jolanrensen mentioned this issue Nov 21, 2024

☂ Statistics streamlining #961

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for percentiles #543

Add support for percentiles #543

tklinchik commented Dec 18, 2023

Jolanrensen commented Jan 2, 2024

tklinchik commented Jan 2, 2024

cmelchior commented Jan 17, 2024 •

edited

Loading

tklinchik commented Jan 17, 2024

cmelchior commented Jan 17, 2024 •

edited

Loading

tklinchik commented Jan 17, 2024

Jolanrensen commented Jan 18, 2024

cmelchior commented Jan 18, 2024

Jolanrensen commented Jan 18, 2024

tklinchik commented Jan 23, 2024

Add support for percentiles #543

Add support for percentiles #543

Comments

tklinchik commented Dec 18, 2023

Jolanrensen commented Jan 2, 2024

tklinchik commented Jan 2, 2024

cmelchior commented Jan 17, 2024 • edited Loading

tklinchik commented Jan 17, 2024

cmelchior commented Jan 17, 2024 • edited Loading

tklinchik commented Jan 17, 2024

Jolanrensen commented Jan 18, 2024

cmelchior commented Jan 18, 2024

Jolanrensen commented Jan 18, 2024

tklinchik commented Jan 23, 2024

cmelchior commented Jan 17, 2024 •

edited

Loading

cmelchior commented Jan 17, 2024 •

edited

Loading