Describe breaks on Number column #558

Closed
8 of 11 tasks
Jolanrensen opened this issue Jan 12, 2024 · 5 comments · Fixed by #937
Labels
bug Something isn't working

Comments

@Jolanrensen
Collaborator

Jolanrensen commented Jan 12, 2024

Describe breaks on Number columns. This happens because the Iterable<Number>.std() function accepts Number but doesn't convert them to Double (like mean() does).
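To illustrate the kind of fix described above, here is a minimal sketch that converts every `Number` to `Double` before doing any arithmetic. The name `stdAsDouble`, the default `ddof = 1`, and the `n - ddof` divisor (the NumPy convention) are assumptions for illustration, not the library's actual implementation.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper (not part of the library): standard deviation over
// mixed Number values, converting each element to Double up front.
fun Iterable<Number>.stdAsDouble(ddof: Int = 1): Double {
    val values = map { it.toDouble() }
    val n = values.size
    // Assumption: not enough values for the chosen ddof yields NaN.
    if (n - ddof <= 0) return Double.NaN
    val mean = values.sum() / n
    val squaredDeviations = values.sumOf { (it - mean) * (it - mean) }
    // Assumption: divisor is n - ddof, as in NumPy's std.
    return sqrt(squaredDeviations / (n - ddof))
}
```

Because the conversion happens first, a column typed as plain `Number` holding a mix of `Int`, `Long`, `Float`, `Short`, and `Byte` values goes through the same code path as a `Double` column.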

There are a couple more missing actually:

  • cumSum

    • Misses Byte, Short
    • Has DataColumn overloads but not Iterable/Sequence
  • mean

    • Has Sequence<Double | Float> but not for other Number types
  • median

    • Misses Float, Byte, Short, Number (it only works on Comparable)
    • Needs to handle other types consistently
    • No Sequence overloads
    • Cannot skipNA (if applicable)
  • min and max

    • internal Iterable<T>.min and max are not used and can be removed. Stdlib functions for Comparable sequences and iterables are used instead.
    • Misses Number (it only works on Comparable)
    • Short and Byte are converted to Int for some reason
  • std

    • Breaks if type is Number
    • Short and Byte are cast to Int which works but is a bit iffy
    • Iterable overloads missing for Number, Short, Byte
    • Sequence overloads missing
    • Nullable overloads missing for Iterable (and sequence)
  • varianceAndMean

    • Also provides a std(ddof: Int) function without docs on what ddof even means, as well as count. Could have a better name. It can also produce nulls?? This screams for documentation.
    • variance functions are missing on DataColumns entirely (had to be added separately for Kandy)
    • Misses Short, Byte, Number, and nullable overloads
    • Misses Sequence overloads
  • sum

    • Has TODOs where types are amiss
    • Misses Float(!), Short, Byte, Number in various Iterable overloads.
  • All are also missing BigInteger as we're supporting BigDecimal too.

  • There are plenty of public overloads on Iterable and Sequence. It's fine to have them internally, but I feel like we're clogging the public scope here. mean, for instance, is already covered in the stdlib.

  • We need to honor some conversion table (see below)

  • Describe now only shows min, median, and max for <T : Comparable<T>> columns, so not for Number. This makes sense technically, but not from a user's perspective. We can just convert to Double first, then calculate them.
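The last point in the list, converting to `Double` first and then computing min, median, and max, could look roughly like the sketch below. `NumberSummary` and `summarize` are hypothetical names used only for illustration, and NaN handling is deliberately left out.

```kotlin
// Hypothetical summary type, for illustration only.
data class NumberSummary(val min: Double?, val median: Double?, val max: Double?)

// Hypothetical helper: drop nulls, convert every Number to Double,
// then compute the order statistics on the sorted Doubles.
fun Iterable<Number?>.summarize(): NumberSummary {
    val sorted = filterNotNull().map { it.toDouble() }.sorted()
    if (sorted.isEmpty()) return NumberSummary(null, null, null)
    val mid = sorted.size / 2
    val median =
        if (sorted.size % 2 == 1) sorted[mid]
        else (sorted[mid - 1] + sorted[mid]) / 2.0 // average the two middle values
    return NumberSummary(sorted.first(), median, sorted.last())
}
```

The trade-off is that the result type is always `Double?`, even when every value in the column happens to be an `Int`.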

@Jolanrensen Jolanrensen added the bug (Something isn't working) and invalid (This issue/PR doesn't seem right) labels Jan 12, 2024
@koperagen
Collaborator

#352 is probably the same problem.

@Jolanrensen
Collaborator Author

As mentioned in #543, some functions like median(ints) might return an unexpectedly rounded Int. It might be better to let all functions return Double and handle BigInteger / BigDecimal separately for now, as they're Java-specific.
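A small sketch of the difference between the two return-type choices. Both helpers are hypothetical and only contrast the behaviors; in particular, this sketch truncates via integer division, which may not match how the library actually rounds.

```kotlin
// Hypothetical: median that stays in Int. For an even number of elements
// the average of the two middle values is truncated by integer division.
fun medianAsInt(values: List<Int>): Int? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid]
    else (sorted[mid - 1] + sorted[mid]) / 2
}

// Hypothetical: median that always returns Double, so nothing is lost.
fun medianAsDouble(values: List<Int>): Double? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid].toDouble()
    else (sorted[mid - 1].toDouble() + sorted[mid].toDouble()) / 2.0
}
```

For `listOf(1, 2)`, `medianAsInt` gives `1` while `medianAsDouble` gives `1.5`.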

@zaleslaw
Copy link
Collaborator

This looks like an umbrella ticket and should be split into smaller tasks.

@Jolanrensen Jolanrensen changed the milestone from 0.14.0 to 0.15.0 Aug 1, 2024
@Jolanrensen Jolanrensen self-assigned this Aug 1, 2024
@Jolanrensen Jolanrensen changed the title from "Describe breaks on Number column (and other statistics inconsistencies)" to "☂️ Describe breaks on Number column (and other statistics inconsistencies)" Sep 20, 2024
@Jolanrensen
Collaborator Author

Jolanrensen commented Nov 6, 2024

It may be best to first just fix some of the most annoying bugs or obvious oversights, like those relating to describe() or missing types.

After that, a revamp may be best. We'll need to hide public functions that are not on DataColumn, as @AndreiKingsley will probably make a statistics library for that anyway.

I'd also recommend making and honoring a conversion table. Something like:

| Function | Conversion | Extra information | Nulls in input |
|---|---|---|---|
| mean | Int -> Double | | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | BigInteger -> BigDecimal? | null instead of NaN in output | |
| | BigDecimal -> BigDecimal? | null instead of NaN in output | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (NaN) | | |
| sum | Int -> Int | All default to zero if no values | All nulls are filtered out |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Float | skipNaN option, false by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (0.0) | | |
| cumSum | Int -> Int | All default to zero if no values | All can optionally skip nulls in input with skipNull option, true by default (important because order matters with cumSum) |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, true by default | |
| | Float -> Float | skipNaN option, true by default | |
| | BigInteger -> BigInteger | | |
| | BigDecimal -> BigDecimal | | |
| | Number -> Double | skipNaN option, true by default | |
| | Nothing / no values -> Double (0.0) | | |
| min/max | T -> T? where T : Comparable<T> | For all: null if no elements | All nulls are filtered out |
| | Int -> Int? | | |
| | Short -> Short? | | |
| | Byte -> Byte? | | |
| | Long -> Long? | | |
| | Double -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Float -> Float? | If has NaN, result will be NaN, needs skipNaN option? | |
| | BigInteger -> BigInteger? | | |
| | BigDecimal -> BigDecimal? | | |
| | Number -> Double? | If has NaN, result will be NaN, needs skipNaN option? | |
| | Nothing / no values -> Double? (null) | | |
| | (Don't convert Short/Byte to Int!) | | |
| median | T -> T? where T : Comparable<T> | For all: median of even list will cause conversion to Double, and null if no elements | All nulls are filtered out |
| | Int -> Double? | | |
| | Short -> Double? | | |
| | Byte -> Double? | | |
| | Long -> Double? | | |
| | Double -> Double? | | |
| | Float -> Double? | | |
| | BigInteger -> BigDecimal? | | |
| | BigDecimal -> BigDecimal? | | |
| | Number -> Double? | | |
| | Nothing / no values -> Double? (null) | | |
| std | Int -> Double | All have DDoF (Delta Degrees of Freedom) argument | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | BigInteger -> BigDecimal? | null instead of NaN in output | |
| | BigDecimal -> BigDecimal? | null instead of NaN in output | |
| | Number -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double (NaN) | | |
| var (want to add?) | same as std | | |
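As a rough sketch of what honoring the mean rows of this table could mean in code: nulls are filtered out, every primitive `Number` is converted to `Double`, `skipNaN` is false by default, and an empty input yields `NaN`. The name `meanAsDouble` is hypothetical, and the BigInteger / BigDecimal rows are intentionally not covered here.

```kotlin
// Hypothetical helper following the proposed "mean" rows:
// nulls filtered out, Number -> Double, skipNaN false by default,
// no values -> Double.NaN.
fun Iterable<Number?>.meanAsDouble(skipNaN: Boolean = false): Double {
    var sum = 0.0
    var count = 0
    for (value in this) {
        if (value == null) continue // all nulls are filtered out
        val d = value.toDouble()
        if (d.isNaN()) {
            if (skipNaN) continue // ignore NaN only when asked to
            return Double.NaN     // otherwise NaN propagates to the result
        }
        sum += d
        count++
    }
    return if (count == 0) Double.NaN else sum / count
}
```

With the default `skipNaN = false`, a single NaN in the input makes the whole result NaN; passing `skipNaN = true` computes the mean over the remaining values.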

@Jolanrensen Jolanrensen changed the title from "☂️ Describe breaks on Number column (and other statistics inconsistencies)" to "Describe breaks on Number column" Nov 21, 2024
@Jolanrensen
Collaborator Author

This issue will be continued in #961

3 participants