feature request: step_cut does not match results from dplyr::mutate(var = cut(var,n)) when n is an integer #1398

Open
jkylearmstrong opened this issue Nov 19, 2024 · 4 comments
Labels
documentation feature a feature request or enhancement

Comments

@jkylearmstrong
feature-request

It took me a minute to realize that step_cut behaves differently from cut.

step_cut expects explicit break points, whereas cut generates equal-width intervals when a single integer is passed as breaks.

library('dplyr')
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library('tidymodels')
tidymodels_prefer()

df <- data.frame(x = c(1:10,31:40), y = 1:20)

rec3 <- recipe(df) %>%
  step_mutate(x_cut = x) %>%
  step_cut(x_cut, breaks = 3, include_outside_range = TRUE) %>%
  prep()

tidy(rec3, 2)
#> # A tibble: 3 × 3
#>   terms value id       
#>   <chr> <dbl> <chr>    
#> 1 x_cut     1 cut_5ea2R
#> 2 x_cut     3 cut_5ea2R
#> 3 x_cut    40 cut_5ea2R

bake(rec3, new_data = df) %>%
  mutate(dplyr_cut = cut(x,3, include.lowest = TRUE))
#> # A tibble: 20 × 4
#>        x     y x_cut   dplyr_cut 
#>    <int> <int> <fct>   <fct>     
#>  1     1     1 [min,3] [0.961,14]
#>  2     2     2 [min,3] [0.961,14]
#>  3     3     3 [min,3] [0.961,14]
#>  4     4     4 (3,max] [0.961,14]
#>  5     5     5 (3,max] [0.961,14]
#>  6     6     6 (3,max] [0.961,14]
#>  7     7     7 (3,max] [0.961,14]
#>  8     8     8 (3,max] [0.961,14]
#>  9     9     9 (3,max] [0.961,14]
#> 10    10    10 (3,max] [0.961,14]
#> 11    31    11 (3,max] (27,40]   
#> 12    32    12 (3,max] (27,40]   
#> 13    33    13 (3,max] (27,40]   
#> 14    34    14 (3,max] (27,40]   
#> 15    35    15 (3,max] (27,40]   
#> 16    36    16 (3,max] (27,40]   
#> 17    37    17 (3,max] (27,40]   
#> 18    38    18 (3,max] (27,40]   
#> 19    39    19 (3,max] (27,40]   
#> 20    40    20 (3,max] (27,40]
Created on 2024-11-19 with [reprex v2.1.1](https://reprex.tidyverse.org/)

Perhaps step_cut could gain an n_breaks option. It could also accept a vector or named list and apply each element to the corresponding variable in the list of variables.

# pseudo code for step_cut_n_breaks
step_cut_n_breaks <- function(var, n_breaks, include_outside_range) {
  var_min <- min(var)
  var_max <- max(var)

  # width of each of the n_breaks equal-width intervals
  diff <- (var_max - var_min) / n_breaks

  res_seq <- seq(
    var_min,
    var_max,
    by = diff
  )

  # drop the two endpoints so only the interior boundaries remain
  res_seq <- res_seq[-1]
  res_seq <- res_seq[-length(res_seq)]

  # Once the ranges have been computed you could still use the existing step_cut functionality:
  step_cut(var, breaks = res_seq, include_outside_range = include_outside_range)
}
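
As a rough, runnable version of the idea above: the sketch below computes the interior boundaries outside the recipe and hands them to the existing step_cut(). The helper name `equal_width_breaks` is hypothetical (it is not part of recipes), and the comparison only checks bin membership, since the factor labels produced by step_cut() and cut() are expected to differ.

```r
library(dplyr)
library(recipes)

df <- data.frame(x = c(1:10, 31:40), y = 1:20)

# Hypothetical helper (not part of recipes): interior boundaries of
# n_breaks equal-width bins spanning the observed range of x.
equal_width_breaks <- function(x, n_breaks) {
  edges <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE),
               length.out = n_breaks + 1)
  edges[-c(1, length(edges))]
}

interior <- equal_width_breaks(df$x, 3)

rec <- recipe(y ~ x, data = df) %>%
  step_mutate(x_cut = x) %>%
  step_cut(x_cut, breaks = interior, include_outside_range = TRUE) %>%
  prep()

baked <- bake(rec, new_data = df)
base_bins <- cut(df$x, 3, include.lowest = TRUE)

# Cross-tabulate bin membership from the recipe against base::cut()
table(baked$x_cut, base_bins)
```

For this data the interior boundaries are 14 and 27, so the two approaches should place every row in the same bin even though the labels are written differently.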
@EmilHvitfeldt
Member

Hello @jkylearmstrong 👋

Thanks for your issue! Have you seen the step_discretize() step? https://recipes.tidymodels.org/reference/step_discretize.html

It contains the argument num_breaks to split by number instead of by position as is done in step_cut()

@jkylearmstrong
Author

Hey @EmilHvitfeldt, thanks for your response.

Yes, I looked at step_discretize as well.

To my understanding, step_discretize() aims to put the same number of points in each bin. This means that if my dataset has 8 rows and I use step_discretize() with num_breaks = 4, each bin would have 2 values, regardless of how wide a range each bin covers. This also does not match the behavior of cut.

df <- data.frame(
  x = c(1, 2, 3, 100, 101, 500, 501, 600)
)

library('tidyverse')
library('tidymodels')
tidymodels_prefer()


rec_dis <- recipe(df) %>%
  step_mutate(x_dis = x) %>%
  step_discretize(x_dis, num_breaks = 4, min_unique = 1)

prep_rec <- prep(rec_dis)

bake(prep_rec, new_data = df) %>%
  mutate(x_cut = cut(x,4))
#> # A tibble: 8 × 3
#>       x x_dis x_cut      
#>   <dbl> <fct> <fct>      
#> 1     1 bin1  (0.401,151]
#> 2     2 bin1  (0.401,151]
#> 3     3 bin2  (0.401,151]
#> 4   100 bin2  (0.401,151]
#> 5   101 bin3  (0.401,151]
#> 6   500 bin3  (450,601]  
#> 7   501 bin4  (450,601]  
#> 8   600 bin4  (450,601]

When you supply cut with an integer, it roughly performs the pseudo-code I supplied above: the range is computed and then divided into n equal parts, and then the values fall within one of the distinct buckets.
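
For reference, here is a small base R sketch of that computation. It follows the description in `?cut`: the range is split into n equal-width pieces, and the two outer limits are then moved outward by 0.1% of the range so that the extreme values fall inside the outer intervals. The variable names are only illustrative.

```r
x <- c(1, 2, 3, 100, 101, 500, 501, 600)
n <- 4

# Equal-width edges over the observed range
rng <- range(x)
edges <- seq(rng[1], rng[2], length.out = n + 1)

# Move the outer edges out by 0.1% of the range, as documented in ?cut
pad <- diff(rng) / 1000
edges[c(1, n + 1)] <- c(rng[1] - pad, rng[2] + pad)

manual <- cut(x, breaks = edges)  # explicit break points
auto <- cut(x, n)                 # single integer

# Both calls should assign every value to the same bin
identical(as.integer(manual), as.integer(auto))
```

The padding is where the 0.401 lower limit in the output above comes from: the range is 599, so the lower edge becomes 1 - 0.599.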

For most recipe steps, the behavior of step_* matches the behavior in dplyr or base R. It took me a minute to realize that step_cut requires the ranges to be supplied in order to match the results from a dplyr pipe.

@EmilHvitfeldt
Member

The way I see it, we don't want to match the functionality of step_cut() with cut() 100% here. Having different behaviour when passing a length-1 vector compared to a longer vector isn't great in cut(), in my opinion.

If we were to add this functionality, I think it would fit better in step_discretize(), since the step_cut() documentation reads:

step_cut() creates a specification of a recipe step that cuts a numeric variable into a factor based on provided boundary values.

We could make step_discretize() calculate points to satisfy different constraints such as "equally spaced" or "evenly distributed".

We should also update the documentation to reflect how step_cut() and cut() work.
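
To make the two constraints concrete, here is a base R sketch. The helper `bin_edges` and its `method` argument are hypothetical and are not an existing or planned recipes API; "evenly distributed" is assumed here to mean quantile-based edges, which is one possible reading.

```r
# Hypothetical helper: bin edges under two different constraints
bin_edges <- function(x, num_breaks, method = c("equal_width", "quantile")) {
  method <- match.arg(method)
  probs <- seq(0, 1, length.out = num_breaks + 1)
  switch(method,
    # "equally spaced": every bin covers the same width
    equal_width = seq(min(x), max(x), length.out = num_breaks + 1),
    # "evenly distributed": every bin holds roughly the same number of points
    quantile = unname(quantile(x, probs = probs))
  )
}

x <- c(1, 2, 3, 100, 101, 500, 501, 600)

width_bins <- cut(x, bin_edges(x, 4, "equal_width"), include.lowest = TRUE)
quant_bins <- cut(x, bin_edges(x, 4, "quantile"), include.lowest = TRUE)

table(width_bins)  # counts vary by bin, widths are equal
table(quant_bins)  # counts are equal, widths vary
```

With this data the equal-width edges leave the two middle bins empty, while the quantile edges put two values in each bin.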

@EmilHvitfeldt EmilHvitfeldt added feature a feature request or enhancement documentation labels Nov 20, 2024
@jkylearmstrong
Author

jkylearmstrong commented Nov 22, 2024

@EmilHvitfeldt

Thanks for considering.

My story: I wanted to apply some cuts I had made in my exploration to a variable ("Low", "Medium", and "High") and then use the tidymodels framework for repeated analysis. I looked at the index of functions, found the corresponding step_cut, put it into the pipeline, and thought, "okay, this should be easy then". However, the results didn't match some of my initial exploration. I was a little shocked at first, since most of the step_* functions match their dplyr or base counterparts, and step_cut did not behave as I expected. An additional disclaimer or language around

step_cut() will call base::cut() in the baking step with include.lowest set to TRUE.

might be helpful, as it does something slightly different from this.

However, when looking at the examples, it becomes clearer how it's being applied.

Thanks for this package, overall I find it very helpful. :)
