feature request: `step_cut` does not match results from `dplyr::mutate(var = cut(var,n))` when `n` is an integer #1398

jkylearmstrong · 2024-11-19T05:46:53Z

feature-request

It took me a minute to realize that the behavior of step_cut differs from that of cut.

step_cut expects explicit breaks, whereas cut generates equal-width intervals when a single integer is supplied as breaks.

library('dplyr')
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library('tidymodels')
tidymodels_prefer()

df <- data.frame(x = c(1:10,31:40), y = 1:20)

rec3 <- recipe(df) %>%
  step_mutate(x_cut = x) %>%
  step_cut(x_cut, breaks = 3, include_outside_range = T) %>%
  prep()

tidy(rec3, 2)
#> # A tibble: 3 × 3
#>   terms value id       
#>   <chr> <dbl> <chr>    
#> 1 x_cut     1 cut_5ea2R
#> 2 x_cut     3 cut_5ea2R
#> 3 x_cut    40 cut_5ea2R

bake(rec3, new_data = df) %>%
  mutate(dplyr_cut = cut(x,3, include.lowest = TRUE))
#> # A tibble: 20 × 4
#>        x     y x_cut   dplyr_cut 
#>    <int> <int> <fct>   <fct>     
#>  1     1     1 [min,3] [0.961,14]
#>  2     2     2 [min,3] [0.961,14]
#>  3     3     3 [min,3] [0.961,14]
#>  4     4     4 (3,max] [0.961,14]
#>  5     5     5 (3,max] [0.961,14]
#>  6     6     6 (3,max] [0.961,14]
#>  7     7     7 (3,max] [0.961,14]
#>  8     8     8 (3,max] [0.961,14]
#>  9     9     9 (3,max] [0.961,14]
#> 10    10    10 (3,max] [0.961,14]
#> 11    31    11 (3,max] (27,40]   
#> 12    32    12 (3,max] (27,40]   
#> 13    33    13 (3,max] (27,40]   
#> 14    34    14 (3,max] (27,40]   
#> 15    35    15 (3,max] (27,40]   
#> 16    36    16 (3,max] (27,40]   
#> 17    37    17 (3,max] (27,40]   
#> 18    38    18 (3,max] (27,40]   
#> 19    39    19 (3,max] (27,40]   
#> 20    40    20 (3,max] (27,40]

Created on 2024-11-19 with [reprex v2.1.1](https://reprex.tidyverse.org/)

Perhaps step_cut could gain an n_breaks option. It could also accept a vector or named list, applying each element to the corresponding variable in the list of selected variables.

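# pseudo code for step_cut_n_breaks
# `recipe` is the recipe to extend, `col` the column to cut, and `var` the
# numeric vector whose range determines the breaks.
step_cut_n_breaks <- function(recipe, col, var, n_breaks, include_outside_range = FALSE) {
  var_min <- min(var, na.rm = TRUE)
  var_max <- max(var, na.rm = TRUE)

  # width of each of the n_breaks equal intervals
  diff <- (var_max - var_min) / n_breaks

  # n_breaks + 1 edges, from the minimum to the maximum
  res_seq <- seq(
    var_min,
    var_max,
    by = diff
  )

  # drop the first and last edge; step_cut adds the min and max itself
  res_seq <- res_seq[-1]
  res_seq <- res_seq[-length(res_seq)]

  # Once the ranges have been computed you could still use the existing step_cut functionality:
  step_cut(recipe, {{ col }}, breaks = res_seq, include_outside_range = include_outside_range)
}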
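As a hedged sketch of that idea, the equal-width cut points can be computed outside the recipe and passed to the existing step_cut(). The names `n_breaks`, `edges`, `interior`, and `same_bins` are illustrative only, and the breaks are computed from the full data frame here, which is an assumption; a real step would derive them from the training data seen during prep().

```r
library(recipes)

df <- data.frame(x = c(1:10, 31:40), y = 1:20)
n_breaks <- 3  # illustrative: number of equal-width bins wanted

# n_breaks + 1 equally spaced edges over the observed range of x
edges <- seq(min(df$x), max(df$x), length.out = n_breaks + 1)

# keep only the interior cut points; step_cut() adds the min and max itself
interior <- edges[-c(1, length(edges))]

rec <- recipe(df) %>%
  step_mutate(x_cut = x) %>%
  step_cut(x_cut, breaks = interior) %>%
  prep()

baked <- bake(rec, new_data = df)

# compare bin membership (not the level labels) with base::cut()
same_bins <- identical(as.integer(baked$x_cut), as.integer(cut(df$x, n_breaks)))
same_bins
```

Only bin membership is compared, because step_cut() and cut() label their levels differently.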


EmilHvitfeldt · 2024-11-20T17:37:46Z

Hello @jkylearmstrong 👋

Thanks for your issue! Have you seen the step_discretize() step? https://recipes.tidymodels.org/reference/step_discretize.html

It contains the argument num_breaks to split by number instead of by position as is done in step_cut()

jkylearmstrong · 2024-11-20T18:11:21Z

Hey @EmilHvitfeldt, thanks for your response.

Yes, I looked at step_discretize() as well.

To my understanding, step_discretize() aims to put the same number of points in each bin. This means that if my dataset has 8 rows and I use step_discretize() with num_breaks = 4, each bin would have 2 values, regardless of range. This also does not match the behavior of cut.

df <- data.frame(
  x = c(1, 2, 3, 100, 101, 500, 501, 600)
)

library('tidyverse')
library('tidymodels')
tidymodels_prefer()


rec_dis <- recipe(df) %>%
  step_mutate(x_dis = x) %>%
  step_discretize(x_dis, num_breaks = 4, min_unique =1)

prep_rec <- prep(rec_dis)

bake(prep_rec, new_data = df) %>%
  mutate(x_cut = cut(x,4))
#> # A tibble: 8 × 3
#>       x x_dis x_cut      
#>   <dbl> <fct> <fct>      
#> 1     1 bin1  (0.401,151]
#> 2     2 bin1  (0.401,151]
#> 3     3 bin2  (0.401,151]
#> 4   100 bin2  (0.401,151]
#> 5   101 bin3  (0.401,151]
#> 6   500 bin3  (450,601]  
#> 7   501 bin4  (450,601]  
#> 8   600 bin4  (450,601]

When you supply cut with an integer, it roughly performs the pseudo-code I supplied above: the range is computed and then divided into n equal parts, and then the values fall within one of the distinct buckets.
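A small base R sketch of that description, using the data from the first reprex. The names `edges`, `interior`, `by_hand`, and `builtin` are illustrative. One detail the pseudo-code leaves out: when given a single number, cut() also moves the two outer limits away by 0.1% of the range, which is where a label such as 0.961 comes from (1 - 39/1000).

```r
x <- c(1:10, 31:40)
n <- 3

# divide the observed range into n equal parts
edges <- seq(min(x), max(x), length.out = n + 1)

# interior cut points only
interior <- edges[-c(1, length(edges))]

# open-ended outer bins so every value falls into some bucket
by_hand <- cut(x, breaks = c(-Inf, interior, Inf))

# what cut() does on its own when handed a single integer
builtin <- cut(x, n)

# bin membership agrees, even though the labels differ
identical(as.integer(by_hand), as.integer(builtin))

levels(builtin)
```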

For most recipe steps, the behavior of step_* matches the behavior in dplyr or base R. It took me a minute to realize that step_cut requires the ranges to be supplied explicitly in order to match the results from a dplyr pipe.

EmilHvitfeldt · 2024-11-20T21:02:05Z

The way I see it, we don't want step_cut() to match the functionality of cut() 100% here. In my opinion, having different behaviour when passing a length-1 vector compared to a longer vector isn't great in cut().

If we were to add this functionality, I think it would fit better in step_discretize(), since the documentation for step_cut() reads

step_cut() creates a specification of a recipe step that cuts a numeric variable into a factor based on provided boundary values.

We could make step_discretize() calculate points to satisfy different constraints such as "equally spaced" or "evenly distributed".
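To make the two constraints concrete, here is a minimal base R sketch using the data from the earlier comment. The helpers `equal_width_breaks()` and `equal_count_breaks()` are hypothetical names, not part of recipes, and quantile-based edges are only one assumed reading of "evenly distributed".

```r
x <- c(1, 2, 3, 100, 101, 500, 501, 600)
n_bins <- 4

# "equally spaced": edges split the range into bins of equal width
equal_width_breaks <- function(x, n) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n + 1)
}

# "evenly distributed": edges at quantiles, so bins hold similar counts
equal_count_breaks <- function(x, n) {
  unname(quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE))
}

width_bins <- cut(x, equal_width_breaks(x, n_bins), include.lowest = TRUE)
count_bins <- cut(x, equal_count_breaks(x, n_bins), include.lowest = TRUE)

table(width_bins)  # counts per bin: 5, 0, 0, 3
table(count_bins)  # counts per bin: 2, 2, 2, 2
```

The equal-width counts line up with the cut(x, 4) column in the reprex above, and the equal-count bins line up with the step_discretize() column.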

We should also update the documentation to reflect how step_cut() and cut() work.

jkylearmstrong · 2024-11-22T06:11:31Z

@EmilHvitfeldt

Thanks for considering.

My story: I wanted to apply some cuts I had made during exploration, binning a variable into "Low", "Medium", and "High", and then use the tidymodels framework for repeated analysis. I looked at the index of functions, found the corresponding step_cut, put it into the pipeline, and thought "okay, this should be easy then". However, the results didn't match my initial exploration. I was a little shocked at first, since most of the step_* functions match their dplyr or base counterparts, and step_cut didn't behave the way I expected. An additional disclaimer or clarifying language around

step_cut() will call base::cut() in the baking step with include.lowest set to TRUE.

might be helpful, as step_cut() does something slightly different from this.

However, when looking at the examples it becomes more clear as to how it's being applied.

Thanks for this package, overall I find it very helpful. :)

EmilHvitfeldt added the "feature" (a feature request or enhancement) and "documentation" labels on Nov 20, 2024
