Error using an old tidymodels workflow. #1401

Open
reisner opened this issue Nov 26, 2024 · 7 comments
Labels
bug (an unexpected problem or unintended behavior), reprex (needs a minimal reproducible example)

Comments

@reisner

reisner commented Nov 26, 2024

Hi there,

I'm trying to use a model and recipe that were saved in 2018 with updated packages.

The original model was saved with:

parsnip v0.0.1
recipes v0.1.4

I'm trying to use it with updated packages:

parsnip v1.2.1
recipes v1.1.0

The original model was trained with svm_poly / ksvm.

I've been able to get the model to work (by setting model$elapsed[["elapsed"]] = 1).

However, I'm trying to get the old recipe to work but am hitting this error:

Error in `group_data()`:
! `.data` must be a valid <grouped_df> object.
Caused by error in `validate_grouped_df()`:
! Corrupt `grouped_df` using old (< 0.8.0) format.
ℹ Strip off old grouping with `ungroup()`.

Is there some way I can fix the recipe to work with new package versions? Or is there a way to extract the recipe components and create an updated recipe object? Unfortunately I don't have access to the original training data.

Thanks!

@EmilHvitfeldt
Member

Hello @reisner 👋

Would you be able to provide a little more information?

  1. When did this start happening? Did this just happen? Both {recipes} and {parsnip} haven't had a CRAN release in months.
  2. Are you able to provide a traceback() of the error? This will allow us to better narrow down where the issue is.

EmilHvitfeldt added the bug (an unexpected problem or unintended behavior) and reprex (needs a minimal reproducible example) labels Nov 26, 2024
@reisner
Author

reisner commented Nov 26, 2024

Hi @EmilHvitfeldt

I don't think this is a recent issue. It's more about trying to load a legacy model with newer versions of the packages. I'm assuming this isn't something you'll be supporting forever, but I was hoping for some guidance on how I can convert the old recipe into the new package format.

Here is the relevant part of the traceback:

> traceback()
29: stop(fallback)
28: signal_abort(cnd, .file)
27: abort(msg, parent = cnd, call = error_call)
26: (function (cnd) 
    {
        msg <- glue("`.data` must be a valid <grouped_df> object.")
        abort(msg, parent = cnd, call = error_call)
    })(structure(list(message = structure("Corrupt `grouped_df` using old (< 0.8.0) format.", names = ""), 
        trace = structure(list(call = list(source("train_model_and_predict.R"), 
            withVisible(eval(ei, envir)), eval(ei, envir), eval(ei, 
                envir), main(), run_prediction(model$model, training_df, 
                model$recipe), bake(trained_recipe, new_data = testing_data), 
            bake.recipe(trained_recipe, new_data = testing_data), 
            recipes_eval_select(terms, new_data, info, check_case_weights = FALSE), 
            vec_slice(info, matches$haystack), `<fn>`(), vec_restore_dispatch(x = x, 
                to = to), vec_restore.grouped_df(x = x, to = to), 
            group_intersect(to, x), intersect(dplyr::group_vars(x), 
                names(new)), dplyr::group_vars(x), group_vars.data.frame(x), 
            setdiff(names(group_data(x)), ".rows"), group_data(x), 
            group_data.grouped_df(x), withCallingHandlers(validate_grouped_df(.data), 
                error = function(cnd) {
                    msg <- glue("`.data` must be a valid <grouped_df> object.")
                    abort(msg, parent = cnd, call = error_call)
                }), validate_grouped_df(.data), abort(bullets)), 
            parent = c(0L, 1L, 1L, 3L, 0L, 5L, 6L, 6L, 8L, 9L, 10L, 
            11L, 11L, 13L, 14L, 14L, 14L, 17L, 17L, 17L, 20L, 20L, 
            22L), visible = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
            TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
            TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE), namespace = c("base", 
            "base", "base", "base", NA, NA, "recipes", "recipes", 
            "recipes", "vctrs", "vctrs", "vctrs", "vctrs", "vctrs", 
            "base", "dplyr", "dplyr", "generics", "dplyr", "dplyr", 
            "base", "dplyr", "rlang"), scope = c("::", "::", "::", 
            "::", "global", "global", "::", ":::", "::", "::", "local", 
            ":::", "local", ":::", "::", "::", ":::", "::", "::", 
            ":::", "::", "::", "::"), error_frame = c(FALSE, FALSE, 
            FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
            FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
            FALSE, FALSE, FALSE, TRUE, FALSE)), row.names = c(NA, 
        -23L), version = 2L, class = c("rlang_trace", "rlib_trace", 
        "tbl", "data.frame")), parent = NULL, body = c(i = "Strip off old grouping with `ungroup()`."), 
        rlang = list(inherit = TRUE), call = validate_grouped_df(.data), 
        use_cli_format = TRUE), class = c("rlang_error", "error", 
    "condition")))
25: signalCondition(cnd)
24: signal_abort(cnd, .file)
23: abort(bullets)
22: validate_grouped_df(.data)
21: withCallingHandlers(validate_grouped_df(.data), error = function(cnd) {
        msg <- glue("`.data` must be a valid <grouped_df> object.")
        abort(msg, parent = cnd, call = error_call)
    })
20: group_data.grouped_df(x)
19: group_data(x)
18: setdiff(names(group_data(x)), ".rows")
17: group_vars.data.frame(x)
16: dplyr::group_vars(x)
15: intersect(dplyr::group_vars(x), names(new))
14: group_intersect(to, x)
13: vec_restore.grouped_df(x = x, to = to)
12: vec_restore_dispatch(x = x, to = to)
11: (function () 
    vec_restore_dispatch(x = x, to = to))()
10: vec_slice(info, matches$haystack)
9: recipes_eval_select(terms, new_data, info, check_case_weights = FALSE)
8: bake.recipe(trained_recipe, new_data = testing_data)
7: bake(trained_recipe, new_data = testing_data) at analysis_functions.R#222
6: run_prediction(model$model, training_df, model$recipe) at train_model_and_predict.R#216
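
The traceback shows the failure happening when `vec_slice(info, ...)` tries to restore a `grouped_df`, which suggests that at least one data frame stored inside the saved recipe still carries the pre-0.8.0 dplyr grouping. A minimal sketch for locating such elements, using base R only, might look like the following. `find_grouped()` and `fake_recipe` are hypothetical names made up for illustration; they are not part of recipes or dplyr, and the field names in the fake object are stand-ins rather than a claim about the real recipe's layout.

```r
# Hypothetical helper (not part of recipes or dplyr): walks a nested list and
# returns the paths of every element that inherits from "grouped_df".
find_grouped <- function(x, path = "obj") {
  if (inherits(x, "grouped_df")) {
    return(path)
  }
  # Only descend into plain lists; data frames and atomic values are leaves.
  if (!is.list(x) || is.data.frame(x)) {
    return(character(0))
  }
  x <- unclass(x)
  nms <- names(x)
  if (is.null(nms)) {
    nms <- rep("", length(x))
  }
  out <- character(0)
  for (i in seq_along(x)) {
    label <- if (nzchar(nms[i])) {
      paste0(path, "$", nms[i])
    } else {
      paste0(path, "[[", i, "]]")
    }
    out <- c(out, find_grouped(x[[i]], label))
  }
  out
}

# Stand-in object that mimics a recipe holding a grouped data frame.
fake_grouped <- structure(
  data.frame(variable = c("a", "b")),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  vars = "variable"
)
fake_recipe <- structure(
  list(
    var_info = data.frame(variable = "a"),
    last_term_info = fake_grouped,
    steps = list(structure(list(info = fake_grouped), class = "step"))
  ),
  class = "recipe"
)

grouped_paths <- find_grouped(fake_recipe, "recipe")
print(grouped_paths)
```

On the real object, the call would be `find_grouped(model$recipe, "recipe")`; the paths it returns would show which components need their grouping removed.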

@EmilHvitfeldt
Member

I'm sorry, I'm not able to find the cause of this issue with the current information.

Ideally old objects should work, but without a reprex, or a specific version that breaks things, it can be hard to figure out what is wrong.

@reisner
Author

reisner commented Dec 4, 2024

Thanks for looking into this, @EmilHvitfeldt !

If you think this is something tidymodels should support, I can put together a better reprex for you. Since I still have the docker container for the old version, I can generate a model I can share that fails in the new versions. Let me know if you think it's worth looking into 😄

@EmilHvitfeldt
Member

That would be great! Thank you!

@reisner
Author

reisner commented Dec 6, 2024

OK, so I've created a reprex. There are two steps:

  1. Creating the model and recipe using the old OS, R version, and tidymodels packages. I save these in an .rds file, which I've attached here: model_pipeline.rds.zip - Download and unzip that file.
  2. In an updated OS, R version, and tidymodels packages, I read the .rds file and try to use the model and recipe.

For step 1 above, here is the environment information:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2.2   yardstick_0.0.2  tibble_1.4.2     rsample_0.0.3
 [5] tidyr_0.8.2      recipes_0.1.4    purrr_0.2.5      parsnip_0.0.1
 [9] infer_0.4.0      ggplot2_3.1.0    dials_0.0.2      scales_1.0.0
[13] broom_0.5.1      tidymodels_0.0.2 dplyr_0.7.8

loaded via a namespace (and not attached):
 [1] nlme_3.1-144         matrixStats_0.54.0   xts_0.11-2
 [4] lubridate_1.7.4      threejs_0.3.1        tidyposterior_0.0.2
 [7] rstan_2.18.2         SnowballC_0.5.1      backports_1.1.3
[10] tools_3.6.3          R6_2.3.0             DT_0.5
[13] rpart_4.1-15         lazyeval_0.2.1       colorspace_1.3-2
[16] nnet_7.3-12          withr_2.1.2          tidyselect_0.2.5
[19] gridExtra_2.3        prettyunits_1.0.2    processx_3.4.1
[22] compiler_3.6.3       cli_1.1.0            shinyjs_1.0
[25] tidypredict_0.2.1    colourpicker_1.0     dygraphs_1.1.1.6
[28] ggridges_0.5.1       callr_3.3.2          stringr_1.3.1
[31] digest_0.6.18        StanHeaders_2.18.0-1 minqa_1.2.4
[34] rstanarm_2.18.2      base64enc_0.1-3      pkgconfig_2.0.2
[37] htmltools_0.3.6      lme4_1.1-19          htmlwidgets_1.3
[40] rlang_0.3.0.1        rstudioapi_0.8       shiny_1.2.0
[43] bindr_0.1.1          generics_0.0.2       zoo_1.8-4
[46] crosstalk_1.0.0      gtools_3.8.1         tokenizers_0.2.1
[49] inline_0.3.15        magrittr_1.5         loo_2.0.0
[52] bayesplot_1.6.0      Matrix_1.2-18        Rcpp_1.0.0
[55] munsell_0.5.0        pROC_1.13.0          stringi_1.2.4
[58] MASS_7.3-51.5        pkgbuild_1.0.6       plyr_1.8.4
[61] grid_3.6.3           parallel_3.6.3       promises_1.0.1
[64] crayon_1.3.4         miniUI_0.1.1.1       lattice_0.20-38
[67] splines_3.6.3        ps_1.3.0             pillar_1.3.1
[70] igraph_1.2.2         markdown_0.9         shinystan_2.5.0
[73] reshape2_1.4.3       codetools_0.2-16     stats4_3.6.3
[76] rstantools_1.5.1     glue_1.3.0           tidytext_0.2.0
[79] nloptr_1.2.1         httpuv_1.4.5.1       gtable_0.2.0
[82] kernlab_0.9-27       assertthat_0.2.0     gower_0.1.2
[85] mime_0.6             prodlim_2018.04.18   xtable_1.8-3
[88] janeaustenr_0.1.5    later_0.7.5          rsconnect_0.8.12
[91] class_7.3-15         survival_3.1-8       timeDate_3043.102
[94] shinythemes_1.1.2    lava_1.6.4           ipred_0.9-8

$ uname -a
Linux 431d694c03f6 6.10.11-linuxkit #1 SMP Thu Oct  3 10:17:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$  lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic

Here is the code used to run Step 1. You do not need to run this, since I've attached the file that is generated by the code. I'm only putting it here for reference.

library(dplyr)
library(tidymodels)

train_df = mtcars %>%
  mutate(label = (vs == 0)) %>%
  select(-vs) %>%
  mutate(am = paste0("am is ", am)) # Use for one-hot encoding recipe

model_outcome = "label"
model_inputs = setdiff(names(train_df), model_outcome)
model_vars = c(model_outcome, model_inputs)
var_roles = c(rep("outcome", length(model_outcome)),
              rep("predictor", length(model_inputs)))

cat("Creating Recipe...\n")
recipe_out = recipe(x = train_df, vars = model_vars, roles = var_roles) %>%
  step_dummy(am, one_hot = TRUE) %>%
  step_meanimpute(all_numeric())

cat("Baking Recipe...\n")
trained_recipe = prep(recipe_out, training = train_df)
prepped_training_data = bake(trained_recipe, new_data = train_df)

cat("Training SVM...\n")
model_settings = svm_poly(degree = 1, scale_factor = 0.1, cost = 1) %>% 
  set_engine("kernlab")
model = fit(model_settings, label ~ ., data = prepped_training_data)

model_pipeline = list(
  model = model,
  recipe = trained_recipe
)

cat("Saving Pipeline...\n")
saveRDS(model_pipeline, file = "model_pipeline.rds")

For step 2 above, here is my environment information:

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Edmonton
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_1.3.1    workflowsets_1.1.0 workflows_1.1.4    tune_1.2.1        
 [5] tidyr_1.3.1        tibble_3.2.1       rsample_1.2.1      recipes_1.1.0     
 [9] purrr_1.0.2        parsnip_1.2.1      modeldata_1.4.0    infer_1.0.7       
[13] ggplot2_3.5.1      dials_1.3.0        scales_1.3.0       broom_1.0.7       
[17] tidymodels_1.2.0   dplyr_1.1.4       

loaded via a namespace (and not attached):
 [1] gtable_0.3.5        lattice_0.21-8      vctrs_0.6.5        
 [4] tools_4.3.0         generics_0.1.3      parallel_4.3.0     
 [7] fansi_1.0.6         pkgconfig_2.0.3     Matrix_1.5-4       
[10] data.table_1.16.0   lhs_1.2.0           GPfit_1.0-8        
[13] lifecycle_1.0.4     compiler_4.3.0      munsell_0.5.1      
[16] codetools_0.2-19    DiceDesign_1.10     class_7.3-21       
[19] prodlim_2024.06.25  pillar_1.9.0        furrr_0.3.1        
[22] MASS_7.3-58.4       gower_1.0.1         iterators_1.0.14   
[25] rpart_4.1.19        foreach_1.5.2       parallelly_1.38.0  
[28] lava_1.8.0          tidyselect_1.2.1    digest_0.6.37      
[31] future_1.34.0       kernlab_0.9-33      listenv_0.9.1      
[34] splines_4.3.0       grid_4.3.0          colorspace_2.1-1   
[37] cli_3.6.3           magrittr_2.0.3      survival_3.5-5     
[40] utf8_1.2.4          future.apply_1.11.2 withr_2.5.0        
[43] prettyunits_1.2.0   backports_1.5.0     lubridate_1.9.3    
[46] timechange_0.3.0    globals_0.16.3      nnet_7.3-18        
[49] timeDate_4041.110   hardhat_1.4.0       rlang_1.1.4        
[52] Rcpp_1.0.10         glue_1.8.0          ipred_0.9-15       
[55] rstudioapi_0.16.0   R6_2.5.1       

$ uname -a
Linux a8eb45bbfc11 6.10.11-linuxkit #1 SMP Thu Oct  3 10:17:28 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

And finally, here is the code for running step 2, where I'm encountering 2 issues:

library(dplyr)
library(tidymodels)

model_pipeline = readRDS('model_pipeline.rds')
model = model_pipeline$model
recipe = model_pipeline$recipe

# This works:
print(recipe)

# Issue 1: Printing the model fails with an error, unless we set elapsed:
model$elapsed[["elapsed"]] = 0
print(model)

# Create Test Dataframe:
test_df = mtcars %>%
  mutate(label = (vs == 0)) %>%
  select(-vs) %>%
  mutate(am = paste0("am is ", am)) # Create a categorical field for one-hot encoding recipe

# Issue 2: Trying to run the recipe on the test with updated packages:
prepped_test_df = bake(recipe, new_data = test_df)

I've been able to address "Issue 1" in that example, and perhaps parsnip could handle this internally? Maybe with a warning?

However, I'm not sure what to do with "Issue 2" in that example. When I try running bake with that recipe object, I get this error:

Error in group_data(x) : 
  `.data` must be a valid <grouped_df> object.
Caused by error in `validate_grouped_df()`:
! Corrupt `grouped_df` using old (< 0.8.0) format.
ℹ Strip off old grouping with `ungroup()`.
In addition: Warning message:
`keep_original_cols` was added to `step_dummy()` after this recipe was created.
ℹ Regenerate your recipe to avoid this warning.  

Thanks so much for your help! Let me know if I can do anything else to help.
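
The error message recommends stripping the old grouping with `ungroup()`. Since the grouped data frame is buried inside the recipe object, one possible workaround is to walk the object and drop the `grouped_df` class and the grouping attributes by hand. The sketch below is an assumption-laden illustration, not a confirmed fix: `strip_old_grouping()` is a hypothetical helper, the attribute names (`vars`, `indices`, `labels`, `group_sizes`, `biggest_group_size`, `drop`) are assumed to be the ones the pre-0.8.0 dplyr format used, and it has not been checked against the attached model_pipeline.rds. Other differences between recipes 0.1.4 and 1.1.0 may still surface afterwards.

```r
# Hypothetical helper (not part of recipes or dplyr): recursively removes the
# "grouped_df" class and grouping attributes from data frames nested in a list.
strip_old_grouping <- function(x) {
  if (inherits(x, "grouped_df")) {
    # Assumed legacy attribute names, plus the newer "groups" attribute.
    grouping_attrs <- c("vars", "indices", "labels", "group_sizes",
                        "biggest_group_size", "drop", "groups")
    for (a in grouping_attrs) {
      attr(x, a) <- NULL
    }
    class(x) <- setdiff(class(x), "grouped_df")
    return(x)
  }
  if (is.list(x) && !is.data.frame(x)) {
    # Rebuild the list element by element, then restore its own attributes
    # (names, class, and so on) so the container keeps its type.
    attrs <- attributes(x)
    x <- lapply(unclass(x), strip_old_grouping)
    attributes(x) <- attrs
  }
  x
}

# Stand-in object that mimics a recipe holding a grouped data frame.
fake_grouped <- structure(
  data.frame(variable = c("a", "b")),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  vars = "variable"
)
fake_recipe <- structure(
  list(last_term_info = fake_grouped, steps = list(list(info = fake_grouped))),
  class = "recipe"
)

cleaned <- strip_old_grouping(fake_recipe)
print(class(cleaned$last_term_info))

# Untested usage against the objects from step 2:
# fixed_recipe <- strip_old_grouping(model_pipeline$recipe)
# prepped_test_df <- bake(fixed_recipe, new_data = test_df)
```

The helper leaves everything that is not a grouped data frame untouched, so the recipe's steps and trained values are carried over as they were saved.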

@topepo
Member

topepo commented Dec 19, 2024

Perhaps related to either tidyverse/dplyr#1341 or tidyverse/dplyr#1405?
