[[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 #4646
Comments
Many thanks for the great report; I've downloaded your RDS from S.O. There was a change to `[[` by group in this release, so we'll have to look into that.
The verbose output in the S.O. post (thanks) shows the following:
Which is more a clue for us to fix in this case, than for the user. |
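For readers who want to collect the same kind of diagnostic output on their own data, here is a minimal sketch. It assumes a small made-up table (the names `dt`, `id`, `value`, and `v` are illustrative, not from the report) and relies only on the `verbose` argument of `[.data.table`. The exact messages printed vary between data.table versions, so none are shown here.

```r
library(data.table)

# Hypothetical toy data: one nested data.table per row, stored in a list column
dt = data.table(
  id = 1:3,
  value = list(data.table(v = 1:2), data.table(v = 3:4), data.table(v = 5:6))
)

# verbose = TRUE prints diagnostic messages about how the grouped query is evaluated
res = dt[, value[[1L]], by = id, verbose = TRUE]
```

Pasting that console output into a report is usually the quickest way for maintainers to see which code path was taken.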
I had the same initial thought for the cause, but SO also reported the same slowdown for an unlist command. Anyway, there's repro data now, which makes all the difference. Thanks @fabiocs8! |
Glad to help you.
I would like to deactivate the link for the dataset. Have you already made a copy of it?
Regards, Fabio.
|
Glad to help.
Because you have downloaded the dataset, I will deactivate the link.
Please let me know whether I can be of further help.
Regards,
Fabio.
On 27/07/2020 17:16, Matt Dowle wrote:
Thank you for the report and I've downloaded your RDS from S.O. There was a change to `[[` by group in this release so we'll have to look into that.
GForce is deactivated for `[[` on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612.
|
Thanks for doing all the work of reinstalling 1.12.8 to provide additional information. This type of info is great! The most interesting thing in 1.13.0 is that there is a major performance difference between
It does appear to be a regression from 1.12.8. I need to deactivate the gforce optimization using parentheses (e.g.
So it appears to be related to
In the meantime, here's a way to reproduce the issue with simulated data:
library(data.table) #1.13.0
set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))
bench::mark(
dt[seq_len(.N), value[[1L]], by = id]
, dt[, value[[1L]], by = id]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 dt[seq_len(.N), value[[1L]], by = id] 2.56ms 3.05ms 287. 2.24MB
#> 2 dt[, value[[1L]], by = id] 219.41ms 250.98ms 3.98 51.91MB
#> # ... with 1 more variable: `gc/sec` <dbl>
Using
|
I noticed other related odd behaviors. All the commands below were run under v1.12.8 on the full database:
1) Problems when using item[, lance[[1]], by = unnest_names, verbose = T]; see below:
Detected that j uses these columns: lance
Note that 1) data.table has not issued any error or warning, and 2) the output has the same number of rows as the input data (872,851), which does not make sense (the output should have 16,070,070 rows).
2) item_int <- item[, unlist(lance, recursive = F), by = item_id] runs smoothly and correctly:
Detected that j uses these columns: lance
memcpy contiguous groups took 0.074s for 872581 groups
Unit: seconds
3) item_int <- item[, rbindlist(lance), by = unnest_names] runs correctly, but takes a long time... does that make sense?
Detected that j uses these columns: lance
memcpy contiguous groups took 0.113s for 872581 groups
Regards, Fabio. |
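As a possible alternative to unnesting by group, the nested tables can be bound in a single ungrouped `rbindlist()` call and the identifier reattached afterwards. The sketch below is not a fix proposed in this thread. It assumes simulated data like the reproduction example above, one nested data.table per row, and an unnamed list column; `pos` is a hypothetical helper column.

```r
library(data.table)

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

# Grouped unnesting, as discussed in the thread
by_group = dt[, value[[1L]], by = id]

# Ungrouped alternative: bind every nested table in one call.
# For an unnamed list, idcol records the integer position of each element.
flat = rbindlist(dt$value, idcol = "pos")
flat[, id := dt$id[pos]]    # map positions back to the original ids
flat[, pos := NULL]         # drop the helper column
setcolorder(flat, "id")     # put id first, matching the grouped result

isTRUE(all.equal(by_group, flat))
```

Because this avoids `by` entirely, it does not depend on how grouped `[[` is evaluated. It does assume the rows of `dt` are already in the desired output order.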
Additional research: #4164 is the commit where this started.
## merged commit before #4164
##remotes::install_github("https://github.com/Rdatatable/data.table/tree/793f8545c363d222de18ac892bc7abb80154e724") # good
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
##1 dt[seq_len(.N), value[[1L]], by = id] 3.68ms 4.06ms 232. 2.26MB 5.11 91 2 392ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [260 ~ <bch:tm [93~ <tibble [93 x ~
##2 dt[, value[[1L]], by = id] 2.06ms 2.29ms 430. 401.24KB 4.16 207 2 481ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [17 x~ <bch:tm [20~ <tibble [209 x~
## final commit with merged #4164
# remotes::install_github("https://github.com/Rdatatable/data.table/tree/4aadde8f5a51cd7c8f3889964e7280432ec65bbc") #bad - merged commit
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
##1 dt[seq_len(.N), value[[1L]], by = id] 1.25ms 1.39ms 572. 2.26MB 12.0 286 6 500ms <data.table[,2] [20,000~ <Rprofmem[,3] [260 x 3~ <bch:tm [2~ <tibble [286 ~
##2 dt[, value[[1L]], by = id] 173.01ms 188.94ms 5.29 52.03MB 5.29 3 3 567ms <data.table[,2] [20,000~ <Rprofmem[,3] [251,669~ <bch:tm [3~ <tibble [3 x ~
The last thing I looked into is that if
Code between discontinuous vs. continuous: Lines 161 to 179 in db61844
Performance difference:
library(data.table) #1.13.0
set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
unordered_id = sample(n),
value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))
bench::mark(
dt[, value[[1L]], by = ordered_id]
, dt[, value[[1L]], by = unordered_id]
, check = FALSE
, time_unit = "ms"
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id] 93.5 124. 8.25 53.7MB
#> 2 dt[, value[[1L]], by = unordered_id] 0.634 1.69 629. 409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>
I will work on a PR for |
I have a script that takes only 4 seconds to run in v1.12.8, and more than 24 hours in v1.13.0. All other variables are the same (computer, R and RStudio versions, dataset, etc.).
All relevant and detailed information about this issue is on Stack Overflow:
https://stackoverflow.com/questions/63105711/why-data-table-unnesting-time-grows-with-the-square-of-number-of-rows-for-a-spec
I am not a programmer, so I am not able to go through the code. Below you will find the sessionInfo() output for the environment AFTER changing back to v1.12.8.
Please let me know whether I can be of any further help.
BTW, congratulations on this amazing game-changer package.
Regards,
Fabio.
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8 lubridate_1.7.9 stringi_1.4.6 runner_0.3.7 e1071_1.7-3
[7] ggplot2_3.3.2 stringr_1.4.0 magrittr_1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 class_7.3-17 tools_4.0.2 digest_0.6.25 packrat_0.5.0 evaluate_0.14
[9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1 xfun_0.16
[17] withr_2.2.0 dplyr_1.0.0 knitr_1.29 generics_0.0.2 vctrs_0.3.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[25] R6_2.4.1 rmarkdown_2.3 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 colorspace_1.4-1 tinytex_0.25
[33] munsell_0.5.0 crayon_1.3.4