Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.table performance regression 1534 #4

Open
DorisAmoakohene opened this issue Oct 9, 2023 · 7 comments
Open

data.table performance regression 1534 #4

DorisAmoakohene opened this issue Oct 9, 2023 · 7 comments

Comments

@DorisAmoakohene
Copy link
Owner

@tdhock kindly help me to restore this branch #Rdatatable/data.table#1534

atime.list <- atime::atime_versions(
pkg.path="C:/Users/Doris Afriyie/data.table",
pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){
      pkg_find_replace <- function(glob, FIND, REPLACE){
        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
      }
      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
      new.Package_ <- paste0(Package_, "_", sha)
      pkg_find_replace(
        "DESCRIPTION", 
        paste0("Package:\\s+", old.Package),
        paste("Package:", new.Package))
      pkg_find_replace(
        file.path("src","Makevars.*in"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        sprintf('packageVersion\\("%s"\\)', old.Package),
        sprintf('packageVersion\\("%s"\\)', new.Package))
      pkg_find_replace(
        file.path("src", "init.c"),
        paste0("R_init_", Package_regex),
        paste0("R_init_", gsub("[.]", "_", new.Package_)))
      pkg_find_replace(
        "NAMESPACE",
        sprintf('useDynLib\\("?%s"?', Package_regex),
        paste0('useDynLib(', new.Package_))
    },
  N=10^seq(3,8),
  setup={ 
    n <- N/1e6
    set.seed(1)
    dt <- data.table(Grp = rep(seq_len(n), each=10L))
    dt[, Value := sample(100L, size = .N, replace = TRUE)]
  },#https://github.com/Rdatatable/data.table/issues/1534
expr=data.table:::`[.data.table`(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][], dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][]),
  "Before"="58135017a985f3cc2c6f0d091c4effaec4442f56",#https://github.com/Rdatatable/data.table/tree/58135017a985f3cc2c6f0d091c4effaec4442f56
  "Fixed"="5eec8f2d00521edb0a4a088ad154d065daa869c6") #fixed:#https://github.com/Rdatatable/data.table/tree/5eec8f2d00521edb0a4a088ad154d065daa869c6. for fixed.

I'm trying to reproduce but gives me an error which its seem to me that the branch has been closed

"Error in value[3L] :
Error in revparse_single(object, branch): Error in 'git2r_revparse_single': Requested object could not be found

when trying to checkout 58135017a985f3cc2c6f0d091c4effaec4442f56"

@tdhock
Copy link

tdhock commented Oct 10, 2023

does it work now?

@DorisAmoakohene
Copy link
Owner Author

@tdhock
there is no package called ‘data.table.58135017a985f3cc2c6f0d091c4effaec4442f56’

its seems the id i choose for the fix is not it so i'm getting the right fixed id

@DorisAmoakohene
Copy link
Owner Author

DorisAmoakohene commented Oct 11, 2023

@tdhock

so I revisited the commit Id since the one i picked where not the exact SHA,

atime.list <- atime::atime_versions(
pkg.path="C:/Users/Doris Afriyie/data.table",
pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){
      pkg_find_replace <- function(glob, FIND, REPLACE){
        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
      }
      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
      new.Package_ <- paste0(Package_, "_", sha)
      pkg_find_replace(
        "DESCRIPTION", 
        paste0("Package:\\s+", old.Package),
        paste("Package:", new.Package))
      pkg_find_replace(
        file.path("src","Makevars.*in"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        sprintf('packageVersion\\("%s"\\)', old.Package),
        sprintf('packageVersion\\("%s"\\)', new.Package))
      pkg_find_replace(
        file.path("src", "init.c"),
        paste0("R_init_", Package_regex),
        paste0("R_init_", gsub("[.]", "_", new.Package_)))
      pkg_find_replace(
        "NAMESPACE",
        sprintf('useDynLib\\("?%s"?', Package_regex),
        paste0('useDynLib(', new.Package_))
    },
  N=10^seq(3,8),
  setup={ 
    n <- N/1e6
    set.seed(1)
    dt <- data.table(Grp = rep(seq_len(n), each=10L))
    dt[, Value := sample(100L, size = .N, replace = TRUE)]
  },#https://github.com/Rdatatable/data.table/issues/1534
expr=data.table:::`[.data.table`(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][], dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][]),
  "Before"="f5aad19fea63b8ee9c0d33fe3f019172135dcb6a",#https://github.com/Rdatatable/data.table/pull/5205/commits/f5aad19fea63b8ee9c0d33fe3f019172135dcb6a
  "Fixed"="a6abac319446ae7dde8bc4501fae40eeb5cc228c") #fixed:#https://github.com/Rdatatable/data.table/pull/5205/commits/a6abac319446ae7dde8bc4501fae40eeb5cc228c.

i run the new commit id and I'm still getting an error message

"Error: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM."

find links to the commit id in the code

https://github.com/Rdatatable/data.table/pull/5205/commits - link to the commit

@tdhock
Copy link

tdhock commented Oct 11, 2023

The code from Rdatatable/data.table#1534 is

library(data.table)
dt <- data.table(Grp = rep(seq_len(1e6), each=10L))
dt[, Value := sample(100L, size = .N, replace = TRUE)]

system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][])
#    user  system elapsed 
#   19.50    0.80   20.34
system.time(dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][])
#    user  system elapsed 
#    1.00    0.87    1.25 

dt[, all.equal(v, PrevValueByGrp)]
# [1] TRUE

The first timed expression dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][] is the slow one for which we want to test different versions.

The second timed expression dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][] is the alternative implementation which is faster (no need to run for different versions, but could run as a comparison, to see if the other one becomes just as fast, after applying the fix).

Your atime code has the same issue as the other code we discussed earlier today.

expr=data.table:::`[.data.table`(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][], dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][]),

in code above, expr contains both expressions, but should contain only the first, as below:

data.table:::`[.data.table`(dt, , PrevValueByGrp := shift(Value, type = "lag"), by = Grp)

Triple colon data.table::: prefix is necessary for atime_versions, which will change it into data.table.21abc12891298etc::: based on the versions you specify.

@tdhock
Copy link

tdhock commented Oct 11, 2023

also you should double check the N values, which seem to be smaller than 1 sometimes,

N=10^seq(3,8),
  setup={ 
    n <- N/1e6

@DorisAmoakohene
Copy link
Owner Author

okay, got it

@DorisAmoakohene
Copy link
Owner Author

DorisAmoakohene commented Oct 11, 2023

@tdhock,

atime.list <- atime::atime_versions(
pkg.path="C:/Users/Doris Afriyie/data.table",
pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){
      pkg_find_replace <- function(glob, FIND, REPLACE){
        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
      }
      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
      new.Package_ <- paste0(Package_, "_", sha)
      pkg_find_replace(
        "DESCRIPTION", 
        paste0("Package:\\s+", old.Package),
        paste("Package:", new.Package))
      pkg_find_replace(
        file.path("src","Makevars.*in"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        sprintf('packageVersion\\("%s"\\)', old.Package),
        sprintf('packageVersion\\("%s"\\)', new.Package))
      pkg_find_replace(
        file.path("src", "init.c"),
        paste0("R_init_", Package_regex),
        paste0("R_init_", gsub("[.]", "_", new.Package_)))
      pkg_find_replace(
        "NAMESPACE",
        sprintf('useDynLib\\("?%s"?', Package_regex),
        paste0('useDynLib(', new.Package_))
    },
  N=10^seq(3,8),
  setup={ 
    n <- N/1e6
    set.seed(1)
    dt <- data.table(Grp = rep(seq_len(n), each=10L))
    dt[, Value := sample(100L, size = .N, replace = TRUE)]
  },#https://github.com/Rdatatable/data.table/issues/1534
expr=data.table:::`[.data.table`(dt, , PrevValueByGrp := shift(Value, type = "lag"), by = Grp),
  "Before"="a6abac319446ae7dde8bc4501fae40eeb5cc228c",##https://github.com/Rdatatable/data.table/pull/5205/commits/a6abac319446ae7dde8bc4501fae40eeb5cc228c
  "Fixed"="f5aad19fea63b8ee9c0d33fe3f019172135dcb6a") #fixed:##https://github.com/Rdatatable/data.table/pull/5205/commits/f5aad19fea63b8ee9c0d33fe3f019172135dcb6a .

i plotted the new code and this is the result I'm getting, any comment, corrections or submissions
Screenshot 2023-10-10 210048

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants