Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transform is ~100x slower on data.table than on data.frame #5492 #2

Open
DorisAmoakohene opened this issue Mar 29, 2024 · 9 comments
Open

Comments

@DorisAmoakohene
Copy link
Owner

@tdhock
similar to the other issue I sent,
I am running this performance test and I am encountering this error which is I am not able to install of the commit ids

This is the link to the PR that fixes the issue (Rdatatable/data.table#5493) and the link to the issue (Rdatatable/data.table#5492)

Error message

Error in value[[3L]](cond) : 
  Error in revparse_single(object, branch): Error in 'git2r_revparse_single': Requested object could not be found

 when trying to checkout 93ce3ce1373bf733ebd2036e2883d2ffe377ab58

This is the code I am running


atime.list.5493 <- atime::atime_versions(
pkg.path=tdir,
pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){
      pkg_find_replace <- function(glob, FIND, REPLACE){
        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
      }
      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
      new.Package_ <- paste0(Package_, "_", sha)
      pkg_find_replace(
        "DESCRIPTION", 
        paste0("Package:\\s+", old.Package),
        paste("Package:", new.Package))
      pkg_find_replace(
        file.path("src","Makevars.*in"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        sprintf('packageVersion\\("%s"\\)', old.Package),
        sprintf('packageVersion\\("%s"\\)', new.Package))
      pkg_find_replace(
        file.path("src", "init.c"),
        paste0("R_init_", Package_regex),
        paste0("R_init_", gsub("[.]", "_", new.Package_)))
      pkg_find_replace(
        "NAMESPACE",
        sprintf('useDynLib\\("?%s"?', Package_regex),
        paste0('useDynLib(', new.Package_))
    },

  N=10^seq(1,20),
  setup={ 
    set.seed(108)
    df <- data.frame(x = runif(N))
    dt <- as.data.table(df)
    
  },

  expr=data.table:::`[.data.table`(transform(dt, y = round(x))),
"Before"="93ce3ce1373bf733ebd2036e2883d2ffe377ab58",#fIRST COMMIT IN THE PR(https://github.com/Rdatatable/data.table/pull/5493/commits)
  "Regression"="0bacebc9b813d84b9b267e0928b5fd7c7ea126fb", #PARENT OF THE LAST OF THE PR THAT BFIXES THE ISSUE (https://github.com/Rdatatable/data.table/commit/1e03fe7b890e63da9651d997ea52548c90b3ae32)
  "Fixed"="1e03fe7b890e63da9651d997ea52548c90b3ae32")# LAST COMMIT IN THE PR THAT FIXES THE ISSUE(https://github.com/Rdatatable/data.table/pull/5493/commits)

@tdhock
Copy link

tdhock commented Mar 30, 2024

the "not found" comes from the branch being in OfekShilon's fork, so you would have to make a new branch/pr with a copy of those commits in the main Rdatatable/data.table repo.

but actually there is a problem with Before:

"Before"="93ce3ce1373bf733ebd2036e2883d2ffe377ab58",#fIRST COMMIT IN THE PR(https://github.com/Rdatatable/data.table/pull/5493/commits)

Before should not be any commit in the PR which fixes the issue, because any of those commits could have actually provided the fix. Instead, please take Before from the parent of the first commit in the PR which causes the issue (any of the commits in that PR could have started the regression).

Does that make sense?

@tdhock
Copy link

tdhock commented Mar 30, 2024

  "Regression"="0bacebc9b813d84b9b267e0928b5fd7c7ea126fb", #PARENT OF THE LAST OF THE PR THAT BFIXES THE ISSUE (https://github.com/Rdatatable/data.table/commit/1e03fe7b890e63da9651d997ea52548c90b3ae32)

is not properly documented, can you please fix?
Can you please use normal capitalization instead of all caps? PARENT -> Parent
Regression can be the parent of the first commit in the PR that fixes the issue (not last).
Please provide a link to both the PR commit page https://github.com/Rdatatable/data.table/pull/5493/commits so we can see the first commit id which is 93ce3ce1373bf733ebd2036e2883d2ffe377ab58
and the commit page for the first commit Rdatatable/data.table@93ce3ce so we can see the parent commit id, which is Rdatatable/data.table@0895fa2

@tdhock
Copy link

tdhock commented Mar 30, 2024

also please ask any questions required to clarify your understanding. We discussed this on Tuesday and I thought you understood already?

@DorisAmoakohene
Copy link
Owner Author

DorisAmoakohene commented Apr 1, 2024

Apologies, I think i mixed the commits of the Before.

Yes, suppose we have Commit A (Before), Commit B (Regression) and commit C (fix), Commit A represents the initial state where the issue was reported. Commit B is the regression commit that introduced the problem, and Commit C is the fix commit that resolves the issue. Commit B would typically have Commit A as its parent, indicating that it is based on the state of the code at Commit A. Commit C, on the other hand, would typically have Commit B as its parent, indicating that it is based on the state of the code at Commit B.
Our last meeting you also mentioned that the Regression can be the parent of any of the commits in the fixed PR, and I take the last commit in the PR as the one that fixes the issue.
Additionally, the Before commits can be the PR that reported the issue.
However, I would like you to clarify if the Before commits can also be the parent of the Regression commits?

@DorisAmoakohene
Copy link
Owner Author

  "Regression"="0bacebc9b813d84b9b267e0928b5fd7c7ea126fb", #PARENT OF THE LAST OF THE PR THAT BFIXES THE ISSUE (https://github.com/Rdatatable/data.table/commit/1e03fe7b890e63da9651d997ea52548c90b3ae32)

is not properly documented, can you please fix? Can you please use normal capitalization instead of all caps? PARENT -> Parent Regression can be the parent of the first commit in the PR that fixes the issue (not last). Please provide a link to both the PR commit page https://github.com/Rdatatable/data.table/pull/5493/commits so we can see the first commit id which is 93ce3ce1373bf733ebd2036e2883d2ffe377ab58 and the commit page for the first commit Rdatatable/data.table@93ce3ce so we can see the parent commit id, which is Rdatatable/data.table@0895fa2

got it, thanks

@tdhock
Copy link

tdhock commented Apr 1, 2024

you wrote "Regression can be the parent of any of the commits in the fixed PR," which is incorrect. Regression can only be the parent of the first commit in the Fixed PR, because any of the commits in that PR could have provided the fix.
"I take the last commit in the PR as the one that fixes the issue." that is possibly incorrect, as it could have been the first commit in the PR which fixes the issue. Taking the parent of the first commit is the only way to be sure.

"Additionally, the Before commits can be the PR that reported the issue." -> this does not make sense. issues are reported in issues, not PRs.

"However, I would like you to clarify if the Before commits can also be the parent of the Regression commits?" Before commit can be the parent of the first comit in the PR which caused the regression (because any of the commits in that PR may be responsible for the regression).

@DorisAmoakohene
Copy link
Owner Author

DorisAmoakohene commented Apr 1, 2024

"Before"="19b7866112614db53eb3e909c097407d91cd6738",#parent of the regression commit (https://github.com/Rdatatable/data.table/commit/93ce3ce1373bf733ebd2036e2883d2ffe377ab58), which is the parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/pull/5493/commits) this is the link to the commit(https://github.com/Rdatatable/data.table/commit/19b7866112614db53eb3e909c097407d91cd6738)

  "Regression"="0895fa247afcf6b38044bd5f56c0d209691ddb31", #This commit(https://github.com/Rdatatable/data.table/commit/93ce3ce1373bf733ebd2036e2883d2ffe377ab58) is parent of the first commit of the PR that fixes the issue  (https://github.com/Rdatatable/data.table/pull/5493/commits)

  "Fixed"="1e03fe7b890e63da9651d997ea52548c90b3ae32")# last commit of the PR that fixes the issue (https://github.com/Rdatatable/data.table/pull/5493/commits)
  
  From the above, if I have commits like this? are these commits right?

@tdhock
Copy link

tdhock commented Apr 2, 2024

Before needs improvement/fix. Before can be parent of first commit in PR which causes the issue (you wrote fixes instead of causes). also please do not write "this is a link to the commit 19b" in the comment, as that is redundant with the Before="19b..." in the code. (potentially confusing)

Regression doc needs improvement. You wrote "This commit (93ce3ce) is parent of the first commit of the PR that fixes" but actually 93ce3ce is the first commit (not parent) of the PR #5493, can you please fix by moving parentheses? #parent of first commit (93ce3ce) etc

@tdhock
Copy link

tdhock commented Apr 2, 2024

also please use triple quotes when writing R code in issue comments:

```r
Before="93ce3etc"
```

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants