Performance testing

This page describes the plan for performance testing.

atime overview and example

We use atime on GitHub Actions for performance testing.

A performance test template is:

  # Comments with links to related issues/PRs (delete this line)
  # Issue reported in https://github.com/Rdatatable/data.table/issues/4498
  # To be fixed in: https://github.com/Rdatatable/data.table/pull/4501
  "Short test name" = atime::atime_test(
    # arguments (N, setup, expr, pkg.edit.fun) to pass to atime_versions (delete this line)
    N = 10^seq(5, 7),
    setup = {
      L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
      setkey(L, V1)
    },
    expr = {
      L[, .SD]
    },
    Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
    Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
  )

Note above that each git SHA1 version is specified with an argument name, which will appear in the output, and a comment, so we can later understand where the SHA1 came from.

  • If there is a historical commit/PR known to have caused a regression, then use the argument names Before, Regression, and Fixed: Before and Fixed should be fast, and Regression should be slow (see the sketch after this list).
  • If there is no known point in the past which was fast, then use the argument names Fast and Slow: Fast is a new version after the fix, and Slow is an old version before the fix, as in the template above.
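
For illustration, a Before/Regression/Fixed test might look like the following sketch; the test name and all SHA1 values below are placeholders, not real commits:

  # Hypothetical sketch only: replace the placeholder strings below with
  # real SHA1 values, plus comments linking to the relevant issue/PR.
  "Example regression test" = atime::atime_test(
    N = 10^seq(1, 7),
    setup = {
      DT = data.table(x = rnorm(N))
    },
    expr = {
      data.table:::`[.data.table`(DT, , .(sum(x)))
    },
    Before = "SHA1 of a commit before the regression",        # fast
    Regression = "SHA1 of the commit causing the regression", # slow
    Fixed = "SHA1 of the commit fixing the regression")       # fast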

Documentation of historical commits

For each historical commit, make sure the corresponding comment has links to GitHub web pages where we can see where this SHA1 came from. There are two such types of links: a link to a commit, and a link to the list of commits in a PR (both appear in the template above).

Also, you should use data.table:::`[.data.table` instead of [, because the different versions work by substituting data.table:: with data.table.someSHAversion:: (a different package name is created and installed for each version), and for that renamed package to work you need to provide pkg.edit.fun (https://github.com/Rdatatable/data.table/blob/master/.ci/atime/tests.R#L43).
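
For reference, here is a minimal sketch of what a pkg.edit.fun can do, assuming the function signature documented by atime; the real data.table version (linked above) does more, for example editing Makevars and the C-level initialization code:

  # Sketch only: the real pkg.edit.fun lives in .ci/atime/tests.R.
  pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
    # Replace FIND with REPLACE in each file matching glob under the
    # checked-out copy of the package sources at new.pkg.path.
    pkg_find_replace <- function(glob, FIND, REPLACE) {
      for (f in Sys.glob(file.path(new.pkg.path, glob))) {
        writeLines(gsub(FIND, REPLACE, readLines(f)), f)
      }
    }
    # Rename the package so this version can be installed alongside others.
    pkg_find_replace("DESCRIPTION",
      paste0("Package:\\s+", old.Package),
      paste("Package:", new.Package))
    pkg_find_replace("NAMESPACE", old.Package, new.Package)
  }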

Running locally

  • To run all performance tests locally on your machine, use atime::atime_pkg("path/to/data.table")
  • To run a single performance test on your machine, copy the same arguments from atime_test to atime_versions, as in the example below:
vinfo <- atime::atime_versions(
    pkg.path = "path/to/data.table",
    pkg.edit.fun = pkg.edit.fun, # as defined in .ci/atime/tests.R
    N = 10^seq(1, 7, by=0.5),
    setup = {
        L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
        setkey(L, V1)
    },
    expr = {
        data.table:::`[.data.table`(L, , .SD)
    },
    Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
    Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo) # time/memory versus N, one curve per version
refs <- atime::references_best(vinfo)
plot(refs) # empirical measurements with best-fitting asymptotic reference curves
pred <- predict(refs)
plot(pred) # largest N each version can handle within the time limit

Alternatively, you can evaluate the code in .ci/atime/tests.R and then use do.call as below:

do.call(atime::atime_versions, c(test.list[["setDT improved in #5427"]], pkg.path="path/to/data.table"))
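
Note that tests.R must be evaluated first, so that test.list (and the pkg.edit.fun its tests reference) exist in your session; a minimal sketch, assuming a local clone of data.table:

# Assumes a local clone; source() works because tests.R assigns
# test.list (and pkg.edit.fun) at the top level.
source("path/to/data.table/.ci/atime/tests.R")

After that, the do.call above returns the same kind of result as atime_versions, so plot, references_best, and predict can be used on it as in the previous example.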

Past issues

Performance testing was done in several past issues and PRs.

The discussions/code above make use of the atime R package, which can measure time and memory for different versions of data.table, so we can use it to compare master, release, and PR versions.

  • Michael says that it is important to run performance tests on several platforms (link), but on GitHub CI as of 2024 we are actually only using Linux.
  • The GitHub Actions documentation says that Windows, macOS, and Linux runner images are provided.

Related team

A team, Performance Testers, is assigned to anyone who is actively involved with the performance testing aspects of data.table. Responsibilities that fall under this specialized role include, but are not restricted to:

  • Evaluating the scalability of data.table functions, to track how they perform as dataset sizes grow asymptotically.
  • Running comparative performance benchmarks to show the relative efficiency of operations, e.g., in contrast to other packages that provide functionality similar to data.table.
  • Writing open-source material like blog posts to document such benchmarks (with the code to run them provided therein), as these tend to be a great resource for the community. Examples: df-atime-figures, df-partial-match
  • Designing test scenarios to measure performance, such as handling large datasets, performing complex queries, running concurrent operations, etc.
  • Staying informed about the latest developments in R programming and performance-testing methodologies, in order to bring such updates to data.table.