Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace integers with explicit integers (1 -> 1L, etc.) #2573

Merged
merged 19 commits into from
Feb 6, 2018

Conversation

MichaelChirico
Copy link
Member

Part of #2572, tried to make sure I respected "strong typing" (don't force integer on potential numerics).

Command to find these:

#numeric ones (1 -> 1L)
grep '^[^#]*[^a-zA-Z_]1[^L]' R/* > ~/Desktop/all1s.out
#for all integers
grep '^[^#]*[^a-zA-Z_0-9][0-9][^0-9L]' R/* > ~/Desktop/numeric_integers.out

@mattdowle
Copy link
Member

mattdowle commented Jan 18, 2018

Very nice!
I'm not sure the status of global small integer constants in base R (i.e. whether altrep got to that or not and if that's been merged to R-devel) but I know it was planned. Either way, this PR is needed to have the best chance to benefit from that.
Interested to see the time of test.data.table() (without Suggests packages, just plain DT tests) before and after. But even if no speedup on that, it's still worth doing.

@MichaelChirico
Copy link
Member Author

head and tail GForce optimization failures are legit:

dt = data.table(
  x  = sample(letters, 300, TRUE),
  d1 = as.numeric(sample(-10:10, 300, TRUE)),
  d2 = as.numeric(sample(c(NA, NaN, -10:10), 300, TRUE))
)
# not optimized now
dt[x %in% letters[15:20], head(.SD,1), by=x, verbose=TRUE]
# still optimized
dt[x %in% letters[15:20], head(.SD,1L), by=x, verbose=TRUE]
#idem
dt[x %in% letters[15:20], tail(.SD,1), by=x, verbose=TRUE]
dt[x %in% letters[15:20], tail(.SD,1L), by=x, verbose=TRUE]

Doesn't make sense to require users to run head(.SD, 1L) to benefit from optimization (even if it's best practice); GForce should detect and convert.

Investigating, though I'm in the dark here for a moment. Sometimes strong typing would be nice 😬

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 18, 2018

Also, should probably turn the lines like:

if (verbose) {cat(round(proc.time()[3L]-last.started.at, 3L),"secs\n");flush.console}

into a function since they're so common -- @mattdowle IIUC this is what the timetaken function is for?

@mattdowle edit : yes, it should be using timetaken there.

R/data.table.R Outdated
as.character(q[[1]]) %chin% "[" && is.numeric(q[[3]]) &&
length(q[[3]])==1 && q[[3]]>0 )
ans = cond && length(q)==3L && ( as.character(q[[1L]]) %chin% c("head", "tail") &&
(identical(q[[3L]], 1L) || identical(q[[3L]], 1L)) ||
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this can't just be q[[3L]] == 1L?

R/data.table.R Outdated
ans = cond && length(q) == 3L &&
length(q[[3L]]) == 1L && is.numeric(q[[3L]]) && (
as.character(q[[1L]]) %chin% c("head", "tail") && q[[3L]] == 1L ||
as.character(q[[1L]]) %chin% "[" && q[[3L]] > 0 )
Copy link
Member Author

@MichaelChirico MichaelChirico Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic now seems more natural to me (in particular, I see no reason to check is.numeric and length conditions only for the [ case and not for the head/tail case -- perhaps a few tests are in order?), and the code reads better too.

Copy link
Member

@mattdowle mattdowle Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of identical may have had NA in mind. Also, for the head / tail case, is.numeric and length were being checked via identical(). Yes more tests would be great. Seems like codecov agrees and is not-passing which is good (confirms it's a good exercise).
The line needs a comment saying what it's supposed to be doing, doesn't it!
Can you remove the added spaces around == please. To my eyes there is now lots of separate things all with space between them and it's harder to see the tree of logic. = is already a different character to letters (and is not valid in symbols) so = is already a visual separator. Removing the spaces around == allows you to budge the pieces together in this case, so that the remaining spaces are more helpful.
Isn't head and tail with n>1 now going to pass? That doesn't seem to be the intent of the original line (only n==1L and n==1 were allowed, iiuc). Ok I see the q[[3L]]==1L now.
An extra set of parens around the left and right of the last || would be clearer. There was a paren around the left of that || in the original but not on its right.

@codecov-io
Copy link

codecov-io commented Jan 18, 2018

Codecov Report

Merging #2573 into master will increase coverage by 1.45%.
The diff coverage is 94.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2573      +/-   ##
==========================================
+ Coverage   91.49%   92.94%   +1.45%     
==========================================
  Files          63       61       -2     
  Lines       12229    12107     -122     
==========================================
+ Hits        11189    11253      +64     
+ Misses       1040      854     -186
Impacted Files Coverage Δ
R/fmelt.R 100% <ø> (ø) ⬆️
R/utils.R 82.6% <ø> (ø) ⬆️
R/as.data.table.R 100% <100%> (ø) ⬆️
R/foverlaps.R 100% <100%> (+5.69%) ⬆️
R/getdots.R 100% <100%> (ø) ⬆️
R/bmerge.R 100% <100%> (ø) ⬆️
R/merge.R 93.93% <100%> (ø) ⬆️
R/print.data.table.R 98.13% <100%> (ø) ⬆️
R/groupingsets.R 100% <100%> (ø) ⬆️
R/IDateTime.R 85.61% <100%> (+5.47%) ⬆️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05c5122...5be3049. Read the comment docs.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 18, 2018

@mattdowle Ran this to test speed:

library(data.table)
library(magrittr)
replicate(10, {
  capture.output(suppressMessages(test.data.table())) %>% 
    capture.output %>% 
    grep('completed ok', ., value = TRUE) %>%
    gsub('.*completed ok in ([0-9:]+).*', '\\1', .) %>%
    strsplit(':') %>% extract2(1L) %>% extract(-1L) %>%
    as.numeric %>% (function(x) {x[1L] + x[2L]/60})
}) %>% mean

Average duration from testing 10 times on version currently at master:

# [1] 1.818333

Compared to this branch:

# [1] 1.758333

About a 3% improvement... pretty damn good if you ask me! Certainly more than I expected.

(I do have many of the external packages; perhaps we should add a testDeps argument to test.data.table?)

Loading required package: bit64
Loading required package: bit
Loading required package: knitr
Loading required package: ggplot2
Loading required package: plyr
Loading required package: reshape2
Loading required package: nlme
Loading required package: curl
Loading required package: rmarkdown
Loading required package: parallel

@mattdowle
Copy link
Member

mattdowle commented Jan 18, 2018

Nice result. What's the improvement in the min not mean, though (perhaps even better!)
Why not just time test.data.table() with n=10 using microbenchmark? Its runtime is the same as the time it prints at the end isn't it? It's important to have the gc() run by system.time, especially here where the change is something related to many small objects. Otherwise gc() will run a bit more randomly and the time to gc() is included in your replicate. Probably gc() happens so often in the test suite that it makes no difference, but easier to rule that out.
Yes : testDeps added to test.data.table would be great.

@MichaelChirico
Copy link
Member Author

Certainly easier to do, but I wasn't sure if there's any overhead in test.data.table() that's skipped by its internal timer

@mattdowle
Copy link
Member

mattdowle commented Jan 18, 2018

I'd be surprised if the overhead is more than negligible. Easy to check...

> system.time(test.data.table())
...
All 7596 tests in inst/tests/tests.Rraw completed ok in 00:01:23
   user  system elapsed 
 85.512   0.716  82.933 

00:01:23 == 82.933

@MichaelChirico
Copy link
Member Author

Hmm, slightly different for me:

All 7661 tests in inst/tests/tests.Rraw completed ok in 00:01:48 on Thu Jan 18 16:46:12 2018
   user  system elapsed 
166.901   7.128 109.743 

anyway i'll run with microbenchmark

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 18, 2018

Not as impressive this time (<1%):

Unit: seconds
            expr      min       lq     mean   median      uq     max neval
 explicitInteger  96.4129  97.7939 101.9369 101.4272 105.417 108.409    10
          master 100.3113 100.9985 102.4299 102.3071 103.081 106.099    10

(cobbled together from microbenchmark runs on separate sessions)

}
stop("problem recycling column ",i,", try a simpler type")
# }
stop("argument ",i," (nrow ",nrows[i],") cannot be recycled without remainder to match longest nrow (",nr,")")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As near as I can tell this error is superseded by the above warning? Also eliminated the commented branch since it seems to be vestigial.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. +1

@@ -22,18 +22,18 @@ inrange <- function(x,lower,upper,incbounds=TRUE) {
subject = setDT(list(l=lower, u=upper))
ops = if (incbounds) c(4L, 2L) else c(5L, 3L) # >=,<= and >,<
verbose = getOption("datatable.verbose")
if (verbose) {last.started.at=proc.time()[3];cat("forderv(query) took ... ");flush.console()}
Copy link
Member Author

@MichaelChirico MichaelChirico Jan 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattdowle what about two functions, timestart and timetaken? timestart would accept a message as input, cat it, flush.console()s and return the proc.time(); timetaken takes the result of timestart, cats it, and flush.console(), and returns nothing.

(also perhaps this should be done as a separate PR?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes separate PR. Sometimes there are several nested timing blocks in play. Storing the started.at time in a variable and then passing that (or the appropriate one) to timetaken I think I like the balance of using a common function along with the flexibility of using it in different ways depending on the circumstances. i.e. current timetaken() approach, but actually using it here. IIUC.

@mattdowle
Copy link
Member

On the contrary, I'd say the min and lq figures confirm the 3%-4% improvement. The other figures are affected too much by a varying number of gc()'s or effects like that. Given that the test suite has a large number of small arbitrary tests and we didn't think at all about an appropriate test, it's impressive any improvement shows up at all. Take a particular test that hits those the changed lines, and repeat that one test many times and the improvement could be larger (but I'm not suggested doing that, no need). In R-devel the improvement may be greater by avoiding the small creation at all (which wasn't possible before as a type double).

@mattdowle
Copy link
Member

mattdowle commented Jan 23, 2018

Looks like it's a correct fail on test 1187.6 I see locally. I'll leave to you.

@MichaelChirico MichaelChirico force-pushed the explicitIntegers branch 6 times, most recently from 97b1f47 to bf9fdf6 Compare January 24, 2018 08:43
R/foverlaps.R Outdated
@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
}
# nomatch has no effect here, just for passing arguments consistently to `bmerge`
.Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
if (maxgap == 0L && minoverlap == 1L) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunsrinivasan check out this diff. axed iintervals as it appears unused. And axed the extra argument checks to simplify so long as the other options are unimplemented

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico good catch regarding iintervals. Seems like it wasn't used right from the first commit of foverlaps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you simplified code, but I left it there to serve as a placeholder when I'll come back to work on it... The first if statement would always be run currently. So that's the only cost here, which is fine.. Would it be possible to revert this part to how it was please?

The PR overall looks great! Apologies for not being able to write earlier.

Copy link
Member Author

@MichaelChirico MichaelChirico Feb 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunsrinivasan I figured that's why you left it like that... only changed the logic to make testing easier, Codecov was complaining since it's impossible to get to the other branches to test them (all hail the mighty Codecov 🌞 ⛪️ ). I figured it'd be easy enough to revert to the old code when needed (that's what git's for right?)

as a compromise, I could leave the old code in, just commented out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not wrap those lines with # nocov start and # nocov end so as to skip coverage tests? See https://github.com/Rdatatable/data.table/blob/70db3e48738026388c6487e83ee902ad37256ca5/R/c.factor.R

Yes, commented out with a comment pointing to this PR would be okay too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply because i didn't know that was possible! I'll add that to onAttach and onLoad as well... thanks!

@MichaelChirico
Copy link
Member Author

@mattdowle OK I think I've hammered away enough at the codecov diff miss for now. AFAICT the major remaining miss is from changes to onAttach.R, which I don't think (?) are testable.

@mattdowle mattdowle added this to the v1.10.6 milestone Jan 30, 2018
Copy link
Member

@mattdowle mattdowle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic!
This could help memory usage and gc() time, too (not just the 3-4% you found), by avoiding all those length 1 numeric vectors. Anyway, it puts data.table in the best position to benefit from the global-small-integer optimization on the way in R itself.
And increase of coverage by 0.33% (from 91.48% to 91.81%) is worth it by itself.

I've requested review from @arunsrinivasan re the foverlaps changes. Maybe those removed lines could be captured in an issue to implement or as a comment, if not already.

Copy link
Member

@arunsrinivasan arunsrinivasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico overall great PR! I've commented under foverlaps.

Apologies for not replying earlier.

R/foverlaps.R Outdated
@@ -15,7 +15,7 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
mult = match.arg(mult)
if (type == "equal")
stop("type = 'equal' is not implemented yet. But note that this is just the same as a normal data.table join y[x, ...], unless you are also interested in setting 'minoverlap / maxgap' arguments. But those arguments are not implemented yet as well.")
if (maxgap > 0L || minoverlap > 1L)
if (maxgap != 0L || minoverlap != 1L)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxgap < 0 and minoverlap < 1 both are checked in lines 6 and 8. What's the purpose of this change?

R/foverlaps.R Outdated
setkey(uy)[, `:=`(lookup = list(list(integer(0))), type_lookup = list(list(integer(0))), count=0L, type_count=0L)]
if (verbose) {cat(round(proc.time()[3]-last.started.at,3),"secs\n");flush.console}
setkey(uy)[, `:=`(lookup = list(list(integer(0L))), type_lookup = list(list(integer(0L))), count=0L, type_count=0L)]
if (verbose) {cat(timetaken(last.started.at)); flush.console()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused as to how these changes are related to this PR (1 -> 1L like enhancements). But I'm guessing you've discussed these changes with Matt.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunsrinivasan the Codecov report was failing pretty miserably on the PR, since it seems to have uncovered a lot of un-tested lines (this one in particular would have come up from 3 becoming 3L, for example

R/foverlaps.R Outdated
@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
}
# nomatch has no effect here, just for passing arguments consistently to `bmerge`
.Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
if (maxgap == 0L && minoverlap == 1L) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico good catch regarding iintervals. Seems like it wasn't used right from the first commit of foverlaps.

R/foverlaps.R Outdated
@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
}
# nomatch has no effect here, just for passing arguments consistently to `bmerge`
.Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
if (maxgap == 0L && minoverlap == 1L) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you simplified code, but I left it there to serve as a placeholder when I'll come back to work on it... The first if statement would always be run currently. So that's the only cost here, which is fine.. Would it be possible to revert this part to how it was please?

The PR overall looks great! Apologies for not being able to write earlier.

Michael Chirico added 5 commits February 1, 2018 14:53
Merge branch 'master' of https://github.com/Rdatatable/data.table into explicitIntegers

# Conflicts:
#	R/data.table.R
#	inst/tests/tests.Rraw
Merge branch 'explicitIntegers' of https://github.com/Rdatatable/data.table into explicitIntegers

# Conflicts:
#	inst/tests/tests.Rraw
@@ -1,10 +1,10 @@

# nocov start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a nocov start here? Surely test.data.table is tested; it's the main entry point to the test suite!

@@ -56,6 +61,7 @@ compactprint <- function(DT, topn=2) {
invisible()
}

# nocov start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here. INT() is used in lots of tests. Why exclude it from code cov?

R/foverlaps.R Outdated
if (verbose) {cat(timetaken(last.started.at));flush.console()}
olaps = .Call(Coverlaps, uy, xmatches, mult, type, nomatch, verbose)
}
# nocov start
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

… newline in its return value; e.g. at the end of tests.Rraw it's used assuming no newline included
@mattdowle mattdowle merged commit 3f835d4 into master Feb 6, 2018
@mattdowle mattdowle deleted the explicitIntegers branch February 6, 2018 02:01
@MichaelChirico
Copy link
Member Author

@mattdowle thanks for the detailed review, i learned a lot about codecov here :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants