Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set automatically allocates new column slots if needed #5269

Open
wants to merge 27 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -544,6 +544,8 @@

51. `merge.data.table()` silently ignored the `incomparables` argument, [#2587](https://github.com/Rdatatable/data.table/issues/2587). It is now implemented and any other ignored arguments (e.g. misspellings) are now warned about. Thanks to @GBsuperman for the report and @ben-schwen for the fix.

52. `set()` now automatically pre-allocates new column slots if needed, similar to what `:=` already does, [#496](https://github.com/Rdatatable/data.table/issues/496) [#4100](https://github.com/Rdatatable/data.table/issues/4100). Thanks to Huashan Chen and Benjamin Tyner for the report and Benjamin Schwendinger for the fix.

## NOTES

1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
Expand Down
9 changes: 8 additions & 1 deletion R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2715,7 +2715,14 @@ setcolorder = function(x, neworder=key(x), before=NULL, after=NULL) # before/af

set = function(x,i=NULL,j,value) # low overhead, loopable
{
.Call(Cassign,x,i,j,NULL,value)
if (is.character(j)) {
name = substitute(x)
x = .Call(Cassign,x,i,j,NULL,value)
if (is.name(name))
assign(as.character(name),x,parent.frame(),inherits=TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assign() looks pretty complicated to reason about, I would like to see more tests of robustness here (setDT() within a function, places where inherits=TRUE is important, etc.).

At a glance I can't tell why this is needed though -- it definitely warrants a comment explaining why we need to branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified now the whole code. I also added a comment why we need the assign.

Ultimately it would be nice to let set work also in the cases where it needs to overallocate and is inside a function, howeover, for making this to work we would need to find the right environment where to assign (might not be the first where x is present) and it is not clear whether it would be the last due to scoping

} else {
.Call(Cassign,x,i,j,NULL,value)
}
invisible(x)
}

Expand Down
14 changes: 12 additions & 2 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -1266,7 +1266,8 @@ test(416, DT[J(2)], error="the columns to join by must be specified using")

# Test shallow copy verbose message from := adding a column, and (TO DO) only when X is NAMED.
DT = data.table(a=1:3,b=4:6)
test(417, alloc.col(DT,3,verbose=TRUE), DT, output="Attempt to reduce allocation from.*to 5 ignored. Can only increase allocation via shallow copy")
test(417.1, alloc.col(DT,3,verbose=TRUE), DT, output="Attempt to reduce allocation from.*to 5 ignored. Can only increase allocation via shallow copy")
test(417.2, alloc.col(DT,3,verbose=1L), error="verbose must be TRUE or FALSE")
options(datatable.alloccol=1L)
DT = data.table(a=1:3,b=4:6)
options(datatable.alloccol=1024L)
Expand Down Expand Up @@ -14885,7 +14886,7 @@ test(2016.1, name, "DT")
test(2016.2, DT, data.table(a=1:3))
test(2016.3, DT[2,a:=4L], data.table(a=INT(1,4,3))) # no error for := when existing column
test(2016.4, set(DT,3L,1L,5L), data.table(a=INT(1,4,5))) # no error for set() when existing column
test(2016.5, set(DT,2L,"newCol",5L), error="either been loaded from disk.*or constructed manually.*Please run setDT.*setalloccol.*on it first") # just set()
test(2016.5, set(DT,2L,"newCol",5L), data.table(a=INT(1,4,5), newCol=INT(NA,5L,NA))) # works since set overallocates #4100
test(2016.6, DT[2,newCol:=6L], data.table(a=INT(1,4,5), newCol=INT(NA,6L,NA))) # := ok (it changes DT in caller)
unlink(tt)

Expand Down Expand Up @@ -18702,3 +18703,12 @@ test(2234.7, DT[, min(.SD), by=c(.I)], data.table(I=1L:5L, V1=c(1L, 2L, 3L, 2L,
test(2234.8, DT[, min(.SD), by=.I%%2L], error="by.*contains .I.*supported") # would be nice to support in future; i.e. by odd/even rows, and by=(.I+1L)%/%2L for pairs of rows; i.e. any expression of .I
test(2234.9, DT[, min(.SD), by=somefun(.I)], error="by.*contains .I.*supported")

# re-overallocate in set if quota is reached #496 #4100
opt = options(datatable.alloccol=1L)
n = getOption("datatable.alloccol")
DT = data.table()
for (i in seq(10*n)) set(DT, j = paste0("V",i), value = i)
test(2235.1, ncol(DT), 10L*n)
DT = structure(list(a = 1, b = 2), class = c("data.table", "data.frame"))
test(2235.2, set(DT, j="c", value=3), data.table(a=1, b=2, c=3))
options(opt)
23 changes: 12 additions & 11 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -289,7 +289,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
// newcolnames : add these columns (if any)
// cols : column names or numbers corresponding to the values to set
// rows : row numbers to assign
R_len_t numToDo, targetlen, vlen, oldncol, oldtncol, coln, protecti=0, newcolnum, indexLength;
R_len_t numToDo, targetlen, vlen, oldncol, tl, coln, protecti=0, newcolnum, indexLength;
SEXP targetcol, nullint, s, colnam, tmp, key, index, a, assignedNames, indexNames;
bool verbose=GetVerbose();
int ndelete=0; // how many columns are being deleted
Expand Down Expand Up @@ -456,21 +456,22 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
// modify DT by reference. Other than if new columns are being added and the allocVec() fails with
// out-of-memory. In that case the user will receive hard halt and know to rerun.
if (length(newcolnames)) {
oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.

if (oldtncol<oldncol) {
if (oldtncol==0) error(_("This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.")); // #2996
error(_("Internal error: oldtncol(%d) < oldncol(%d). Please report to data.table issue tracker, including result of sessionInfo()."), oldtncol, oldncol); // # nocov
tl = TRUELENGTH(dt);
int overAlloc = checkOverAlloc(GetOption(install("datatable.alloccol"), R_NilValue));
// perform re-overallocation since DT has too less columns #1831 #4100
if (tl < oldncol+LENGTH(newcolnames)) {
// use MAX to mitigate case where data is loaded from disk (readRDS()/load()) or constructed manually (e.g. using structure())
dt = shallow(dt,R_NilValue, MAX(oldncol-tl, 0) + oldncol + length(newcolnames) + overAlloc);
tl = TRUELENGTH(dt);
names = PROTECT(getAttrib(dt, R_NamesSymbol)); protecti++;
}
if (oldtncol>oldncol+10000L) warning(_("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo()."),oldtncol, oldncol);
if (oldtncol < oldncol+LENGTH(newcolnames))
error(_("Internal error: DT passed to assign has not been allocated enough column slots. l=%d, tl=%d, adding %d"), oldncol, oldtncol, LENGTH(newcolnames)); // # nocov
if (tl>oldncol+10000L) warning(_("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo()."),tl, oldncol);
if (!selfrefnamesok(dt,verbose))
error(_("It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to data.table issue tracker.")); // # nocov
// Can growVector at this point easily enough, but it shouldn't happen in first place so leave it as
// strong error message for now.
else if (TRUELENGTH(names) != oldtncol)
error(_("Internal error: selfrefnames is ok but tl names [%d] != tl [%d]"), TRUELENGTH(names), oldtncol); // # nocov
else if (TRUELENGTH(names) != tl)
error(_("Internal error: selfrefnames is ok but tl names [%d] != tl [%d]"), TRUELENGTH(names), tl); // # nocov
SETLENGTH(dt, oldncol+LENGTH(newcolnames));
SETLENGTH(names, oldncol+LENGTH(newcolnames));
for (int i=0; i<LENGTH(newcolnames); ++i)
Expand Down