Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set automatically allocates new column slots if needed #5269

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
53d5160
add re-overallocation to set
ben-schwen Nov 21, 2021
a035fbe
refine test
ben-schwen Nov 21, 2021
315eb3c
avoid double substitute by calling Calloccolwrapper directly
ben-schwen Nov 21, 2021
b1db2ab
add length check to avoid chmatch
ben-schwen Nov 21, 2021
7146873
add news
ben-schwen Nov 21, 2021
67a0218
alter old test
ben-schwen Nov 21, 2021
6b35f8e
merge master
ben-schwen Nov 21, 2021
cda7455
add coverage
ben-schwen Nov 21, 2021
b61e7ff
remove old error which is not reachable anymore
ben-schwen Nov 21, 2021
517c62a
grammar
MichaelChirico Dec 19, 2021
7469494
add issue number to tests
ben-schwen Dec 19, 2021
e9d0e9c
remove 2nd eval
ben-schwen Dec 20, 2021
ca86485
merge master
ben-schwen Dec 20, 2021
592853d
move set to c level
ben-schwen Dec 26, 2021
86dfc42
proper use of alloccol
ben-schwen Dec 26, 2021
3eebecf
update test
ben-schwen Dec 26, 2021
1aa5534
rewrite R call
ben-schwen Dec 27, 2021
249a471
fix tests
ben-schwen Dec 27, 2021
f65a8cc
exchange alloccol with shallow
ben-schwen Dec 27, 2021
f95c1c4
add test for TRUELENGTH=0 set
ben-schwen Dec 27, 2021
e4521a1
rewrite set
ben-schwen Sep 11, 2024
03833c9
Merge branch 'master' into set_re-overallocate
ben-schwen Sep 11, 2024
ad8c97d
fix typo
ben-schwen Sep 11, 2024
6fd0907
add tests
ben-schwen Sep 11, 2024
4a96f80
change test numbering
ben-schwen Sep 11, 2024
074b805
add test
ben-schwen Sep 11, 2024
26cdf8f
fix inherits
ben-schwen Sep 11, 2024
c4bdbea
Merge branch 'master' into set_re-overallocate
MichaelChirico Jan 17, 2025
c4174e7
style
MichaelChirico Jan 17, 2025
23b1c9d
bullet number
MichaelChirico Jan 17, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,8 @@ rowwiseDT(

19. Grouped queries on keyed tables no longer return an incorrectly keyed result if the _ad hoc_ `by=` list has some function call (in particular, a function which happens to return a strictly decreasing function of the keys), e.g. `by=.(a = rev(a))`, [#5583](https://github.com/Rdatatable/data.table/issues/5583). Thanks @AbrJA for the report and @MichaelChirico for the fix.

20. `set()` now automatically pre-allocates new column slots if needed, similar to what `:=` already does, [#496](https://github.com/Rdatatable/data.table/issues/496) [#4100](https://github.com/Rdatatable/data.table/issues/4100). Thanks to Huashan Chen and @tyner for the report and Benjamin Schwendinger for the fix.

## NOTES

1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).
Expand Down
11 changes: 8 additions & 3 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2758,9 +2758,14 @@ setcolorder = function(x, neworder=key(x), before=NULL, after=NULL, skip_absent=
invisible(x)
}

set = function(x,i=NULL,j,value) # low overhead, loopable
{
.Call(Cassign,x,i,j,NULL,value)
set = function(x, i=NULL, j, value) {
name = as.character(substitute(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this open the same can of worms that we're seeing with setDT()? #6706, #6723, ...

We need to consider set(env$nm, ...) and a whole litany of other ways set() can be used on set(<call>, ...) besides as set(<name>, ...), right?

old_add = address(x)
x = .Call(Cassign, x, i, j, NULL, value)
if (old_add != address(x)) {
# assign is needed to replace x on address change due to possible new allocation
assign(name, x, envir=parent.frame())
}
invisible(x)
}

Expand Down
14 changes: 12 additions & 2 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -1325,7 +1325,8 @@ test(416, DT[J(2)], error="the columns to join by must be specified using")

# Test shallow copy verbose message from := adding a column, and (TO DO) only when X is NAMED.
DT = data.table(a=1:3,b=4:6)
test(417, alloc.col(DT,3,verbose=TRUE), DT, output="Attempt to reduce allocation from.*to 5 ignored. Can only increase allocation via shallow copy")
test(417.1, alloc.col(DT,3,verbose=TRUE), DT, output="Attempt to reduce allocation from.*to 5 ignored. Can only increase allocation via shallow copy")
test(417.2, alloc.col(DT,3,verbose=1L), error="verbose must be TRUE or FALSE")
options(datatable.alloccol=1L)
DT = data.table(a=1:3,b=4:6)
options(datatable.alloccol=1024L)
Expand Down Expand Up @@ -14717,7 +14718,7 @@ test(2016.1, name, "DT")
test(2016.2, DT, data.table(a=1:3))
test(2016.3, DT[2,a:=4L], data.table(a=INT(1,4,3))) # no error for := when existing column
test(2016.4, set(DT,3L,1L,5L), data.table(a=INT(1,4,5))) # no error for set() when existing column
test(2016.5, set(DT,2L,"newCol",5L), error="either been loaded from disk.*or constructed manually.*Please run setDT.*setalloccol.*on it first") # just set()
test(2016.5, set(DT,2L,"newCol",5L), data.table(a=INT(1,4,5), newCol=INT(NA,5L,NA))) # works since set overallocates #4100
test(2016.6, DT[2,newCol:=6L], data.table(a=INT(1,4,5), newCol=INT(NA,6L,NA))) # := ok (it changes DT in caller)
unlink(tt)

Expand Down Expand Up @@ -20764,3 +20765,12 @@ test(2303.1, DT[, .N, by=.(b=rev(a))], data.table(b=2:1, N=1L))
test(2303.2, DT[, .(N=1L), by=.(b=rev(a))], data.table(b=2:1, N=1L)) # ensure no interaction with GForce
DT = data.table(a=2:3, b=1:0, key=c('a', 'b'))
test(2303.3, DT[, .N, by=.(ab=a^b, d=c(1L, 1L))], data.table(ab=c(2, 1), d=1L, N=1L))

# re-overallocate in set if quota is reached #496 #4100
DT = data.table()
test(2304.1, options=c(datatable.alloccol=1L), {for (i in seq(10L)) set(DT, j = paste0("V",i), value = i); ncol(DT)}, 10L)
DT = structure(list(a = 1, b = 2), class = c("data.table", "data.frame"))
test(2304.2, options=c(datatable.alloccol=1L), set(DT, j="c", value=3), data.table(a=1, b=2, c=3))
# ensure := and set are consistent if they need to overallocate
DT = data.table(); DT2 = data.table()
test(2304.3, options=c(datatable.alloccol=1L), {for (i in seq(10L)) set(DT, j = paste0("V",i), value = i)}, {for (i in seq(10)) DT2[, sprintf("V%d",i) := i]})
23 changes: 12 additions & 11 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -332,7 +332,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
// newcolnames : add these columns (if any)
// cols : column names or numbers corresponding to the values to set
// rows : row numbers to assign
R_len_t numToDo, targetlen, vlen, oldncol, oldtncol, coln, protecti=0, newcolnum, indexLength;
R_len_t numToDo, targetlen, vlen, oldncol, tl, coln, protecti=0, newcolnum, indexLength;
SEXP targetcol, nullint, s, colnam, tmp, key, index, a, assignedNames, indexNames;
bool verbose=GetVerbose();
int ndelete=0; // how many columns are being deleted
Expand Down Expand Up @@ -514,22 +514,23 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
// modify DT by reference. Other than if new columns are being added and the allocVec() fails with
// out-of-memory. In that case the user will receive hard halt and know to rerun.
if (length(newcolnames)) {
oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.

if (oldtncol<oldncol) {
if (oldtncol==0) error(_("This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.")); // #2996
internal_error(__func__, "oldtncol(%d) < oldncol(%d)", oldtncol, oldncol); // # nocov
tl = TRUELENGTH(dt);
int overAlloc = checkOverAlloc(GetOption(install("datatable.alloccol"), R_NilValue));
// perform re-overallocation since DT has too less columns #1831 #4100
if (tl < oldncol+LENGTH(newcolnames)) {
// use MAX to mitigate case where data is loaded from disk (readRDS()/load()) or constructed manually (e.g. using structure())
dt = shallow(dt,R_NilValue, MAX(oldncol-tl, 0) + oldncol + length(newcolnames) + overAlloc);
tl = TRUELENGTH(dt);
names = PROTECT(getAttrib(dt, R_NamesSymbol)); protecti++;
}
if (oldtncol>oldncol+10000L) warning(_("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo()."),oldtncol, oldncol);
if (oldtncol < oldncol+LENGTH(newcolnames))
internal_error(__func__, "input dt has not been allocated enough column slots. l=%d, tl=%d, adding %d", oldncol, oldtncol, LENGTH(newcolnames)); // # nocov
if (tl>oldncol+10000L) warning(_("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo()."),tl, oldncol);
if (!selfrefnamesok(dt,verbose))
error(_("It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to data.table issue tracker.")); // # nocov
// Can growVector at this point easily enough, but it shouldn't happen in first place so leave it as
// strong error message for now.
else if (TRUELENGTH(names) != oldtncol)
else if (TRUELENGTH(names) != tl)
// Use (long long) to cast R_xlen_t to a fixed type to robustly avoid -Wformat compiler warnings, see #5768, PRId64 didn't work
internal_error(__func__, "selfrefnames is ok but tl names [%lld] != tl [%d]", (long long)TRUELENGTH(names), oldtncol); // # nocov
internal_error(__func__, "selfrefnames is ok but tl names [%lld] != tl [%d]", (long long)TRUELENGTH(names), tl); // # nocov
if (!selfrefok(dt, verbose)) // #6410 setDT(dt) and subsequent attr<- can lead to invalid selfref
error(_("It appears that at some earlier point, attributes of this data.table have been reassigned. Please use setattr(DT, name, value) rather than attr(DT, name) <- value. If that doesn't apply to you, please report your case to the data.table issue tracker."));
SETLENGTH(dt, oldncol+LENGTH(newcolnames));
Expand Down
Loading