diff --git a/NEWS.md b/NEWS.md index bcd5743ed..f0ddda29e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -131,28 +131,7 @@ rowwiseDT( 18. Grouped queries on keyed tables no longer return an incorrectly keyed result if the _ad hoc_ `by=` list has some function call (in particular, a function which happens to return a strictly decreasing function of the keys), e.g. `by=.(a = rev(a))`, [#5583](https://github.com/Rdatatable/data.table/issues/5583). Thanks @AbrJA for the report and @MichaelChirico for the fix. -19. Assigning `list(NULL)` to a list column now replaces the column with `list(NULL)`, instead of deleting the column [#5558](https://github.com/Rdatatable/data.table/issues/5558). This behavior is now consistent with base `data.frame`. Thanks @tdhock for the report and @joshhwuu for the fix. This is due to a fundamental ambiguity from both allowing list columns _and_ making the use of `list()` to wrap `j=` arguments optional. We think that the code behaves as expected in all cases now. See the below for some illustration: - - ```r - DT = data.table(L=list(1L), i=2L, c='a') - - DT[, i := NULL] # delete i - DT[, L := NULL] # delete L - - DT[, i := list(NULL)] # overwrite: identical(DT$i, list(NULL)) - # ^ ** THIS IS A CHANGE FROM PREVIOUS BEHAVIOR WHICH WOULD DELETE i ** - DT[, L := list(NULL)] # assignment: identical(DT$L, list(NULL)) - - DT[, i := .(3L)] # assignment: identical(DT$i, 3L) - DT[, i := .('a')] # overwrite: identical(DT$i, 'a') - DT[, L := .(list(NULL))] # assignment: identical(DT$L, list(NULL)) - - DT[, c('L', 'i') := list(NULL, NULL)] # delete L,i - DT[, c('L', 'i') := list(list(NULL), 3L)] # assignment: identical(DT$L, list(NULL)), identical(DT$i, 3L) - DT[, c('L', 'i') := list(NULL, 3L)] # delete L, assign to i - DT[, c('L', 'i') := list(list(NULL), NULL)] # assign to L, delete i - ``` -20. An integer overflow in `fread()` with lines longer than `2^(31/2)` bytes is prevented, [#6729](https://github.com/Rdatatable/data.table/issues/6729). The typical impact was no worse than a wrong initial allocation size, corrected later. Thanks to @TaikiSan21 for the report and @aitap for the fix. +19. An integer overflow in `fread()` with lines longer than `2^(31/2)` bytes is prevented, [#6729](https://github.com/Rdatatable/data.table/issues/6729). The typical impact was no worse than a wrong initial allocation size, corrected later. Thanks to @TaikiSan21 for the report and @aitap for the fix. ## NOTES diff --git a/R/data.table.R b/R/data.table.R index 5c95af965..eeb9a1ca2 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -1145,8 +1145,7 @@ replace_dot_alias = function(e) { } } names(jsub)="" - # dont wrap the RHS in list if it is a singular NULL and if not creating a new column - if (length(jsub[-1L]) == 1L && as.character(jsub[-1L]) == 'NULL' && all(lhs %chin% names_x)) jsub[[1L]]=as.name("identity") else jsub[[1L]]=as.name("list") + jsub[[1L]]=as.name("list") } av = all.vars(jsub,TRUE) if (!is.atomic(lhs)) stopf("LHS of := must be a symbol, or an atomic vector (column names or positions).") diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index cf1aa3d89..3a16d8ff6 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -15447,7 +15447,7 @@ L = list(1:3, NULL, 4:6) test(2058.18, length(L), 3L) test(2058.19, as.data.table(L), data.table(V1=1:3, V2=4:6)) # V2 not V3 # no DT = data.table(a=1:3, b=c(4,5,6)) -test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3, b=list(NULL))) # no +test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3)) # no # rbindlist improved error message, #3638 DT = data.table(a=1) @@ -20782,29 +20782,33 @@ test(2303.2, DT[, .(N=1L), by=.(b=rev(a))], data.table(b=2:1, N=1L)) # ensure no DT = data.table(a=2:3, b=1:0, key=c('a', 'b')) test(2303.3, DT[, .N, by=.(ab=a^b, d=c(1L, 1L))], data.table(ab=c(2, 1), d=1L, N=1L)) +# NB: these tests have been edited in light of #6740 to be regression tests +# preventing existing behavior from breaking in 1.17.0 while we decide +# whether a breaking change is warranted & how to proceed. The specific tests +# with different behavior under #5558 001,002,005,006,011,012,015,016. # tests for new consistent replacement of list columns with list(NULL), #5558 # replacement of a list column with list(NULL) in a single-row data.table, using different assignment methods DT = data.table(L=list("A"), i=1L) ans = data.table(L=list(NULL), i=1L) # test using replacement with $ operator DT$L = list(NULL) -test(2304.001, DT, ans) +test(2304.001, DT, within(ans, rm('L'))) DT = data.table(L=list("A"), i=1L) # standard form with := operator -test(2304.002, copy(DT)[, L := list(NULL)], ans) +test(2304.002, copy(DT)[, L := list(NULL)], within(ans, rm('L'))) # functional form with := operator test(2304.003, copy(DT)[, `:=`(L=list(NULL))], ans) # functional form with 'let' alias test(2304.004, copy(DT)[, let(L=list(NULL))], ans) # using set() -test(2304.005, set(copy(DT), j="L", value=list(NULL)), ans) +test(2304.005, set(copy(DT), j="L", value=list(NULL)), within(ans, rm('L'))) # replacement of multiple list columns with list(NULL) in a single-row data.table, using different assignment methods DT = data.table(L1=list("A"), L2=list("B"), i=1L) ans = data.table(L1=list(NULL), L2=list(NULL), i=1L) DT$L1 = list(NULL) DT$L2 = list(NULL) -test(2304.006, DT, ans) +test(2304.006, DT, within(ans, rm('L1', 'L2'))) DT = data.table(L1=list("A"), L2=list("B"), i=1L) # standard form with := operator test(2304.007, copy(DT)[, c("L1", "L2") := list(list(NULL), list(NULL))], ans) @@ -20820,23 +20824,23 @@ DT = data.table(L=list("A", "B"), i=1L) ans = data.table(L=list(NULL, NULL), i=1L) # test using replacement with $ operator DT$L = list(NULL) -test(2304.011, DT, ans) +test(2304.011, DT, within(ans, rm('L'))) DT = data.table(L=list("A", "B"), i=1L) # standard form with := operator -test(2304.012, copy(DT)[, L := list(NULL)], ans) +test(2304.012, copy(DT)[, L := list(NULL)], within(ans, rm('L'))) # functional form with := operator test(2304.013, copy(DT)[, `:=`(L=list(NULL))], ans) # functional form with 'let' alias test(2304.014, copy(DT)[, let(L=list(NULL))], ans) # using set() -test(2304.015, set(copy(DT), j="L", value=list(NULL)), ans) +test(2304.015, set(copy(DT), j="L", value=list(NULL)), within(ans, rm('L'))) # replacement of multiple list columns with list(NULL) in a multi-row data.table, using different assignment methods DT = data.table(L1=list("A", "B"), L2=list("B", "C"), i=1L) ans = data.table(L1=list(NULL, NULL), L2=list(NULL, NULL), i=1L) DT$L1 = list(NULL) DT$L2 = list(NULL) -test(2304.016, DT, ans) +test(2304.016, DT, within(ans, rm('L1', 'L2'))) DT = data.table(L1=list("A", "B"), L2=list("B", "C"), i=1L) # standard form with := operator test(2304.017, copy(DT)[, c("L1", "L2") := list(list(NULL), list(NULL))], ans) diff --git a/man/assign.Rd b/man/assign.Rd index e1d4751f3..71e954230 100644 --- a/man/assign.Rd +++ b/man/assign.Rd @@ -78,8 +78,6 @@ When \code{LHS} is a factor column and \code{RHS} is a character vector with ite Unlike \code{<-} for \code{data.frame}, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given when fractional data is truncated. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then \emph{plonked} into that column slot and we call this \emph{plonk syntax}, or \emph{replace column syntax} if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening and it is clearer to readers of your code that you really do intend to change the column type; e.g., \code{DT[, colA:=as.integer(colA)]}. A plonk occurs whenever you provide a RHS value to `:=` which is \code{nrow} long. When a column is \emph{plonked}, the original column is not updated by reference because that would entail updating every single element of that column whereas the plonk is just one column pointer update. \code{data.table}s are \emph{not} copied-on-change by \code{:=}, \code{setkey} or any of the other \code{set*} functions. See \code{\link{copy}}. - -While in most cases standard and functional form of \code{:=} are interchangeable, there are some minor differences in the way that \code{RHS} is handled. In the functional form, \code{:=} operator behaves like an alias to \code{list}. This means that when \code{RHS} is a list, \code{LHS} is assigned a list. Avoid this by using the standard form when \code{RHS} is a list, or use a vector. See \href{../doc/datatable-reference-semantics.html}{\code{vignette("datatable-reference-semantics")}} for examples. } \section{Advanced (internals):}{It is easy to see how \emph{sub-assigning} to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the \code{data.table} can be grown \emph{by reference}: the list vector of column pointers is \emph{over-allocated}, see \code{\link{truelength}}. By defining \code{:=} in \code{j} we believe update syntax is natural, and scales, but it also bypasses \code{[<-} dispatch and allows \code{:=} to update by reference with no copies of any part of memory at all. diff --git a/src/assign.c b/src/assign.c index 6bb8b8df1..e5cc9b84a 100644 --- a/src/assign.c +++ b/src/assign.c @@ -549,11 +549,6 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values) for (int i=0; i