Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbindlist support fill=TRUE with use.names=FALSE and use it in merge.R ToDo of #678 #5263

Merged
merged 12 commits into from
Nov 23, 2021

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Nov 19, 2021

Closes #5262
Closes #5037

merge.R

  • use set in merge.R instead of cbind

@codecov
Copy link

codecov bot commented Nov 19, 2021

Codecov Report

Merging #5263 (4030b94) into master (d8dc315) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
- Coverage   99.50%   99.50%   -0.01%     
==========================================
  Files          77       77              
  Lines       14605    14599       -6     
==========================================
- Hits        14533    14527       -6     
  Misses         72       72              
Impacted Files Coverage Δ
R/merge.R 100.00% <100.00%> (ø)
src/rbindlist.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8dc315...4030b94. Read the comment docs.

R/merge.R Outdated
yy = cbind(yy, x[tmp, othercolsx, with = FALSE])
nx = c(names(yy), names(x[tmp, othercolsx, with = FALSE]))
nx = make.unique(nx)
set(yy, NULL, tail(nx, -ncol(yy)), x[tmp, othercolsx, with = FALSE])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a noticeable memory advantage to looping over the RHS columns instead of creating the whole with=FALSE subset table?

it looks like the advantage is we can do the tmp subset once & apply it to all columns...

Copy link
Member Author

@ben-schwen ben-schwen Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guessed that saving this tmp object and doing indexing twice is preferred over using additional memory. Rethinking it again, I might have figured out what the ToDo meant initially where using set omits this object/indexing anyway.

@ben-schwen
Copy link
Member Author

ben-schwen commented Nov 19, 2021

@MichaelChirico whats the preferred process here: 1 PR per source file? Or 1 PR per todo? Or 1 megathread PR? And should I update the issue list of #678 or create new issues?

@ben-schwen ben-schwen added the WIP label Nov 19, 2021
@MichaelChirico
Copy link
Member

I think generally one PR per TODO makes the most sense, but if you see a cluster of TODOs in closely related code, do combine them. basically TODOs that are logically independent should get their own PRs

R/merge.R Outdated
yy = cbind(yy, x[tmp, othercolsx, with = FALSE])
nx = c(names(yy), paste0("V",seq_len(length(othercolsx))))
nx = make.unique(nx)
set(yy, NULL, tail(nx, -ncol(yy)), rep(list(NA), length(othercolsx)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be NA_integer_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, NA logical works fine here. rbindlist coerces types so logical columns of yywill be coerced to the types of dt. The reason behind the previous tmp used NA_integer_ is that it was used for indexing the data.table and therefore was integer.

@ben-schwen ben-schwen changed the title Closing ToDo's of #678 merge.R ToDo's of #678 Nov 20, 2021
@ben-schwen ben-schwen removed the WIP label Nov 20, 2021
Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. I recall cbindlist in a PR has option for controlling copy behavior.

R/merge.R Outdated Show resolved Hide resolved
@jangorecki
Copy link
Member

jangorecki commented Nov 21, 2021

Uh, sorry for late feedback but I realized nrow(y)=1 can be useful to test against as well. I can imagine that there could be different handling of a copy when you subset all rows and the index is scalar at the same time (based on DT[TRUE] doing shallow copy). Kind of edge case but worth to have it as well.

@ben-schwen
Copy link
Member Author

Uh, sorry for late feedback but I realized nrow(y)=1 can be useful to test against as well. I can imagine that there could be different handling of a copy when you subset all rows and the index is scalar at the same time (based on DT[TRUE] doing shallow copy). Kind of edge case but worth to have it as well.

Good point. I added test cases to check for accidental shallow copying.

AFAIU current behavior of shallow copying of #3215 (and friends) leads only to problems when pre-copy cols are altered in the copy since these cols point to the same address? Since we only add cols here and don't alter other cols in yy, I guessed there shouldn't be a problem here.

@@ -1875,6 +1877,16 @@ test(630.1, merge(DT1,DT2,all.x=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a"

test(631, merge(DT1,DT2,all.y=TRUE), data.table(a=c(2,3,5),total.x=c(NA,1,1),total.y=c(5,1,2),key="a"))
test(631.1, merge(DT1,DT2,all.y=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a",all.y=TRUE)),a))
# ensure merge(x,y,all.y) does not alter input y
# merge containing idx 1:nrow(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is a little unclear on its own -- ideally we can read the comment without any more context and know the point of the test. I think 'idx' refers to the implementation? best to be more explicit about what it means

@mattdowle mattdowle added this to the 1.14.3 milestone Nov 23, 2021
@mattdowle
Copy link
Member

mattdowle commented Nov 23, 2021

I tried to get to the root cause of this block of code. It took me a while to realize that all it was doing was adding on the right number of NA columns to yy so it could be rbind-ed using use.names=FALSE. So I made rbind support use.names=FALSE together with fill=TRUE and that block goes away.
I also removed the long standing comment about issue #24 after looking at it and thinking that comment is unlikely to be useful in the future now. (To document that I didn't accidentally delete that comment.)
Left the new tests in place and unchanged.

@mattdowle mattdowle changed the title merge.R ToDo's of #678 rbindlist support fill=TRUE with use.names=FALSE and use it in merge.R ToDo of #678 Nov 23, 2021
@mattdowle mattdowle merged commit 4922384 into master Nov 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

R/merge.R: # TO DO: use set() here instead.. rbindlist use.names = FALSE AND fill = TRUE
4 participants