Implement `join_by()` #222

talegari · 2021-04-01T05:14:37Z

Hi Mark,

Consider adding rolling joins for left and right joins.
data.table has this elegant function missing in dplyr.

Code wise, we just require to add roll = TRUE (by default this can be kept FALSE).
For example, we need to change from return_df <- y[x, on = on_vec, allow.cartesian = TRUE] to return_df <- y[x, on = on_vec, allow.cartesian = TRUE, roll = roll] (from left_join.) with roll being a new argument.

Let me know if you would like me submitting a PR.

The text was updated successfully, but these errors were encountered:

markfairbanks · 2021-04-01T15:47:10Z

In general I'm open to this but I'll be honest - I don't know much about rolling joins.

A few things:

Is allow.cartesian always set to TRUE for rolling joins? From the link in your comment I don't see allow.cartesian set anywhere, which means the default allow.cartesian = FALSE would be used.
Do rolling joins only apply to left and right joins?
I also see that there is a rollends option in data.table - is that something that should be added at the same time as roll? Or is that for something else?

talegari · 2021-04-05T08:22:50Z

Answering each question in order:

allow.cartesian should be set to FALSE for rolling joins as the number of rows in the result of the join is not greater than max(nrow(X), nrow(Y)) where X and Y are the dataframes to be joined.
Yes, rolling joins are generalizations of left/right joins only, by construct.
rollends help with handling missing values which appear at the ends. This goes with roll implementation.

Edit: data.table's documentation reference

talegari · 2021-04-05T09:05:33Z

Let me know what is your naming convention preference for the API and I will go head and submit a PR.

Here is the plan for left_join.:

left_join. <- function(x, y, by = NULL, roll = NULL, rollends = NULL){
    
    stopifnot(length(roll) == 1)
    stopifnot(is.character(roll) || is.numeric(roll))
    
    # do all prelim operations like creating `all_names`, `on_vec`
    # ...
    
    if (is.null(roll)){
        # go with regular left_join
        return_df = y[x, on = on_vec, allow.cartesian = TRUE]
    } else {
        # locf: last obs carried forward (TRUE, + Inf)
        # nocb: next obs caried backward (-Inf)
        # nearest: nearest
        if (is.character(roll)){
            roll %in% c("locf", "nocb", "nearest")
            if (roll == "locf"){
                roll_value = Inf
            } else if (roll == "nocb"){
                roll_value = -Inf
            } else {
                roll_value = "nearest"
            }
        }
        
        # use data.table's defaults when rollends is not specified
        if (is.null(rollends)){
            if (roll_value == "nearest"){
                rollends_value = c(TRUE,TRUE)
            } else if (roll_value >= 0){
                rollends_value = c(FALSE,TRUE)
            } else {
                rollends_value = c(TRUE,FALSE)
            }
        } else {
            stopifnot(is.logical(rollends))
            stopifnot(between(length(rollends), 1, 2))
            rollends_value = rollends
        }
        
        return_df <- y[x
                       , on = on_vec
                       , allow.cartesian = FALSE
                       , roll = roll_value
                       , rollends = rollends_value
                       ]
    }
    
    # apply post-processing ops like setting names, column order
    # ....
    return(as_tidytable(return_df))
        
}

markfairbanks · 2021-04-05T15:24:06Z

Thanks for taking a look at this.

Let me know what is your naming convention preference for the API and I will go head and submit a PR.

Let me think about this one for a bit.

markfairbanks · 2021-05-21T21:28:35Z

As I've thought about this I'm going to hold off adding this to tidytable for now. I think it's a good idea to wait for the tidyverse team to create their version and then just mimic their syntax.

In the meantime I'll keep this issue open, as it's definitely functionality that will be in tidytable at some point.

markfairbanks · 2022-05-09T19:18:53Z

tidyverse/dplyr#5910 was just merged into dplyr. I haven't looked too much into the syntax but there's a bunch of new functionality there.

The documentation here seems to cover everything.

Along with join_by there appears to be preceding/following/between/within/overlaps - all of which can be used within join_by.

Edit: Looks like they got rid of preceding/following and now only have closest

I'm guessing this will be doable for left/right/inner joins. I'm guessing it won't work for full joins because the translation isn't quite as direct in data.table but I could be wrong.

markfairbanks · 2022-09-27T01:42:39Z

This is going to be an interesting one to implement because data.table has left join specifications flipped from dplyr.

So the join_by() specification is flipped from on = .() in data.table.

pacman::p_load(tidytable)

sales <- tidytable(
  id = c(1L, 1L, 1L, 2L, 2L),
  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)

promos <- tidytable(
  id = c(1L, 1L, 2L),
  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)

# Match `id` to `id`, and `sale_date` to `promo_date`
by <- dplyr::join_by(id, sale_date == promo_date)
dplyr::left_join(sales, promos, by)
#> # A tidytable: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

dt(promos, sales, on = .(id, promo_date == sale_date))
#> # A tidytable: 5 × 2
#>      id promo_date
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

Another note: There is a lot of renaming/internal prep done to make sure tidytable::left_join() matches dplyr::left_join() when a character vector is provided. It'll be tough to make that possible here. In the above example data.table returns the promo_date column where we need the sale_date column.

markfairbanks added the feature feature label Apr 1, 2021

talegari changed the title ~~Addling rolling join functionality for left and right joins~~ Adding rolling join functionality for left and right joins Apr 5, 2021

markfairbanks changed the title ~~Adding rolling join functionality for left and right joins~~ Implement join_by() Oct 3, 2022

markfairbanks linked a pull request Oct 18, 2022 that will close this issue

Sketch join_by() implementation #661

Draft

markfairbanks mentioned this issue Aug 14, 2023

Tidyverse 2 update #765

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `join_by()` #222

Implement `join_by()` #222

talegari commented Apr 1, 2021

markfairbanks commented Apr 1, 2021 •

edited

Loading

talegari commented Apr 5, 2021 •

edited

Loading

talegari commented Apr 5, 2021 •

edited

Loading

markfairbanks commented Apr 5, 2021

markfairbanks commented May 21, 2021

markfairbanks commented May 9, 2022 •

edited

Loading

markfairbanks commented Sep 27, 2022 •

edited

Loading

Implement join_by() #222

Implement join_by() #222

Comments

talegari commented Apr 1, 2021

markfairbanks commented Apr 1, 2021 • edited Loading

talegari commented Apr 5, 2021 • edited Loading

talegari commented Apr 5, 2021 • edited Loading

markfairbanks commented Apr 5, 2021

markfairbanks commented May 21, 2021

markfairbanks commented May 9, 2022 • edited Loading

markfairbanks commented Sep 27, 2022 • edited Loading

Implement `join_by()` #222

Implement `join_by()` #222

markfairbanks commented Apr 1, 2021 •

edited

Loading

talegari commented Apr 5, 2021 •

edited

Loading

talegari commented Apr 5, 2021 •

edited

Loading

markfairbanks commented May 9, 2022 •

edited

Loading

markfairbanks commented Sep 27, 2022 •

edited

Loading