Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement join_by() #222

Open
talegari opened this issue Apr 1, 2021 · 7 comments · May be fixed by #661
Open

Implement join_by() #222

talegari opened this issue Apr 1, 2021 · 7 comments · May be fixed by #661
Labels
feature feature

Comments

@talegari
Copy link

talegari commented Apr 1, 2021

Hi Mark,

Consider adding rolling joins for left and right joins.
data.table has this elegant function missing in dplyr.

Code wise, we just require to add roll = TRUE (by default this can be kept FALSE).
For example, we need to change from return_df <- y[x, on = on_vec, allow.cartesian = TRUE] to return_df <- y[x, on = on_vec, allow.cartesian = TRUE, roll = roll] (from left_join.) with roll being a new argument.

Let me know if you would like me submitting a PR.

@markfairbanks
Copy link
Owner

markfairbanks commented Apr 1, 2021

In general I'm open to this but I'll be honest - I don't know much about rolling joins.

A few things:

  • Is allow.cartesian always set to TRUE for rolling joins? From the link in your comment I don't see allow.cartesian set anywhere, which means the default allow.cartesian = FALSE would be used.
  • Do rolling joins only apply to left and right joins?
  • I also see that there is a rollends option in data.table - is that something that should be added at the same time as roll? Or is that for something else?

@markfairbanks markfairbanks added the feature feature label Apr 1, 2021
@talegari
Copy link
Author

talegari commented Apr 5, 2021

Answering each question in order:

  1. allow.cartesian should be set to FALSE for rolling joins as the number of rows in the result of the join is not greater than max(nrow(X), nrow(Y)) where X and Y are the dataframes to be joined.
  2. Yes, rolling joins are generalizations of left/right joins only, by construct.
  3. rollends help with handling missing values which appear at the ends. This goes with roll implementation.

Edit: data.table's documentation reference

@talegari
Copy link
Author

talegari commented Apr 5, 2021

Let me know what is your naming convention preference for the API and I will go head and submit a PR.

Here is the plan for left_join.:

left_join. <- function(x, y, by = NULL, roll = NULL, rollends = NULL){
    
    stopifnot(length(roll) == 1)
    stopifnot(is.character(roll) || is.numeric(roll))
    
    # do all prelim operations like creating `all_names`, `on_vec`
    # ...
    
    if (is.null(roll)){
        # go with regular left_join
        return_df = y[x, on = on_vec, allow.cartesian = TRUE]
    } else {
        # locf: last obs carried forward (TRUE, + Inf)
        # nocb: next obs caried backward (-Inf)
        # nearest: nearest
        if (is.character(roll)){
            roll %in% c("locf", "nocb", "nearest")
            if (roll == "locf"){
                roll_value = Inf
            } else if (roll == "nocb"){
                roll_value = -Inf
            } else {
                roll_value = "nearest"
            }
        }
        
        # use data.table's defaults when rollends is not specified
        if (is.null(rollends)){
            if (roll_value == "nearest"){
                rollends_value = c(TRUE,TRUE)
            } else if (roll_value >= 0){
                rollends_value = c(FALSE,TRUE)
            } else {
                rollends_value = c(TRUE,FALSE)
            }
        } else {
            stopifnot(is.logical(rollends))
            stopifnot(between(length(rollends), 1, 2))
            rollends_value = rollends
        }
        
        return_df <- y[x
                       , on = on_vec
                       , allow.cartesian = FALSE
                       , roll = roll_value
                       , rollends = rollends_value
                       ]
    }
    
    # apply post-processing ops like setting names, column order
    # ....
    return(as_tidytable(return_df))
        
}

@talegari talegari changed the title Addling rolling join functionality for left and right joins Adding rolling join functionality for left and right joins Apr 5, 2021
@markfairbanks
Copy link
Owner

Thanks for taking a look at this.

Let me know what is your naming convention preference for the API and I will go head and submit a PR.

Let me think about this one for a bit.

@markfairbanks
Copy link
Owner

As I've thought about this I'm going to hold off adding this to tidytable for now. I think it's a good idea to wait for the tidyverse team to create their version and then just mimic their syntax.

In the meantime I'll keep this issue open, as it's definitely functionality that will be in tidytable at some point.

@markfairbanks
Copy link
Owner

markfairbanks commented May 9, 2022

tidyverse/dplyr#5910 was just merged into dplyr. I haven't looked too much into the syntax but there's a bunch of new functionality there.

The documentation here seems to cover everything.

Along with join_by there appears to be preceding/following/between/within/overlaps - all of which can be used within join_by.

Edit: Looks like they got rid of preceding/following and now only have closest

I'm guessing this will be doable for left/right/inner joins. I'm guessing it won't work for full joins because the translation isn't quite as direct in data.table but I could be wrong.

@markfairbanks
Copy link
Owner

markfairbanks commented Sep 27, 2022

This is going to be an interesting one to implement because data.table has left join specifications flipped from dplyr.

So the join_by() specification is flipped from on = .() in data.table.

pacman::p_load(tidytable)

sales <- tidytable(
  id = c(1L, 1L, 1L, 2L, 2L),
  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)

promos <- tidytable(
  id = c(1L, 1L, 2L),
  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)

# Match `id` to `id`, and `sale_date` to `promo_date`
by <- dplyr::join_by(id, sale_date == promo_date)
dplyr::left_join(sales, promos, by)
#> # A tidytable: 5 × 2
#>      id sale_date 
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

dt(promos, sales, on = .(id, promo_date == sale_date))
#> # A tidytable: 5 × 2
#>      id promo_date
#>   <int> <date>    
#> 1     1 2018-12-31
#> 2     1 2019-01-02
#> 3     1 2019-01-05
#> 4     2 2019-01-04
#> 5     2 2019-01-01

Another note: There is a lot of renaming/internal prep done to make sure tidytable::left_join() matches dplyr::left_join() when a character vector is provided. It'll be tough to make that possible here. In the above example data.table returns the promo_date column where we need the sale_date column.

@markfairbanks markfairbanks changed the title Adding rolling join functionality for left and right joins Implement join_by() Oct 3, 2022
@markfairbanks markfairbanks linked a pull request Oct 18, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants