This post has referred to a vignette from dplyr, you can find it in https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html. We’ll try to display how to join data tables in this vignette. First, load the packages we need and get some data.

library(tidydt)
#> 
#> Life's short, use R.
library(nycflights13)

flights2 <- flights %>% 
  select_dt(year,month,day, hour, origin, dest, tailnum, carrier)

Do a left join with a simple:

Controlling how the tables are matched

Join works the same as dplyr:

flights2 %>% left_join_dt(weather)
#> Joining by: year,month,day,hour,origin
#>         year month day hour origin dest tailnum carrier  temp  dewp humid
#>      1: 2013     1   1    5    EWR  IAH  N14228      UA 39.02 28.04 64.43
#>      2: 2013     1   1    5    EWR  ORD  N39463      UA 39.02 28.04 64.43
#>      3: 2013     1   1    5    JFK  MIA  N619AA      AA 39.02 26.96 61.63
#>      4: 2013     1   1    5    JFK  BQN  N804JB      B6 39.02 26.96 61.63
#>      5: 2013     1   1    5    JFK  BOS  N708JB      B6 39.02 26.96 61.63
#>     ---                                                                  
#> 336772: 2013    12  31   23    EWR  SJU  N651JB      B6    NA    NA    NA
#> 336773: 2013    12  31   23    JFK  BQN  N566JB      B6    NA    NA    NA
#> 336774: 2013    12  31   23    JFK  SJU  N713TW      DL    NA    NA    NA
#> 336775: 2013    12  31   23    JFK  SJU  N509JB      B6    NA    NA    NA
#> 336776: 2013    12  31   23    JFK  PSE  N665JB      B6    NA    NA    NA
#>         wind_dir wind_speed wind_gust precip pressure visib           time_hour
#>      1:      260   12.65858        NA      0   1011.9    10 2013-01-01 05:00:00
#>      2:      260   12.65858        NA      0   1011.9    10 2013-01-01 05:00:00
#>      3:      260   14.96014        NA      0   1012.1    10 2013-01-01 05:00:00
#>      4:      260   14.96014        NA      0   1012.1    10 2013-01-01 05:00:00
#>      5:      260   14.96014        NA      0   1012.1    10 2013-01-01 05:00:00
#>     ---                                                                        
#> 336772:       NA         NA        NA     NA       NA    NA                <NA>
#> 336773:       NA         NA        NA     NA       NA    NA                <NA>
#> 336774:       NA         NA        NA     NA       NA    NA                <NA>
#> 336775:       NA         NA        NA     NA       NA    NA                <NA>
#> 336776:       NA         NA        NA     NA       NA    NA                <NA>
flights2 %>% left_join_dt(planes, by = "tailnum")
#>         tailnum year.x month day hour origin dest carrier year.y type
#>      1:    <NA>   2013     1   2   15    JFK  LAX      AA     NA <NA>
#>      2:    <NA>   2013     1   2   16    EWR  ORD      UA     NA <NA>
#>      3:    <NA>   2013     1   3    8    EWR  MIA      UA     NA <NA>
#>      4:    <NA>   2013     1   3    6    EWR  DFW      UA     NA <NA>
#>      5:    <NA>   2013     1   4    8    JFK  DCA      9E     NA <NA>
#>     ---                                                              
#> 336772:  N9EAMQ   2013     9  27   16    LGA  ATL      MQ     NA <NA>
#> 336773:  N9EAMQ   2013     9  29   12    LGA  BNA      MQ     NA <NA>
#> 336774:  N9EAMQ   2013     9  29   18    LGA  CMH      MQ     NA <NA>
#> 336775:  N9EAMQ   2013     9  30   11    JFK  DCA      MQ     NA <NA>
#> 336776:  N9EAMQ   2013     9  30   14    JFK  TPA      MQ     NA <NA>
#>         manufacturer model engines seats speed engine
#>      1:         <NA>  <NA>      NA    NA    NA   <NA>
#>      2:         <NA>  <NA>      NA    NA    NA   <NA>
#>      3:         <NA>  <NA>      NA    NA    NA   <NA>
#>      4:         <NA>  <NA>      NA    NA    NA   <NA>
#>      5:         <NA>  <NA>      NA    NA    NA   <NA>
#>     ---                                              
#> 336772:         <NA>  <NA>      NA    NA    NA   <NA>
#> 336773:         <NA>  <NA>      NA    NA    NA   <NA>
#> 336774:         <NA>  <NA>      NA    NA    NA   <NA>
#> 336775:         <NA>  <NA>      NA    NA    NA   <NA>
#> 336776:         <NA>  <NA>      NA    NA    NA   <NA>
flights2 %>% left_join_dt(airports, c("dest" = "faa"))
#>         dest year month day hour origin tailnum carrier
#>      1:  ABQ 2013    10   1   20    JFK  N554JB      B6
#>      2:  ABQ 2013    10   2   20    JFK  N607JB      B6
#>      3:  ABQ 2013    10   3   20    JFK  N591JB      B6
#>      4:  ABQ 2013    10   4   20    JFK  N662JB      B6
#>      5:  ABQ 2013    10   5   19    JFK  N580JB      B6
#>     ---                                                
#> 336772:  XNA 2013     9  29   17    LGA  N725MQ      MQ
#> 336773:  XNA 2013     9  30    7    LGA  N735MQ      MQ
#> 336774:  XNA 2013     9  30    8    EWR  N14117      EV
#> 336775:  XNA 2013     9  30   15    LGA  N725MQ      MQ
#> 336776:  XNA 2013     9  30   17    LGA  N720MQ      MQ
#>                                      name      lat        lon  alt tz dst
#>      1: Albuquerque International Sunport 35.04022 -106.60919 5355 -7   A
#>      2: Albuquerque International Sunport 35.04022 -106.60919 5355 -7   A
#>      3: Albuquerque International Sunport 35.04022 -106.60919 5355 -7   A
#>      4: Albuquerque International Sunport 35.04022 -106.60919 5355 -7   A
#>      5: Albuquerque International Sunport 35.04022 -106.60919 5355 -7   A
#>     ---                                                                  
#> 336772:              NW Arkansas Regional 36.28187  -94.30681 1287 -6   A
#> 336773:              NW Arkansas Regional 36.28187  -94.30681 1287 -6   A
#> 336774:              NW Arkansas Regional 36.28187  -94.30681 1287 -6   A
#> 336775:              NW Arkansas Regional 36.28187  -94.30681 1287 -6   A
#> 336776:              NW Arkansas Regional 36.28187  -94.30681 1287 -6   A
#>                   tzone
#>      1:  America/Denver
#>      2:  America/Denver
#>      3:  America/Denver
#>      4:  America/Denver
#>      5:  America/Denver
#>     ---                
#> 336772: America/Chicago
#> 336773: America/Chicago
#> 336774: America/Chicago
#> 336775: America/Chicago
#> 336776: America/Chicago
flights2 %>% left_join_dt(airports, c("origin" = "faa"))
#>         origin year month day hour dest tailnum carrier                name
#>      1:    EWR 2013     1   1    5  IAH  N14228      UA Newark Liberty Intl
#>      2:    EWR 2013     1   1    5  ORD  N39463      UA Newark Liberty Intl
#>      3:    EWR 2013     1   1    6  FLL  N516JB      B6 Newark Liberty Intl
#>      4:    EWR 2013     1   1    6  SFO  N53441      UA Newark Liberty Intl
#>      5:    EWR 2013     1   1    6  LAS  N76515      UA Newark Liberty Intl
#>     ---                                                                    
#> 336772:    LGA 2013     9  30   18  BNA  N740EV      EV          La Guardia
#> 336773:    LGA 2013     9  30   22  SYR    <NA>      9E          La Guardia
#> 336774:    LGA 2013     9  30   12  BNA  N535MQ      MQ          La Guardia
#> 336775:    LGA 2013     9  30   11  CLE  N511MQ      MQ          La Guardia
#> 336776:    LGA 2013     9  30    8  RDU  N839MQ      MQ          La Guardia
#>              lat       lon alt tz dst            tzone
#>      1: 40.69250 -74.16867  18 -5   A America/New_York
#>      2: 40.69250 -74.16867  18 -5   A America/New_York
#>      3: 40.69250 -74.16867  18 -5   A America/New_York
#>      4: 40.69250 -74.16867  18 -5   A America/New_York
#>      5: 40.69250 -74.16867  18 -5   A America/New_York
#>     ---                                               
#> 336772: 40.77725 -73.87261  22 -5   A America/New_York
#> 336773: 40.77725 -73.87261  22 -5   A America/New_York
#> 336774: 40.77725 -73.87261  22 -5   A America/New_York
#> 336775: 40.77725 -73.87261  22 -5   A America/New_York
#> 336776: 40.77725 -73.87261  22 -5   A America/New_York

Types of join

If all you have is a data.frame or tibble, you have no need to change the format. Feed the data directly:

The "_dt" suffix should remind you that this is backed up by data.table and will always return a data.table in the end.

Filtering joins

Filtering joins have also been supported in tidydt.

Other examples (semi_join_dt() and anti_join_dt() never duplicate; they only ever remove observations.):

Set operations

For set operations, use data.table’s function directly. tidydt have exported the following functions:

  • fintersect(x, y, all = FALSE)

  • fsetdiff(x, y, all = FALSE)

  • funion(x, y, all = FALSE)

  • fsetequal(x, y, all = TRUE)

    For details, just find the help from data.table using ?setops.