This post has referred to a vignette from dplyr
, you can find it in https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html. We’ll try to display how to join data tables in this vignette. First, load the packages we need and get some data.
library(tidydt)
#>
#> Life's short, use R.
library(nycflights13)
flights2 <- flights %>%
select_dt(year,month,day, hour, origin, dest, tailnum, carrier)
Do a left join with a simple:
flights2 %>%
left_join_dt(airlines)
#> Joining by: carrier
#> carrier year month day hour origin dest tailnum name
#> 1: 9E 2013 1 1 8 JFK MSP N915XJ Endeavor Air Inc.
#> 2: 9E 2013 1 1 15 JFK IAD N8444F Endeavor Air Inc.
#> 3: 9E 2013 1 1 14 JFK BUF N920XJ Endeavor Air Inc.
#> 4: 9E 2013 1 1 15 JFK SYR N8409N Endeavor Air Inc.
#> 5: 9E 2013 1 1 15 JFK ROC N8631E Endeavor Air Inc.
#> ---
#> 336772: YV 2013 9 29 16 LGA IAD N518LR Mesa Airlines Inc.
#> 336773: YV 2013 9 29 17 LGA CLT N932LR Mesa Airlines Inc.
#> 336774: YV 2013 9 30 16 LGA IAD N510MJ Mesa Airlines Inc.
#> 336775: YV 2013 9 30 17 LGA CLT N905FJ Mesa Airlines Inc.
#> 336776: YV 2013 9 30 20 LGA CLT N924FJ Mesa Airlines Inc.
Join works the same as dplyr
:
flights2 %>% left_join_dt(weather)
#> Joining by: year,month,day,hour,origin
#> year month day hour origin dest tailnum carrier temp dewp humid
#> 1: 2013 1 1 5 EWR IAH N14228 UA 39.02 28.04 64.43
#> 2: 2013 1 1 5 EWR ORD N39463 UA 39.02 28.04 64.43
#> 3: 2013 1 1 5 JFK MIA N619AA AA 39.02 26.96 61.63
#> 4: 2013 1 1 5 JFK BQN N804JB B6 39.02 26.96 61.63
#> 5: 2013 1 1 5 JFK BOS N708JB B6 39.02 26.96 61.63
#> ---
#> 336772: 2013 12 31 23 EWR SJU N651JB B6 NA NA NA
#> 336773: 2013 12 31 23 JFK BQN N566JB B6 NA NA NA
#> 336774: 2013 12 31 23 JFK SJU N713TW DL NA NA NA
#> 336775: 2013 12 31 23 JFK SJU N509JB B6 NA NA NA
#> 336776: 2013 12 31 23 JFK PSE N665JB B6 NA NA NA
#> wind_dir wind_speed wind_gust precip pressure visib time_hour
#> 1: 260 12.65858 NA 0 1011.9 10 2013-01-01 05:00:00
#> 2: 260 12.65858 NA 0 1011.9 10 2013-01-01 05:00:00
#> 3: 260 14.96014 NA 0 1012.1 10 2013-01-01 05:00:00
#> 4: 260 14.96014 NA 0 1012.1 10 2013-01-01 05:00:00
#> 5: 260 14.96014 NA 0 1012.1 10 2013-01-01 05:00:00
#> ---
#> 336772: NA NA NA NA NA NA <NA>
#> 336773: NA NA NA NA NA NA <NA>
#> 336774: NA NA NA NA NA NA <NA>
#> 336775: NA NA NA NA NA NA <NA>
#> 336776: NA NA NA NA NA NA <NA>
flights2 %>% left_join_dt(planes, by = "tailnum")
#> tailnum year.x month day hour origin dest carrier year.y type
#> 1: <NA> 2013 1 2 15 JFK LAX AA NA <NA>
#> 2: <NA> 2013 1 2 16 EWR ORD UA NA <NA>
#> 3: <NA> 2013 1 3 8 EWR MIA UA NA <NA>
#> 4: <NA> 2013 1 3 6 EWR DFW UA NA <NA>
#> 5: <NA> 2013 1 4 8 JFK DCA 9E NA <NA>
#> ---
#> 336772: N9EAMQ 2013 9 27 16 LGA ATL MQ NA <NA>
#> 336773: N9EAMQ 2013 9 29 12 LGA BNA MQ NA <NA>
#> 336774: N9EAMQ 2013 9 29 18 LGA CMH MQ NA <NA>
#> 336775: N9EAMQ 2013 9 30 11 JFK DCA MQ NA <NA>
#> 336776: N9EAMQ 2013 9 30 14 JFK TPA MQ NA <NA>
#> manufacturer model engines seats speed engine
#> 1: <NA> <NA> NA NA NA <NA>
#> 2: <NA> <NA> NA NA NA <NA>
#> 3: <NA> <NA> NA NA NA <NA>
#> 4: <NA> <NA> NA NA NA <NA>
#> 5: <NA> <NA> NA NA NA <NA>
#> ---
#> 336772: <NA> <NA> NA NA NA <NA>
#> 336773: <NA> <NA> NA NA NA <NA>
#> 336774: <NA> <NA> NA NA NA <NA>
#> 336775: <NA> <NA> NA NA NA <NA>
#> 336776: <NA> <NA> NA NA NA <NA>
flights2 %>% left_join_dt(airports, c("dest" = "faa"))
#> dest year month day hour origin tailnum carrier
#> 1: ABQ 2013 10 1 20 JFK N554JB B6
#> 2: ABQ 2013 10 2 20 JFK N607JB B6
#> 3: ABQ 2013 10 3 20 JFK N591JB B6
#> 4: ABQ 2013 10 4 20 JFK N662JB B6
#> 5: ABQ 2013 10 5 19 JFK N580JB B6
#> ---
#> 336772: XNA 2013 9 29 17 LGA N725MQ MQ
#> 336773: XNA 2013 9 30 7 LGA N735MQ MQ
#> 336774: XNA 2013 9 30 8 EWR N14117 EV
#> 336775: XNA 2013 9 30 15 LGA N725MQ MQ
#> 336776: XNA 2013 9 30 17 LGA N720MQ MQ
#> name lat lon alt tz dst
#> 1: Albuquerque International Sunport 35.04022 -106.60919 5355 -7 A
#> 2: Albuquerque International Sunport 35.04022 -106.60919 5355 -7 A
#> 3: Albuquerque International Sunport 35.04022 -106.60919 5355 -7 A
#> 4: Albuquerque International Sunport 35.04022 -106.60919 5355 -7 A
#> 5: Albuquerque International Sunport 35.04022 -106.60919 5355 -7 A
#> ---
#> 336772: NW Arkansas Regional 36.28187 -94.30681 1287 -6 A
#> 336773: NW Arkansas Regional 36.28187 -94.30681 1287 -6 A
#> 336774: NW Arkansas Regional 36.28187 -94.30681 1287 -6 A
#> 336775: NW Arkansas Regional 36.28187 -94.30681 1287 -6 A
#> 336776: NW Arkansas Regional 36.28187 -94.30681 1287 -6 A
#> tzone
#> 1: America/Denver
#> 2: America/Denver
#> 3: America/Denver
#> 4: America/Denver
#> 5: America/Denver
#> ---
#> 336772: America/Chicago
#> 336773: America/Chicago
#> 336774: America/Chicago
#> 336775: America/Chicago
#> 336776: America/Chicago
flights2 %>% left_join_dt(airports, c("origin" = "faa"))
#> origin year month day hour dest tailnum carrier name
#> 1: EWR 2013 1 1 5 IAH N14228 UA Newark Liberty Intl
#> 2: EWR 2013 1 1 5 ORD N39463 UA Newark Liberty Intl
#> 3: EWR 2013 1 1 6 FLL N516JB B6 Newark Liberty Intl
#> 4: EWR 2013 1 1 6 SFO N53441 UA Newark Liberty Intl
#> 5: EWR 2013 1 1 6 LAS N76515 UA Newark Liberty Intl
#> ---
#> 336772: LGA 2013 9 30 18 BNA N740EV EV La Guardia
#> 336773: LGA 2013 9 30 22 SYR <NA> 9E La Guardia
#> 336774: LGA 2013 9 30 12 BNA N535MQ MQ La Guardia
#> 336775: LGA 2013 9 30 11 CLE N511MQ MQ La Guardia
#> 336776: LGA 2013 9 30 8 RDU N839MQ MQ La Guardia
#> lat lon alt tz dst tzone
#> 1: 40.69250 -74.16867 18 -5 A America/New_York
#> 2: 40.69250 -74.16867 18 -5 A America/New_York
#> 3: 40.69250 -74.16867 18 -5 A America/New_York
#> 4: 40.69250 -74.16867 18 -5 A America/New_York
#> 5: 40.69250 -74.16867 18 -5 A America/New_York
#> ---
#> 336772: 40.77725 -73.87261 22 -5 A America/New_York
#> 336773: 40.77725 -73.87261 22 -5 A America/New_York
#> 336774: 40.77725 -73.87261 22 -5 A America/New_York
#> 336775: 40.77725 -73.87261 22 -5 A America/New_York
#> 336776: 40.77725 -73.87261 22 -5 A America/New_York
df1 <- data.table(x = c(1, 2), y = 2:1)
df2 <- data.table(x = c(1, 3), a = 10, b = "a")
df1 %>% inner_join_dt(df2)
#> Joining by: x
#> x y a b
#> 1: 1 2 10 a
df1 %>% left_join_dt(df2)
#> Joining by: x
#> x y a b
#> 1: 1 2 10 a
#> 2: 2 1 NA <NA>
df1 %>% right_join_dt(df2)
#> Joining by: x
#> x y a b
#> 1: 1 2 10 a
#> 2: 3 NA 10 a
df1 %>% full_join_dt(df2)
#> Joining by: x
#> x y a b
#> 1: 1 2 10 a
#> 2: 2 1 NA <NA>
#> 3: 3 NA 10 a
If all you have is a data.frame or tibble, you have no need to change the format. Feed the data directly:
df1 <- data.frame(x = c(1, 1, 2), y = 1:3)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join_dt(df2)
#> Joining by: x
#> x y z
#> 1: 1 1 a
#> 2: 1 1 b
#> 3: 1 2 a
#> 4: 1 2 b
#> 5: 2 3 a
The "_dt" suffix should remind you that this is backed up by data.table
and will always return a data.table in the end.
Filtering joins have also been supported in tidydt
.
flights %>%
anti_join_dt(planes, by = "tailnum") %>%
count_dt(tailnum, sort = TRUE)
#> tailnum n
#> 1: <NA> 2512
#> 2: N725MQ 575
#> 3: N722MQ 513
#> 4: N723MQ 507
#> 5: N713MQ 483
#> ---
#> 718: N7BKAA 1
#> 719: N7CAAA 1
#> 720: N5FCAA 1
#> 721: N5ERAA 1
#> 722: N647MQ 1
Other examples (semi_join_dt()
and anti_join_dt()
never duplicate; they only ever remove observations.):
df1 <- data.frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
#> [1] 4
# And we get four rows after the join
df1 %>% inner_join_dt(df2, by = "x") %>% nrow()
#> [1] 4
# But only two rows actually match
df1 %>% semi_join_dt(df2, by = "x") %>% nrow()
#> [1] 2