This part of the vignette follows dplyr’s vignette at
https://dplyr.tidyverse.org/articles/dplyr.html. We’ll
try to reproduce all of its results. First, load the needed packages.
library(tidyfst)
library(nycflights13)
library(data.table)
data.table(flights)
## Filter rows with filter_dt()
filter_dt(flights, month == 1 & day == 1)
Note that commas cannot be used to separate conditions; combine them with & instead. This means that
filter_dt(flights, month == 1,day == 1)
would return an error.
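As a minimal sketch of the two ways around the comma restriction, the conditions can either be joined with `&` or applied one at a time in chained `filter_dt()` calls. The object names `jan1_and` and `jan1_chain` are introduced here for illustration only.

```r
library(tidyfst)
library(nycflights13)

# Option 1: combine both conditions in a single expression with `&`.
jan1_and <- filter_dt(flights, month == 1 & day == 1)

# Option 2: apply one condition per call and chain the calls.
jan1_chain <- flights %>%
  filter_dt(month == 1) %>%
  filter_dt(day == 1)

# Both approaches should keep the same rows.
nrow(jan1_and) == nrow(jan1_chain)
```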
## Arrange rows with arrange_dt()
arrange_dt(flights, year, month, day)
Use - (the minus sign) to order a column in descending order:
arrange_dt(flights, -arr_delay)
## Select columns with select_dt()
select_dt(flights, year, month, day)
select_dt(flights, year:day)
and
select_dt(flights, -(year:day))
are not supported. However, I have added a feature that selects columns
by regular expression, so you can write:
select_dt(flights, "^dep")
The rename process is almost the same as that in dplyr:
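A minimal sketch of such a rename, assuming `rename_dt()` accepts `new_name = old_name` pairs in the same way as `dplyr::rename()`. The new column name `tail_num` and the object name `renamed` are chosen purely for illustration.

```r
library(tidyfst)
library(nycflights13)

# New name on the left, existing column name on the right
# (assumed to mirror dplyr::rename()).
renamed <- rename_dt(flights, tail_num = tailnum)

# The old name should be gone and the new one present.
c("tailnum", "tail_num") %in% names(renamed)
```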
## Add new columns with mutate_dt()
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
However, a newly created column cannot be referenced within the same call, so split such steps into separate calls. The following code would not work:
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
Instead, use:
mutate_dt(flights, gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
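For comparison, the same two-step split can be sketched directly in data.table by chaining two `:=` assignments. This only illustrates the split itself and assumes nothing about how `mutate_dt()` is implemented; the name `flights_dt` is introduced here for the example.

```r
library(data.table)
library(nycflights13)

# Work on a data.table copy so the original tibble is left untouched.
flights_dt <- as.data.table(flights)

# Step 1 creates `gain`; step 2 can then refer to it.
flights_dt[, gain := arr_delay - dep_delay][
  , gain_per_hour := gain / (air_time / 60)]

head(flights_dt[, .(gain, gain_per_hour)])
```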
If you only want to keep the new variables, use transmute_dt():
transmute_dt(flights,
  gain = arr_delay - dep_delay
)
## Summarise values with summarise_dt()
summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
## Randomly sample rows with sample_n_dt() and sample_frac_dt()
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
For the dplyr code below:
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
We could get the same result via:
flights %>%
  summarise_dt(count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),
    by = tailnum) %>%
  filter_dt(count > 20 & dist < 2000)
summarise_dt() (or summarize_dt()) has a parameter by, which specifies
the grouping variables. We could find the number of planes and the
number of flights that go to each possible destination:
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
# planes = n_distinct(tailnum),
# flights = n()
# )
summarise_dt(flights, planes = uniqueN(tailnum), flights = .N, by = dest) %>%
  arrange_dt(dest)
If you need to group by many variables, use:
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day <- summarise(daily, flights = n()))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N)
# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights))
# (per_year <- summarise(per_month, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights)) %>%
  summarise_dt(by = .(year), flights = sum(flights))
tidyfst provides a tidy syntax for data.table. Because of this
design, tidyfst never runs faster than the analogous data.table
code. Nevertheless, it lets dplyr users gain data.table’s computational
performance quickly, and guides them to learn more about data.table
for speed. Below, we’ll compare the syntax of tidyfst and data.table
(following the “Introduction to data.table” vignette). This shows how
the two differ and lets users choose the one they prefer. Ideally,
tidyfst will lead even more users to learn about data.table and its
wonderful features, and to design more extensions for tidyfst
in the future.
Because we want a more stable data source, here we’ll use the flights
data from the nycflights13 package used above.
library(tidyfst)
library(data.table)
library(nycflights13)
flights = data.table(flights) %>% na.omit()
# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]
# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  nrow()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  count_dt()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(.N)
The examples above show that in tidyfst you can still use
data.table’s special symbols, such as .N.
# data.table
flights[, c("arr_delay", "dep_delay")]
select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]
flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]
# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]
# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))
select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)
flights %>% select_dt(-arr_delay,-dep_delay)
flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))
Note that keyby is currently not used in tidyfst. This feature might
be included in the future for better performance in order-independent
tasks. Moreover, the output of count_dt() is sorted automatically by
the count; this can be controlled with the sort parameter.
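A short sketch of the sort parameter of `count_dt()` in isolation. It assumes, as described above, that the default orders rows by the count column; the object names `flights_dt`, `by_count` and `unsorted` are introduced here for illustration.

```r
library(tidyfst)
library(data.table)
library(nycflights13)

flights_dt <- na.omit(data.table(flights))

# Default behaviour: rows are ordered by the count column.
by_count <- count_dt(flights_dt, origin)

# sort = FALSE leaves the groups in the order the grouping returns them.
unsorted <- count_dt(flights_dt, origin, sort = FALSE)

by_count
unsorted
```

Both results hold the same counts; only the row order may differ.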
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]
# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin, dest, sort = FALSE) %>%
  arrange_dt(origin, -dest)
flights %>%
  summarise_dt(.N, by = .(dep_delay > 0, arr_delay > 0))
Now let’s try a more complex example:
# data.table
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]
# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean)
  )
Let me explain what happens here, especially in group_dt(). First,
filter by the condition carrier == "AA", then group by three variables:
origin, dest and month. Last, summarise the columns whose names contain
“_delay” by taking the mean of each. This is a creative design that
utilizes .SD in data.table and upgrades the group_by() function in
dplyr: you never need to ungroup, because all grouped operations are
placed inside group_dt(). You can also pipe inside the group_dt()
function. Let’s play with it a little bit further:
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean) %>%
      mutate_dt(sum = dep_delay + arr_delay)
  )
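For comparison, the grouped pipeline above can be sketched in plain data.table, which makes the role of `.SD` explicit. This is an illustrative equivalent rather than what `group_dt()` runs internally. The names `flights_dt`, `delay_cols` and `res` are introduced for the example, and the delay columns are listed explicitly instead of being matched by the "_delay" pattern.

```r
library(data.table)
library(nycflights13)

flights_dt <- na.omit(data.table(flights))

# Columns to average within each group.
delay_cols <- c("arr_delay", "dep_delay")

# Filter, compute group means over .SD, then add the sum of the two means.
res <- flights_dt[carrier == "AA",
                  lapply(.SD, mean),
                  by = .(origin, dest, month),
                  .SDcols = delay_cols][, sum := dep_delay + arr_delay]

head(res)
```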
I don’t recommend piping inside group_dt() unless you actually need it
for the group computation; otherwise, just start another pipe after
group_dt(). Now let’s end with some easy examples:
Deep inside, tidyfst is born from dplyr and data.table, and uses stringr to build flexible APIs, so as to bring their strengths into full play.