This part of the vignette follows `dplyr`’s vignette
at https://dplyr.tidyverse.org/articles/dplyr.html. We’ll
try to reproduce all of its results. First, load the needed packages.

```
library(tidyfst)
library(nycflights13)
library(data.table)
data.table(flights)
```

## Filter rows with `filter_dt()`

`filter_dt(flights, month == 1 & day == 1)`

Note that commas cannot be used to separate conditions, which means `filter_dt(flights, month == 1,day == 1)` would return an error.

## Arrange rows with `arrange_dt()`

`arrange_dt(flights, year, month, day)`

Use `-` (the minus sign) to order a column in descending order:

`arrange_dt(flights, -arr_delay)`

## Select columns with `select_dt()`

`select_dt(flights, year, month, day)`

`select_dt(flights, year:day)` and `select_dt(flights, -(year:day))` are not supported. But I
have added a feature to help select columns with a regular expression, which means
you can write:

`select_dt(flights, "^dep")`
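
For comparison, a regular-expression selection can be sketched in plain *data.table* with `patterns()` inside `.SDcols`. The toy table and its column names are made up for illustration, and this shows the underlying idea rather than how `select_dt()` is implemented.

```r
library(data.table)

# Hypothetical toy table standing in for `flights`
dt <- data.table(
  dep_time  = c(517L, 533L),
  dep_delay = c(2, 4),
  arr_time  = c(830L, 850L),
  carrier   = c("UA", "AA")
)

# Keep only the columns whose names start with "dep"
dep_cols <- dt[, .SD, .SDcols = patterns("^dep")]
names(dep_cols)
```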

The rename process is almost the same as that in `dplyr`.

## Add new columns with `mutate_dt()`

```
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
```

However, a column that was just created cannot be used in the same `mutate_dt()` call, so please split such steps. The following code would not work:

```
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
```

Instead, use:

```
mutate_dt(flights, gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
```
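
The reason for splitting can be sketched in plain *data.table*, assuming `mutate_dt()` builds on the `:=` operator (an assumption about its internals, not something stated above). In a single `:=` call, all right-hand sides are evaluated before any of the new columns exist. The toy values are made up.

```r
library(data.table)

dt <- data.table(
  arr_delay = c(11, 20),
  dep_delay = c(2, 4),
  air_time  = c(120, 180)
)

# One call: `gain` does not exist yet when the second expression is evaluated
one_call <- tryCatch(
  copy(dt)[, `:=`(gain = arr_delay - dep_delay,
                  gain_per_hour = gain / (air_time / 60))],
  error = function(e) "error"
)

# Two chained calls: `gain` is created first, then used
two_calls <- copy(dt)[, gain := arr_delay - dep_delay][
  , gain_per_hour := gain / (air_time / 60)]
```

Here `one_call` ends up holding the string `"error"`, while `two_calls` contains both new columns.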

If you only want to keep the new variables, use `transmute_dt()`:

```
transmute_dt(flights,
  gain = arr_delay - dep_delay
)
```

## Summarise values with `summarise_dt()`

```
summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
```

## Randomly sample rows with `sample_n_dt()` and `sample_frac_dt()`

```
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
```
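
In plain *data.table*, comparable random samples can be drawn by indexing rows with `sample(.N, ...)`. This is a sketch on a made-up table; the rounding used for the fraction is an assumption and may differ from what `sample_frac_dt()` does.

```r
library(data.table)

set.seed(1)  # make the draws reproducible
dt <- data.table(id = 1:200, x = rnorm(200))

# 10 random rows, drawn without replacement
n_rows <- dt[sample(.N, 10)]

# About 5% of the rows
frac_rows <- dt[sample(.N, round(.N * 0.05))]
```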

For the `dplyr` code below:

```
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
```

We could get it via:

```
flights %>%
  summarise_dt(count = .N,
               dist = mean(distance, na.rm = TRUE),
               delay = mean(arr_delay, na.rm = TRUE),
               by = tailnum) %>%
  filter_dt(count > 20 & dist < 2000)  # completes the last dplyr step
```

`summarise_dt` (or `summarize_dt`) has a `by` parameter that specifies the
grouping. For example, we can find the number of
planes and the number of flights that go to each possible
destination:

```
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )
summarise_dt(flights, planes = uniqueN(tailnum), flights = .N, by = dest) %>%
  arrange_dt(dest)
```

If you need to group by many variables, use:

```
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day <- summarise(daily, flights = n()))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N)
# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights))
# (per_year <- summarise(per_month, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights)) %>%
  summarise_dt(by = .(year), flights = sum(flights))
```
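
The same roll-up can be sketched directly in *data.table* by aggregating each result one level further. The small table is made up; the point is that summing the per-day counts reproduces the counts at each coarser level.

```r
library(data.table)

# Hypothetical toy table: five flights over two months
dt <- data.table(
  year  = 2013L,
  month = c(1L, 1L, 1L, 2L, 2L),
  day   = c(1L, 1L, 2L, 1L, 1L)
)

# Count flights per day, then sum the counts up to months and years
per_day   <- dt[, .(flights = .N), by = .(year, month, day)]
per_month <- per_day[, .(flights = sum(flights)), by = .(year, month)]
per_year  <- per_month[, .(flights = sum(flights)), by = .(year)]
```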

*tidyfst* provides a tidy syntax for *data.table*. By
design, *tidyfst* never runs faster than the analogous
*data.table* code. Nevertheless, it lets dplyr users
gain that computational performance in no time and guides them to learn
more about *data.table* for speed. Below, we’ll compare the syntax of
`tidyfst` and `data.table` (following the “Introduction
to data.table” vignette). This shows how they differ and
lets users choose the one they prefer. Ideally, *tidyfst* will
lead even more users to learn more about *data.table* and its
wonderful features, and in turn to design more extensions for *tidyfst*
in the future.

Because we want a more stable data source, here we’ll use the flight
data from the `nycflights13` package loaded above.

```
library(tidyfst)
library(data.table)
library(nycflights13)
flights = data.table(flights) %>% na.omit()
```

```
# data.table
flights[, sum((arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]
# tidyfst
flights %>% summarise_dt(sum((arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  nrow()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  count_dt()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(.N)
```

The examples above show that in *tidyfst* you
can still use *data.table* features such as `.N`.

```
# data.table
flights[, c("arr_delay", "dep_delay")]
select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]
flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]
# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]
# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))
select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)
flights %>% select_dt(-arr_delay,-dep_delay)
flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
```

```
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin, dest)
flights %>% filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))
```

Note that `keyby` is currently not used in
*tidyfst*. This feature might be included in the future for
better performance in order-independent tasks. Moreover, the result of
`count_dt` is sorted automatically by the count;
this can be controlled with the `sort` parameter.
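
For reference, this is what `keyby` does in *data.table* itself: `by` keeps groups in order of first appearance, while `keyby` also sorts the result by the grouping columns and sets them as the key. The sketch below uses made-up data; the last line sorts by count in descending order, which is an assumption about the direction of the default sort described above.

```r
library(data.table)

dt <- data.table(origin = c("LGA", "JFK", "LGA", "EWR", "JFK", "LGA"))

# Groups in order of first appearance
by_first_seen <- dt[, .N, by = origin]

# Groups sorted by `origin`; the result is keyed on it
by_sorted <- dt[, .N, keyby = origin]

# Groups sorted by their count, largest first
by_count <- dt[, .N, by = origin][order(-N)]
```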

```
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]
# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin, dest, sort = FALSE) %>%
  arrange_dt(origin, -dest)
flights %>%
  summarise_dt(.N, by = .(dep_delay > 0, arr_delay > 0))
```

Now let’s try a more complex example:

```
# data.table
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]
# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean)
  )
```

Let me explain what happens here, especially in `group_dt`. First, filter by the condition
`carrier == "AA"`, then group by three variables:
`origin`, `dest` and `month`. Last, summarise the columns with
“_delay” in their names by taking the mean of each. This is a creative
design that uses `.SD` from *data.table* and upgrades
the `group_by` function in *dplyr*: you never
need to `ungroup`, because the grouped operations all go inside
`group_dt`. And **you can pipe inside the group_dt
function**. Let’s play with it a little further:

```
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean) %>%
      mutate_dt(sum = dep_delay + arr_delay)
  )
```
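
For readers coming from *data.table*, a rough counterpart of this grouped pipeline can be sketched with `.SDcols = patterns(...)` followed by a chained `[` call. The toy table is made up, and this is a comparison only, not a description of how `group_dt` or `at_dt` work internally. The sum is computed on the aggregated result rather than inside each group, which gives the same values here.

```r
library(data.table)

# Hypothetical toy table with two "_delay" columns
dt <- data.table(
  origin    = c("JFK", "JFK", "LGA", "LGA"),
  dep_delay = c(2, 4, 10, 20),
  arr_delay = c(6, 8, 30, 50)
)

# Mean of every column matching "_delay" per group, then add their sum
res <- dt[, lapply(.SD, mean), by = origin, .SDcols = patterns("_delay$")][
  , sum := dep_delay + arr_delay]
```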

However, I don’t recommend piping inside `group_dt` if you don’t actually need it
for the grouped computation (just start another pipe after
`group_dt` instead). Now let’s end with some easy examples:

At its core, *tidyfst* is born from *dplyr* and
*data.table*, and uses *stringr* to make flexible APIs, so
as to bring their strengths into full play.