For maximum speed, use data.table directly. The learning curve
is steeper, but the gain in computational performance pays off
if you frequently work with large datasets.
There are several ways to drop into data.table syntax from tidyfst
to gain higher performance. A convenient one is to use the
DT[I,J,BY] syntax after the pipe (%>%).
library(tidyfst)
#> Thank you for using tidyfst!
#> To acknowledge our work, please cite the package:
#> Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388
iris %>%
as_dt() %>% # coerce a data.frame to data.table
.[,.SD[1],by = Species]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fctr> <num> <num> <num> <num>
#> 1: setosa 5.1 3.5 1.4 0.2
#> 2: versicolor 7.0 3.2 4.7 1.4
#> 3: virginica 6.3 3.3 6.0 2.5
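As a minimal sketch of how the three slots of DT[I,J,BY] fit together in one piped call, the example below filters rows (I), computes a summary (J), and groups the result (BY). The column name mean_petal and the threshold of 5 are illustrative choices, not part of tidyfst.

```r
library(tidyfst)

# I:  keep rows where Sepal.Length is greater than 5
# J:  compute the mean petal length (mean_petal is an illustrative name)
# BY: produce one result row per species
res <- iris %>%
  as_dt() %>%
  .[Sepal.Length > 5, .(mean_petal = mean(Petal.Length)), by = Species]
res
```

The result is a data.table with one row per species and two columns, Species and mean_petal.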
This syntax is not very consistent with the tidy style, so tidyfst
also provides in_dt as a shortcut to the data.table method.
It can be used as follows:
iris %>%
in_dt(,.SD[1],by = Species)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fctr> <num> <num> <num> <num>
#> 1: setosa 5.1 3.5 1.4 0.2
#> 2: versicolor 7.0 3.2 4.7 1.4
#> 3: virginica 6.3 3.3 6.0 2.5
in_dt follows the basic principles of tidyfst,
which include: (1) Never use in-place replacement. Therefore, in-place
operators like := will still return the results. (2)
Always receive a data frame (data.frame/tibble/data.table) and return
a data.table. This means you don’t have to write as.data.table or as_dt
all the time, as long as you are working on data frames in R.
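The two principles can be checked with a short sketch. It assumes in_dt forwards its arguments to the data.table bracket on a copy of the input, which is what the principles above imply; the column name ratio is purely illustrative.

```r
library(tidyfst)

dt <- as_dt(iris)

# Principle (1): := inside in_dt returns a result that can be assigned
res <- dt %>% in_dt(, ratio := Sepal.Length / Sepal.Width)
"ratio" %in% names(res)   # the returned table carries the new column
"ratio" %in% names(dt)    # per principle (1), dt itself should be unchanged

# Principle (2): a plain data.frame goes in, a data.table comes out
first_rows <- iris %>% in_dt(1:3)
class(first_rows)
```

Because the result is returned rather than applied in place, it needs to be assigned or piped onward to be kept.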