One may wonder how fast tidyfst is. Well, it depends. Generally, it is about as fast as data.table because it is backed by it, but it spends some extra time generating the data.table code. This extra time is marginal on large (and even small) data sets.

Now let’s run a test to compare the performance of tidyfst, data.table and dplyr. This vignette uses a small data set. The example was provided by the data.table package (https://h2oai.github.io/db-benchmark/) and has been tweaked here. All of the tests are computations by group.

First let’s load the package and generate some data.

# load packages
library(tidyfst)
#> Thank you for using tidyfst!
#> To acknowledge our work, please cite the package:
#> Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388
library(data.table)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:tidyfst':
#> 
#>     between, cummean, nth
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(bench)

# generate the data
# if you have an HPC and want to try larger data sets, increase N
N = 1e4 
K = 1e2

set.seed(2020)

cat(sprintf("Producing data of %s rows and %s K groups factors\n", N, K))
#> Producing data of 10000 rows and 100 K groups factors

DT = data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  round(runif(N,max=100),4)                    # numeric e.g. 23.5749
)

object_size(DT)
#> 527.7 Kb

This data set is rather small, at around 527 Kb. However, with the bench package we can still detect differences by running more iterations. This way, the examples listed here can be run even on relatively low-performance computers.
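As a minimal sketch of how repeated iterations make small timing differences visible, the block below compares two base R sorting calls with bench::mark. The sorting calls and the choice of 50 iterations are assumptions made only for illustration; they are not part of the benchmark in this vignette.

```r
library(bench)

set.seed(2020)
x <- runif(1e4)  # a small vector, so a single run is too fast to time reliably

# Each expression is run repeatedly; bench::mark then reports summary
# statistics (min, median, itr/sec) over those runs.
res <- bench::mark(
  default = sort(x),
  radix   = sort(x, method = "radix"),
  iterations = 50
)

res$median  # one median timing per expression
res$n_itr   # iterations kept in the summary (runs with garbage collection may be dropped)
```

With timings this short, the median over many runs is usually a steadier basis for comparison than any single measurement.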

Q1

Here, we try to get the median and standard deviation by group. Since dplyr v1.0.0, the regrouping behaviour of summarise() can sometimes be confusing (it prints a message about how the output is grouped). If you use it, make sure the data are in the right groups before any grouped computation. In tidyfst and data.table, the “by” parameter specifies the groups. Here we do not check whether the results are equal, because dplyr returns a tibble even when the input is a data.table. Each of the tests below runs for 10 iterations.
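The grouping behaviour described above can be sketched on a toy table (the table `toy` and its columns are made up for illustration and are not the benchmark data). The `.groups = "drop"` argument of dplyr's summarise() returns an ungrouped result and suppresses the message, and converting the tibble back to a data.table is one possible way to compare results across packages.

```r
library(data.table)
library(dplyr)

# Toy data: two grouping columns and one value column
toy <- data.table(
  g1 = c(1L, 1L, 1L, 2L, 2L, 2L),
  g2 = c(1L, 1L, 2L, 1L, 2L, 2L),
  x  = c(1, 3, 5, 7, 9, 11)
)

# data.table: groups are given in `by`; the result carries no grouping
res_dt <- toy[, .(median_x = median(x)), by = .(g1, g2)]

# dplyr: `.groups = "drop"` gives an ungrouped tibble and no message
res_tbl <- toy %>%
  group_by(g1, g2) %>%
  summarise(median_x = median(x), .groups = "drop")

# Bring both results to the same class and row order before comparing
res_dt <- res_dt[order(g1, g2)]
res_from_dplyr <- as.data.table(res_tbl)[order(g1, g2)]
all.equal(res_dt, res_from_dplyr, check.attributes = FALSE)
```

In the benchmarks of this vignette the comparison is simply switched off with `check = FALSE`, which keeps the timed expressions free of any conversion cost.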

bench::mark(
  data.table = DT[,.(median_v3 = median(v3),
                     sd_v3 = sd(v3)),
                  by = .(id4,id5)],
  tidyfst = DT %>%
    summarise_dt(
      by = "id4,id5",
      median_v3 = median(v3),
      sd_v3 = sd(v3)
    ),
  dplyr = DT %>%
    group_by(id4,id5,.drop = TRUE) %>%
    summarise(median_v3 = median(v3),sd_v3 = sd(v3)),
  check = FALSE,iterations = 10
) -> q1
#> `summarise()` has grouped output by 'id4'. You can override using the `.groups`
#> argument.

q1
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 data.table   1.59ms   1.66ms    577.      2.21MB      0  
#> 2 tidyfst      1.64ms   1.71ms    585.    663.61KB      0  
#> 3 dplyr      243.79ms 248.43ms      3.82    5.57MB     15.6

The time spent by tidyfst and data.table is quite similar, and much less than that spent by dplyr.

Q2

This example performs much like the one above. tidyfst might spend slightly more time and memory on code translation than data.table, but it still performs much better than dplyr.

bench::mark(
  data.table =DT[,.(range_v1_v2 = max(v1) - min(v2)),by = id3],
  tidyfst = DT %>% summarise_dt(
    by = id3,
    range_v1_v2 = max(v1) - min(v2)
  ),
  dplyr = DT %>%
    group_by(id3,.drop = TRUE) %>%
    summarise(range_v1_v2 = max(v1) - min(v2)),
  check = FALSE,iterations = 10
) -> q2

q2
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 data.table  709.5µs  736.6µs     1300.    92.9KB        0
#> 2 tidyfst     666.6µs 696.05µs     1439.    92.9KB        0
#> 3 dplyr        3.11ms   3.33ms      301.   475.1KB        0

Q3

Here we’ll show a rather different test to demonstrate the flexibility of tidyfst. In tidyfst, the more your code is written like data.table, the faster it can run. If you write it more like dplyr, the code might be more readable but slower. tidyfst provides the in_dt function so that you can write data.table code directly to gain speed when you hit a bottleneck.

In the following example, we pass exactly the same data.table syntax to tidyfst::in_dt.
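To make the correspondence concrete, here is a small sketch on a toy table (`toy`, `g` and `x` are illustrative names, not part of the benchmark). It assumes that in_dt forwards its arguments to the `[` of data.table, which is how the function is used in this vignette.

```r
library(data.table)
library(tidyfst)

toy <- data.table(
  g = c("a", "a", "a", "b", "b"),
  x = c(3, 9, 5, 2, 8)
)

# Plain data.table: sort by x descending, then keep the top 2 values per group
direct <- toy[order(-x), .(top2 = head(x, 2L)), by = g]

# The same three arguments (i, j, by) handed to in_dt inside a pipe
via_in_dt <- toy %>%
  in_dt(order(-x), .(top2 = head(x, 2L)), by = g)

all.equal(direct, via_in_dt)
```

Because the arguments are identical, any difference in timing should come only from the thin wrapper around the data.table call.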

bench::mark(
  data.table =DT[order(-v3),.(largest2_v3 = head(v3,2L)),by = id6],
  tidyfst = DT %>%
    in_dt(order(-v3),.(largest2_v3 = head(v3,2L)),by = id6),
  dplyr = DT %>%
    select(id6,largest2_v3 = v3) %>%
    group_by(id6) %>%
    slice_max(largest2_v3,n = 2,with_ties = FALSE),
  check = FALSE,iterations = 10
) -> q3

q3
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 data.table   1.95ms   1.99ms     489.      396KB      0  
#> 2 tidyfst       2.2ms   2.35ms     420.      842KB      0  
#> 3 dplyr       18.13ms  18.31ms      54.3       2MB     13.6

Q4

To summarise multiple columns by group, tidyfst provides a function named summarise_vars, which is even more convenient than the across function in dplyr. You first choose the columns, then specify the function to apply, and you can optionally provide the “by” parameter to operate by group.

bench::mark(
  data.table =DT[,lapply(.SD,mean),by = id4,.SDcols = v1:v3],
  tidyfst = DT %>%
    summarise_vars(
      v1:v3,
      mean,
      by = id4
    ),
  dplyr = DT %>%
    group_by(id4) %>%
    summarise(across(v1:v3,mean)),
  check = FALSE,iterations = 10
) -> q4

q4
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 data.table  844.8µs 902.55µs     1059.     335KB      0  
#> 2 tidyfst      2.97ms   3.06ms      326.     303KB      0  
#> 3 dplyr        6.22ms   6.56ms      152.     560KB     16.9

Looking at the performance, tidyfst still lies between data.table and dplyr.

Q5

Now let’s try more groups. Here we use all the id columns (id1 to id6) as groups and get the sum and count. Note that tidyfst is built on data.table, so it does not use n() from dplyr but .N from data.table to get counts by group.
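As a small illustration of counting with .N, the sketch below uses a toy table (`toy`, `g` and `x` are made-up names) and computes a sum and a count per group, first in data.table and then with summarise_dt.

```r
library(data.table)
library(tidyfst)

toy <- data.table(
  g = c("a", "b", "a", "a", "b"),
  x = 1:5
)

# data.table: .N is the number of rows in the current group
counts_dt <- toy[, .(total = sum(x), count = .N), by = g]

# tidyfst: the same .N symbol is used inside summarise_dt
counts_fst <- toy %>%
  summarise_dt(
    by = g,
    total = sum(x),
    count = .N
  )

counts_dt
counts_fst
```

In dplyr the equivalent count would be written as n() inside summarise(), which is the difference the paragraph above points out.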

bench::mark(
  data.table =DT[,.(v3 = sum(v3),count = .N),by = id1:id6],
  tidyfst = DT %>%
    summarise_dt(
      by = id1:id6,
      v3 = sum(v3),
      count = .N
    ),
  dplyr = DT %>%
    group_by(id1,id2,id3,id4,id5,id6) %>%
    summarise(v3 = sum(v3),count = n()),
  check = FALSE,iterations = 10
) -> q5
#> `summarise()` has grouped output by 'id1', 'id2', 'id3', 'id4', 'id5'. You can
#> override using the `.groups` argument.

q5
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 data.table   2.12ms    2.2ms     444.     1.03MB      0  
#> 2 tidyfst      2.18ms   2.27ms     438.     1.03MB      0  
#> 3 dplyr       81.86ms  85.23ms      11.6    3.71MB     16.3

Last words

Although on a data set of ~0.5 Mb the performance of tidyfst lies between data.table and dplyr, its speed is much closer to that of data.table. In fact, if you try a much larger data set on a computer with large RAM and multiple cores, you’ll find that the performance of tidyfst stays close to data.table. If you are interested and have a high-performance computer, try generating a larger data set and test it out. Moreover, while dplyr users might find these data manipulation verbs friendly, the native syntax of tidyfst is more like data.table, and it could be a good companion to data.table for some frequently used complex tasks.
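For readers who want to try larger sizes, one possible way to do so is to wrap the data generation from the start of this vignette in a function. The helper name `make_bench_data` and its `seed` argument are hypothetical and introduced here only for illustration; the column definitions are the ones used above.

```r
library(data.table)

# Hypothetical helper: same columns as the DT used in this vignette,
# with the number of rows (N) and of large groups (K) as parameters
make_bench_data <- function(N, K, seed = 2020) {
  set.seed(seed)
  data.table(
    id1 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
    id2 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
    id3 = sample(sprintf("id%010d", 1:(N / K)), N, TRUE), # small groups (char)
    id4 = sample(K, N, TRUE),                             # large groups (int)
    id5 = sample(K, N, TRUE),                             # large groups (int)
    id6 = sample(N / K, N, TRUE),                         # small groups (int)
    v1  = sample(5, N, TRUE),                             # int in range [1,5]
    v2  = sample(5, N, TRUE),                             # int in range [1,5]
    v3  = round(runif(N, max = 100), 4)                   # numeric e.g. 23.5749
  )
}

DT_small <- make_bench_data(N = 1e4, K = 1e2)  # same size as in this vignette
dim(DT_small)

# On a machine with enough memory, a larger N can be tried, for example:
# DT_large <- make_bench_data(N = 1e7, K = 1e2)
```

The benchmark expressions from Q1 to Q5 can then be rerun on the larger table without other changes.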

Session information

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bench_1.1.3       dplyr_1.1.2       data.table_1.14.8 tidyfst_1.7.7    
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_1.8.7    compiler_4.3.1    tidyselect_1.2.0  Rcpp_1.0.11      
#>  [5] stringr_1.5.0     parallel_4.3.1    jquerylib_0.1.4   systemfonts_1.0.4
#>  [9] textshaping_0.3.6 yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#> [13] generics_0.1.3    knitr_1.43        tibble_3.2.1      desc_1.4.2       
#> [17] rprojroot_2.0.3   bslib_0.5.0       pillar_1.9.0      rlang_1.1.1      
#> [21] utf8_1.2.3        cachem_1.0.8      stringi_1.7.12    xfun_0.39        
#> [25] fs_1.6.2          sass_0.4.7        memoise_2.0.1     cli_3.6.1        
#> [29] withr_2.5.0       pkgdown_2.0.7     magrittr_2.0.3    digest_0.6.33    
#> [33] rstudioapi_0.15.0 fst_0.9.8         lifecycle_1.0.3   vctrs_0.6.3      
#> [37] evaluate_0.21     glue_1.6.2        ragg_1.2.5        profmem_0.6.0    
#> [41] fansi_1.0.4       fstcore_0.9.14    rmarkdown_2.23    purrr_1.0.1      
#> [45] pkgconfig_2.0.3   tools_4.3.1       htmltools_0.5.5