我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。

测试数据构造

本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。

library(tidyfst)
#> Thank you for using tidyfst!
#> To acknowledge our work, please cite the package:
#> Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388
diamonds <- ggplot2::diamonds
n = 1e5  #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
                     to = as.Date("2020-01-01"),
                     by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE)  # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

print(dat1)
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  2.23   Premium     J     VS2  61.0    58 14867  8.39  8.36
#>      2:  0.46     Ideal     F     SI2  61.5    54   758  4.98  5.01
#>      3:  0.70     Ideal     H    VVS1  60.9    57  3611  5.70  5.78
#>      4:  1.50   Premium     F     VS2  61.5    58 14719  7.34  7.32
#>      5:  0.40 Very Good     G      IF  61.0    59  1154  4.75  4.79
#>     ---                                                            
#>  99996:  1.14     Ideal     F     VS2  61.4    56  8017  6.73  6.78
#>  99997:  0.39     Ideal     F     SI1  61.8    55   886  4.70  4.72
#>  99998:  2.00      Good     F     SI1  63.0    64 17869  7.67  7.76
#>  99999:  1.00   Premium     H     VS2  59.7    59  5139  6.57  6.52
#> 100000:  0.30     Ideal     G    VVS2  60.6    57   878  4.35  4.33
#> 2 variable(s) not shown: [z <num>, dt <Date>]

基础

小技巧

后面的分析,经常要根据日期进行计算。所以,先对日期进行排序,就能够提高运行速度。在tidyfst中,可以使用arrange_dt函数来对数据进行原位的各种操作,其中就包括排序。

dat1 = arrange_dt(dat1,dt)
dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

那么,现在dat1的数据就按照日期排好序了。

聚合

1.求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格

在tidyfst中,我设置了一个sys_time_print函数,可以方便地输出system.time()函数返回的结果。

sys_time_print({
  r1_1 <- dat1 %>% 
    summarise_dt(
      by = .(cut,color),
      mean_price = mean(price, na.rm = TRUE),
      median_price = median(price, na.rm = TRUE),
      max_price = max(price, na.rm = TRUE)
    )
})
#> [1] "# Finished in 0.000s elapsed (0.000s cpu)"
r1_1
#>           cut color mean_price median_price max_price
#>         <ord> <ord>      <num>        <num>     <int>
#>  1: Very Good     H   4390.470       3394.0     18803
#>  2:      Good     G   3999.886       3303.0     18788
#>  3:     Ideal     F   3309.061       1751.0     18780
#>  4:   Premium     E   3481.628       1881.0     18426
#>  5:      Fair     D   4237.899       3205.0     16386
#>  6:   Premium     J   6416.235       5110.5     18706
#>  7:   Premium     F   4347.158       2874.0     18791
#>  8:     Ideal     D   2608.030       1550.5     18693
#>  9: Very Good     G   3900.558       2403.0     18818
#> 10: Very Good     E   3301.873       2051.0     18731
#> 11:   Premium     G   4543.181       2780.0     18741
#> 12:     Ideal     G   3751.035       1881.0     18806
#> 13:     Ideal     E   2558.501       1429.0     18729
#> 14:   Premium     H   5215.019       4451.0     18795
#> 15:   Premium     I   5921.498       4545.5     18823
#> 16: Very Good     F   3792.863       2473.0     18777
#> 17:      Good     D   3369.352       2661.0     18468
#> 18:      Fair     G   4186.403       2797.0     18574
#> 19:      Good     E   3293.289       2407.0     18236
#> 20:   Premium     D   3796.609       2348.0     18286
#> 21:     Ideal     H   3924.682       2320.0     18659
#> 22: Very Good     I   5475.587       4013.0     18500
#> 23:     Ideal     I   4434.030       2631.0     18779
#> 24:      Fair     J   5029.502       3422.0     18531
#> 25: Very Good     D   3437.133       2242.0     18526
#> 26:      Good     F   3518.548       2755.0     18686
#> 27:      Good     I   4966.745       3484.0     18707
#> 28:      Good     H   4098.280       3323.0     18640
#> 29:      Fair     H   5454.118       4134.5     18308
#> 30:      Fair     F   3790.703       2961.0     17995
#> 31:     Ideal     J   4972.801       4135.0     18508
#> 32: Very Good     J   5107.638       4126.5     18430
#> 33:      Fair     I   4278.831       3022.0     18242
#> 34:      Good     J   4578.490       3690.0     18325
#> 35:      Fair     E   3834.358       2996.5     15584
#>           cut color mean_price median_price max_price

tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。

2.求每天最高出售价格对应的那笔订单

sys_time_print({
  r1_2 <- dat1 %>% 
    arrange_dt(dt,-price) %>% 
    drop_na_dt(price) %>% 
    group_dt(
      by = dt,
      head(1)
    )
})
#> [1] "# Finished in 0.020s elapsed (0.000s cpu)"
r1_2
#>               dt carat       cut color clarity depth table price     x
#>           <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>    1: 2011-01-01  2.10   Premium     J     VS2  58.3    54 12401  8.49
#>    2: 2011-01-02  2.02   Premium     I     VS2  59.0    59 17893  8.34
#>    3: 2011-01-03  1.61     Ideal     F     VS1  62.0    55 17414  7.55
#>    4: 2011-01-04  2.19   Premium     E     SI2  62.4    61 18232  8.31
#>    5: 2011-01-05  2.05   Premium     H     VS2  60.7    61 16235  8.25
#>   ---                                                                 
#> 3284: 2019-12-28  2.22     Ideal     J     SI2  62.8    57 13703  8.33
#> 3285: 2019-12-29  2.10 Very Good     J     VS2  61.3    59 14229  8.21
#> 3286: 2019-12-30  1.55 Very Good     H     VS2  63.3    56 10546  7.38
#> 3287: 2019-12-31  1.51 Very Good     F    VVS2  63.1    56 17317  7.32
#> 3288: 2020-01-01  2.04   Premium     J     VS2  63.0    58 13734  8.02
#> 2 variable(s) not shown: [y <num>, z <num>]

join

1.dat1与dat2以dt列左连接

实质上,merge函数已经优化得很好。tidyfst设计*_join系列函数的时候,只是为了一种不一样的语法结构来帮助实现不同的连接,因为它确实更加直观一些。但是实质上它还是merge.data.table函数的包装版本。

sys_time_print({
  r2_1 <- dat1 %>% 
    left_join_dt(dat2,by = "dt")
})
#> [1] "# Finished in 0.010s elapsed (0.010s cpu)"
r2_1
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 3 variable(s) not shown: [y <num>, z <num>, price1 <int>]

2.多重join

sys_time_print({
  mymerge <- function(x, y) left_join_dt(x, y, by = "dt")
  r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))
})
#> [1] "# Finished in 0.030s elapsed (0.030s cpu)"
r2_2
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 4 variable(s) not shown: [y <num>, z <num>, price1 <int>, price2 <int>]

长宽表转换

1.长表转宽表

sys_time_print({
  mean1 <- function(x) mean(x, na.rm = TRUE)
  max1 <- function(x) max(x, na.rm = TRUE)
  r3_1 <-dat1 %>% 
    wider_dt(cut,
             value = c("depth", "price"),
             name = "color",
             fun = list(mean1,max1))
})
#> [1] "# Finished in 0.030s elapsed (0.000s cpu)"
r3_1
#> Key: <cut>
#>          cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#>        <ord>         <num>         <num>         <num>         <num>
#> 1:      Fair      64.15688      63.09882      63.41129      64.33663
#> 2:      Good      62.32176      62.16065      62.18220      62.60212
#> 3: Very Good      61.70454      61.73751      61.73138      61.83043
#> 4:   Premium      61.18875      61.16850      61.25098      61.27894
#> 5:     Ideal      61.67241      61.68693      61.65616      61.70585
#> 24 variable(s) not shown: [depth_mean1_H <num>, depth_mean1_I <num>, depth_mean1_J <num>, price_mean1_D <num>, price_mean1_E <num>, price_mean1_F <num>, price_mean1_G <num>, price_mean1_H <num>, price_mean1_I <num>, price_mean1_J <num>, ...]

2.宽表转长表

sys_time_print({
  r3_2 <-dat1 %>% 
    select_dt(cut,color,x,y,z) %>% 
    longer_dt(cut,color,
              name = "xyz",
              value = "xyzvalue")
})
#> [1] "# Finished in 0.000s elapsed (0.000s cpu)"

r3_2 
#>               cut color    xyz xyzvalue
#>             <ord> <ord> <fctr>    <num>
#>      1: Very Good     H      x     4.41
#>      2:      Good     G      x     4.26
#>      3:     Ideal     F      x     6.69
#>      4:   Premium     E      x     6.52
#>      5:      Fair     D      x     4.92
#>     ---                                
#> 299996:     Ideal     J      z     2.73
#> 299997:     Ideal     G      z     2.75
#> 299998: Very Good     I      z     2.58
#> 299999:     Ideal     G      z     2.66
#> 300000:     Ideal     F      z     2.74

高阶

向上/下填充空值

对于填充空值来说,可以这样操作:

sys_time_print({
  dat1 %>% fill_na_dt(price) -> dat1
})
#> [1] "# Finished in 0.010s elapsed (0.000s cpu)"
dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

添加子维度聚合结果为新列

1.以dat1为例,添加两列,一列为以cut、color聚合求price的均值,另一列是求标准差


sys_time_print({
  mutate_dt(dat1,
           mean_price = mean(price, na.rm = TRUE),
           sd_price = sd(price, na.rm = TRUE),
           by = .(cut, color))
})
#> [1] "# Finished in 0.050s elapsed (0.010s cpu)"

dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

2.以dat1为例,以dt分组添加一列序号id


sys_time_print({
  dat1 %>% 
  group_dt(
    by = dt,
    mutate_dt(id = seq(.N))
  ) -> dat1
})
#> [1] "# Finished in 0.790s elapsed (0.340s cpu)"
dat1
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 3 variable(s) not shown: [y <num>, z <num>, id <int>]

移动函数


sys_time_print({
  dat1 %>% 
    group_dt(
      by = color,
      mutate_dt(
        MA10_price = frollmean(price, 10),
        MSD10_price = frollapply(price, 10, FUN = sd)
      )
    ) -> dat1
})
#> [1] "# Finished in 0.420s elapsed (0.140s cpu)"

dat1
#>         color         dt carat       cut clarity depth table price     x
#>         <ord>     <Date> <num>     <ord>   <ord> <num> <num> <int> <num>
#>      1:     H 2011-01-01  0.34 Very Good     VS2  62.4    60   537  4.41
#>      2:     H 2011-01-01  1.20   Premium     SI2  62.1    57  3965  6.84
#>      3:     H 2011-01-01  1.51   Premium     SI1  58.0    59 10063  7.57
#>      4:     H 2011-01-02  1.00   Premium     VS2  60.7    60  4930  6.55
#>      5:     H 2011-01-02  1.01     Ideal     SI2  61.1    56  4693  6.44
#>     ---                                                                 
#>  99996:     I 2019-12-30  0.54     Ideal     SI1  62.1    54  1057  5.24
#>  99997:     I 2019-12-31  0.90 Very Good     VS2  59.3    59  3992  6.27
#>  99998:     I 2019-12-31  0.30     Ideal     SI1  61.3    56   506  4.33
#>  99999:     I 2020-01-01  1.22   Premium     VS2  62.3    59  5832  6.79
#> 100000:     I 2020-01-01  0.28 Very Good     VS2  61.9    57   379  4.16
#> 5 variable(s) not shown: [y <num>, z <num>, id <int>, MA10_price <num>, MSD10_price <num>]

系统参数

sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyfst_1.8.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5      jsonlite_1.8.8    dplyr_1.1.4       compiler_4.4.1   
#>  [5] tidyselect_1.2.1  Rcpp_1.0.13       stringr_1.5.1     parallel_4.4.1   
#>  [9] jquerylib_0.1.4   systemfonts_1.1.0 scales_1.3.0      textshaping_0.4.0
#> [13] yaml_2.3.9        fastmap_1.2.0     ggplot2_3.5.1     R6_2.5.1         
#> [17] generics_0.1.3    knitr_1.48        htmlwidgets_1.6.4 tibble_3.2.1     
#> [21] desc_1.4.3        munsell_0.5.1     bslib_0.7.0       pillar_1.9.0     
#> [25] rlang_1.1.4       utf8_1.2.4        cachem_1.1.0      stringi_1.8.4    
#> [29] xfun_0.46         fs_1.6.4          sass_0.4.9        cli_3.6.3        
#> [33] pkgdown_2.1.0     magrittr_2.0.3    digest_0.6.36     grid_4.4.1       
#> [37] rstudioapi_0.16.0 fst_0.9.8         lifecycle_1.0.4   vctrs_0.6.5      
#> [41] evaluate_0.24.0   glue_1.7.0        data.table_1.16.0 ragg_1.3.2       
#> [45] fansi_1.0.6       colorspace_2.1-0  fstcore_0.9.18    rmarkdown_2.27   
#> [49] pkgconfig_2.0.3   tools_4.4.1       htmltools_0.5.8.1