This vignette shows how to use nesting in tidyfst. It draws on tidyr’s vignette at https://tidyr.tidyverse.org/articles/nest.html.
First, we nest the “mtcars” data.frame by the “cyl” column.
library(tidyfst)
#> Thank you for using tidyfst!
#> To acknowledge our work, please cite the package:
#> Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388
# nest by "cyl" column
mtcars_nested <- mtcars %>%
  nest_dt(cyl) # you can use "cyl" too, very flexible
# inspect the output data.table
mtcars_nested
#> cyl ndt
#> <num> <list>
#> 1: 6 <data.table[7x10]>
#> 2: 4 <data.table[11x10]>
#> 3: 8 <data.table[14x10]>
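Each entry of the “ndt” list column is itself a data.table, so it can be pulled out and inspected like any other list element. The following is a minimal sketch; the name `first_group` is introduced here only for illustration.

```r
library(tidyfst)

# Rebuild the nested table so this block runs on its own
mtcars_nested <- mtcars %>%
  nest_dt(cyl)

# Extract the first nested data.table from the "ndt" list column
first_group <- mtcars_nested$ndt[[1]]

# Its dimensions should match the first row of the printed summary
dim(first_group)

# The grouping column "cyl" lives in the outer table, not the nested one
names(first_group)
```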
Next, we fit a regression within each “cyl” group. We’ll use lapply to do this:
mtcars_nested2 <- mtcars_nested %>%
  mutate_dt(model = lapply(ndt, function(df) lm(mpg ~ wt, data = df)))
mtcars_nested2
#> cyl ndt model
#> <num> <list> <list>
#> 1: 6 <data.table[7x10]> <lm[12]>
#> 2: 4 <data.table[11x10]> <lm[12]>
#> 3: 8 <data.table[14x10]> <lm[12]>
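Because each element of the “model” column is an ordinary lm object, per-group summaries can be computed with the same apply-style pattern. The sketch below assumes that `mutate_dt` accepts several new columns in one call; the column names `slope` and `r_squared` and the object `model_stats` are hypothetical, chosen only for this example.

```r
library(tidyfst)

# Rebuild the nested models so this block runs on its own
model_stats <- mtcars %>%
  nest_dt(cyl) %>%
  mutate_dt(model = lapply(ndt, function(df) lm(mpg ~ wt, data = df))) %>%
  mutate_dt(
    # sapply() returns one number per model, giving ordinary numeric columns
    slope = sapply(model, function(m) coef(m)[["wt"]]),
    r_squared = sapply(model, function(m) summary(m)$r.squared)
  )

model_stats
```

Using sapply rather than lapply here yields plain numeric columns instead of list columns, so no unnesting is needed to read the values.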
As the printed mtcars_nested2 shows, the fitted models are stored in the “model” column. Next, we extract the fitted values from each model.
mtcars_nested3 <- mtcars_nested2 %>%
  mutate_dt(model_predict = lapply(model, predict))
mtcars_nested3$model_predict
#> [[1]]
#> 1 2 3 4 5 6 7
#> 21.12497 20.41604 19.47080 18.78968 18.84528 18.84528 20.70795
#>
#> [[2]]
#> 1 2 3 4 5 6 7 8
#> 26.47010 21.55719 21.78307 27.14774 30.45125 29.20890 25.65128 28.64420
#> 9 10 11
#> 27.48656 31.02725 23.87247
#>
#> [[3]]
#> 1 2 3 4 5 6 7 8
#> 16.32604 16.04103 14.94481 15.69024 15.58061 12.35773 11.97625 12.14945
#> 9 10 11 12 13 14
#> 16.15065 16.33700 15.44907 15.43811 16.91800 16.04103
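Calling predict on an lm object without supplying `newdata` returns the in-sample fitted values, which is why it can be mapped over the models directly. A small base R check on the 6-cylinder subset illustrates this; the names `six_cyl`, `fit`, and `same` are used only for this sketch.

```r
# Base R only: fit one model on the 6-cylinder subset of mtcars
six_cyl <- mtcars[mtcars$cyl == 6, ]
fit <- lm(mpg ~ wt, data = six_cyl)

# With no `newdata`, predict() gives the same values as fitted()
same <- all.equal(predict(fit), fitted(fit))
same
```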
The printed result shows that “model_predict” is a list of numeric vectors. Let’s unnest this column.
mtcars_nested3 %>% unnest_dt(model_predict)
#> cyl model_predict
#> <num> <num>
#> 1: 6 21.12497
#> 2: 6 20.41604
#> 3: 6 19.47080
#> 4: 6 18.78968
#> 5: 6 18.84528
#> 6: 6 18.84528
#> 7: 6 20.70795
#> 8: 4 26.47010
#> 9: 4 21.55719
#> 10: 4 21.78307
#> 11: 4 27.14774
#> 12: 4 30.45125
#> 13: 4 29.20890
#> 14: 4 25.65128
#> 15: 4 28.64420
#> 16: 4 27.48656
#> 17: 4 31.02725
#> 18: 4 23.87247
#> 19: 8 16.32604
#> 20: 8 16.04103
#> 21: 8 14.94481
#> 22: 8 15.69024
#> 23: 8 15.58061
#> 24: 8 12.35773
#> 25: 8 11.97625
#> 26: 8 12.14945
#> 27: 8 16.15065
#> 28: 8 16.33700
#> 29: 8 15.44907
#> 30: 8 15.43811
#> 31: 8 16.91800
#> 32: 8 16.04103
#> cyl model_predict
Unnesting one list column automatically removes all the other list columns. In our case, the “ndt” column is dropped, and so is “model”.
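The same behaviour applies in the other direction: unnesting “ndt” instead should bring back the original rows while dropping the model-related list columns. This is a hedged sketch that assumes `unnest_dt` handles a list column of data.tables in the way described above; the object name `restored` is hypothetical.

```r
library(tidyfst)

# Rebuild the full nested pipeline so this block runs on its own
nested <- mtcars %>%
  nest_dt(cyl) %>%
  mutate_dt(model = lapply(ndt, function(df) lm(mpg ~ wt, data = df))) %>%
  mutate_dt(model_predict = lapply(model, predict))

# Unnest the data column rather than the predictions
restored <- nested %>% unnest_dt(ndt)

# Check which columns survive and how many rows come back
names(restored)
nrow(restored)
```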