Example 3: Reshape • tidyfst

When I design longer_dt and wider_dt, I could find the pivot_longer and pivot_wider in tidyr and melt and dcast in data.table. Still, designing this API is not easy, as my goal is to let users use it with least pain. Here we would try to reproduce the results in the vignette of tidyr(https://tidyr.tidyverse.org/articles/pivot.html). First load the packages:

library(tidyfst)
#> Thank you for using tidyfst!
#> To acknowledge our work, please cite the package:
#> Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388
library(tidyr)

Longer

First inspect the data:

relig_income
#> # A tibble: 18 × 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> #   `Don't know/refused` <dbl>

In tidyr, to get the longer format you need:

relig_income %>% 
  pivot_longer(-religion, names_to = "income", values_to = "count")

In tidyfst, we have:

relig_income %>% 
  longer_dt("religion",name = "income",value = "count")
#>                   religion             income count
#>                     <char>             <fctr> <num>
#>   1:              Agnostic              <$10k    27
#>   2:               Atheist              <$10k    12
#>   3:              Buddhist              <$10k    27
#>   4:              Catholic              <$10k   418
#>   5:    Don’t know/refused              <$10k    15
#>  ---                                               
#> 176:              Orthodox Don't know/refused    73
#> 177:       Other Christian Don't know/refused    18
#> 178:          Other Faiths Don't know/refused    71
#> 179: Other World Religions Don't know/refused     8
#> 180:          Unaffiliated Don't know/refused   597

Another example from tidyr:

billboard

# tidyr way:
 billboard %>%
   pivot_longer(
     cols = starts_with("wk"),
     names_to = "week",
     values_to = "rank",
     values_drop_na = TRUE
   )

# tidyfst way:
billboard %>% 
  longer_dt(-"wk",
            name = "week",
            value = "rank",
            na.rm = TRUE
            )
# regex could select the groups to keep, and minus could select the reverse

A warning would could come out because the merging column has different data types and do the coercion automatically.

Wider

## data
fish_encounters
#> # A tibble: 114 × 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # ℹ 104 more rows

## tidyr way:
fish_encounters %>% 
  pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 × 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#>  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
#>  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
#>  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#>  8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
#>  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#> 12 4857        1     1      1     1       1     1     1     1     1    NA    NA
#> 13 4858        1     1      1     1       1     1     1     1     1     1     1
#> 14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
#> 15 4861        1     1      1     1       1     1     1     1     1     1     1
#> 16 4862        1     1      1     1       1     1     1     1     1    NA    NA
#> 17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
#> 19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

## tidyfst way:
fish_encounters %>% 
  wider_dt(name = "station",value = "seen")
#> Key: <fish>
#>       fish Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2
#>     <fctr>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int>
#>  1:   4842       1     1      1     1       1     1     1     1     1
#>  2:   4843       1     1      1     1       1     1     1     1     1
#>  3:   4844       1     1      1     1       1     1     1     1     1
#>  4:   4845       1     1      1     1       1    NA    NA    NA    NA
#>  5:   4847       1     1      1    NA      NA    NA    NA    NA    NA
#>  6:   4848       1     1      1     1      NA    NA    NA    NA    NA
#>  7:   4849       1     1     NA    NA      NA    NA    NA    NA    NA
#>  8:   4850       1     1     NA     1       1     1     1    NA    NA
#>  9:   4851       1     1     NA    NA      NA    NA    NA    NA    NA
#> 10:   4854       1     1     NA    NA      NA    NA    NA    NA    NA
#> 11:   4855       1     1      1     1       1    NA    NA    NA    NA
#> 12:   4857       1     1      1     1       1     1     1     1     1
#> 13:   4858       1     1      1     1       1     1     1     1     1
#> 14:   4859       1     1      1     1       1    NA    NA    NA    NA
#> 15:   4861       1     1      1     1       1     1     1     1     1
#> 16:   4862       1     1      1     1       1     1     1     1     1
#> 17:   4863       1     1     NA    NA      NA    NA    NA    NA    NA
#> 18:   4864       1     1     NA    NA      NA    NA    NA    NA    NA
#> 19:   4865       1     1      1    NA      NA    NA    NA    NA    NA
#> 2 variables not shown: [MAE <int>, MAW <int>]

# if no keeped groups are selected, use all except for name and value columns

If you want to fill with 0s, use:

fish_encounters %>% 
  wider_dt(name = "station",value = "seen",fill = 0)
#> Key: <fish>
#>       fish Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2
#>     <fctr>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int>
#>  1:   4842       1     1      1     1       1     1     1     1     1
#>  2:   4843       1     1      1     1       1     1     1     1     1
#>  3:   4844       1     1      1     1       1     1     1     1     1
#>  4:   4845       1     1      1     1       1     0     0     0     0
#>  5:   4847       1     1      1     0       0     0     0     0     0
#>  6:   4848       1     1      1     1       0     0     0     0     0
#>  7:   4849       1     1      0     0       0     0     0     0     0
#>  8:   4850       1     1      0     1       1     1     1     0     0
#>  9:   4851       1     1      0     0       0     0     0     0     0
#> 10:   4854       1     1      0     0       0     0     0     0     0
#> 11:   4855       1     1      1     1       1     0     0     0     0
#> 12:   4857       1     1      1     1       1     1     1     1     1
#> 13:   4858       1     1      1     1       1     1     1     1     1
#> 14:   4859       1     1      1     1       1     0     0     0     0
#> 15:   4861       1     1      1     1       1     1     1     1     1
#> 16:   4862       1     1      1     1       1     1     1     1     1
#> 17:   4863       1     1      0     0       0     0     0     0     0
#> 18:   4864       1     1      0     0       0     0     0     0     0
#> 19:   4865       1     1      1     0       0     0     0     0     0
#> 2 variables not shown: [MAE <int>, MAW <int>]

Note that the parameter of name and value should always be provided and should be explicit called (with the parameter names attached).

More complicated example

This example comes from data.table (https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html), and has been used in tidyr too. We’ll try to do it in tidyfst in this example. If we have a data.frame as below:

family <- fread("family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA")

family
#>    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
#>        <int>      <int>     <IDat>     <IDat>     <IDat>         <int>
#> 1:         1         30 1998-11-26 2000-01-29       <NA>             1
#> 2:         2         27 1996-06-22       <NA>       <NA>             2
#> 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2
#> 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1
#> 5:         5         29 2000-12-05 2005-02-28       <NA>             2
#> 2 variables not shown: [gender_child2 <int>, gender_child3 <int>]

And want to reshape the data.table to be like this:

#     family_id age_mother  child        dob gender
#         <int>      <int> <char>     <char> <char>
#  1:         1         30 child1 1998-11-26      1
#  2:         1         30 child2 2000-01-29      2
#  3:         1         30 child3       <NA>   <NA>
#  4:         2         27 child1 1996-06-22      2
#  5:         2         27 child2       <NA>   <NA>
#  6:         2         27 child3       <NA>   <NA>
#  7:         3         26 child1 2002-07-11      2
#  8:         3         26 child2 2004-04-05      2
#  9:         3         26 child3 2007-09-02      1
# 10:         4         32 child1 2004-10-10      1
# 11:         4         32 child2 2009-08-27      1
# 12:         4         32 child3 2012-07-21      1
# 13:         5         29 child1 2000-12-05      2
# 14:         5         29 child2 2005-02-28      1
# 15:         5         29 child3       <NA>   <NA>

The data.table::dcast and tidyr::pivot_longer could transfer it in one step, however, not so easy to understand. Here we’ll do it step by step to see what actually happens in this transfer.

family %>% 
  longer_dt(1:2) %>% 
  separate_dt("name",into = c("class","child")) %>% 
  wider_dt(-"class|value",
           name = "class",
           value = "value")
#> Key: <family_id, age_mother, child>
#>     family_id age_mother  child        dob     gender
#>         <int>      <int> <char>     <IDat>     <IDat>
#>  1:         1         30 child1 1998-11-26 1970-01-02
#>  2:         1         30 child2 2000-01-29 1970-01-03
#>  3:         1         30 child3       <NA>       <NA>
#>  4:         2         27 child1 1996-06-22 1970-01-03
#>  5:         2         27 child2       <NA>       <NA>
#>  6:         2         27 child3       <NA>       <NA>
#>  7:         3         26 child1 2002-07-11 1970-01-03
#>  8:         3         26 child2 2004-04-05 1970-01-03
#>  9:         3         26 child3 2007-09-02 1970-01-02
#> 10:         4         32 child1 2004-10-10 1970-01-02
#> 11:         4         32 child2 2009-08-27 1970-01-02
#> 12:         4         32 child3 2012-07-21 1970-01-02
#> 13:         5         29 child1 2000-12-05 1970-01-03
#> 14:         5         29 child2 2005-02-28 1970-01-02
#> 15:         5         29 child3       <NA>       <NA>

In such a process, we could find that we actually get a longer table, then separate it, and wider it later. tidyfst is not going to support the complicated transfer in one step, because it might be easier to implement, but much harder to understand 3 procedures in 1 step. If you still prefer that way, use data.table::dcast and tidyr::pivot_longer instead.