This function computes term frequency–inverse document frequency (tf–idf) on a dataset that has either one row per term occurrence or one row per document–term pair with a pre-computed count. It preserves the original column names and adds four columns:

- `n`: raw count (computed or user-supplied)
- `tf`: term frequency per document
- `idf`: inverse document frequency per group (or per corpus when there is no grouping)
- `tf_idf`: tf × idf

If `group_col` is `NULL`, all documents are treated as a single group.
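The page does not spell out the formulas behind these columns. The definitions below are an assumption, chosen because they reproduce the values printed in the Examples output (natural logarithm, no smoothing). Here $n_{t,d}$ is the count of term $t$ in document $d$, and $N$ is the number of documents in the group (or in the whole corpus when `group_col` is `NULL`); these symbols are introduced only for illustration.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\left|\{\, d : n_{t,d} > 0 \,\}\right|}, \qquad
\mathrm{tf\_idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)

% Worked check against the Examples output (category A, document d2, term "cherry"):
% d2 contains two terms, one of which is "cherry"; "cherry" occurs in 1 of 3 documents.
\mathrm{tf} = \tfrac{1}{2} = 0.5, \qquad
\mathrm{idf} = \ln\tfrac{3}{1} \approx 1.0986123, \qquad
\mathrm{tf\_idf} \approx 0.5 \times 1.0986123 \approx 0.5493061
```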

bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)

Arguments

.data

A data.frame or data.table of text data.

group_col

Character name of grouping column, or `NULL` for no grouping.

doc_col

Character name of document identifier column.

term_col

Character name of term/word column.

n_col

(Optional) Character name of pre-counted term-frequency column. If `NULL` (default), counts are computed via `.N`.
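As a rough illustration of what computing counts via `.N` can look like, the sketch below builds `n`, `tf`, `idf`, and `tf_idf` by hand with data.table for an ungrouped input. It is not the package's implementation: the column names `doc_id` and `word` and the object names `dt`, `counts`, and `n_docs` are chosen here for illustration, and the natural-log idf is an assumption inferred from the example output.

```r
library(data.table)

# Toy input: one row per term occurrence (category "A" from the Examples data)
dt <- data.table(
  doc_id = c("d1", "d2", "d3", "d1", "d2", "d3"),
  word   = c("apple", "banana", "apple", "banana", "cherry", "apple")
)

# n: number of rows per document-term pair, computed with .N
counts <- dt[, .(n = .N), by = .(doc_id, word)]

# tf: each term's share of all term occurrences in its document
counts[, tf := n / sum(n), by = doc_id]

# idf: log(total documents / documents containing the term);
# counts has one row per document-term pair, so .N by word is the
# number of documents in which the word appears
n_docs <- uniqueN(counts$doc_id)
counts[, idf := log(n_docs / .N), by = word]

# tf_idf: product of the two
counts[, tf_idf := tf * idf]

print(counts)
```

With this input the sketch should reproduce the values shown in the "Without groups" example. When counts already exist, the first step is skipped and the supplied column plays the role of `n`.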

Value

A data.table containing:

- The original grouping, document, and term columns
- `n`, `tf`, `idf`, and `tf_idf`

Examples


# With groups
df <- data.frame(
  category = rep(c("A","B"), each = 6),
  doc_id   = rep(c("d1","d2","d3"), times = 4),
  word     = c("apple","banana","apple","banana","cherry","apple",
               "dog","cat","dog","mouse","cat","dog"),
  stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result
#>    category doc_id   word     n    tf       idf    tf_idf
#>      <char> <char> <char> <int> <num>     <num>     <num>
#> 1:        A     d1  apple     1   0.5 0.4054651 0.2027326
#> 2:        A     d2 banana     1   0.5 0.4054651 0.2027326
#> 3:        A     d3  apple     2   1.0 0.4054651 0.4054651
#> 4:        A     d1 banana     1   0.5 0.4054651 0.2027326
#> 5:        A     d2 cherry     1   0.5 1.0986123 0.5493061
#> 6:        B     d1    dog     1   0.5 0.4054651 0.2027326
#> 7:        B     d2    cat     2   1.0 1.0986123 1.0986123
#> 8:        B     d3    dog     2   1.0 0.4054651 0.4054651
#> 9:        B     d1  mouse     1   0.5 1.0986123 0.5493061

# Without groups
df %>%
  filter_dt(category == "A") %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word")
#>    doc_id   word     n    tf       idf    tf_idf
#>    <char> <char> <int> <num>     <num>     <num>
#> 1:     d1  apple     1   0.5 0.4054651 0.2027326
#> 2:     d2 banana     1   0.5 0.4054651 0.2027326
#> 3:     d3  apple     2   1.0 0.4054651 0.4054651
#> 4:     d1 banana     1   0.5 0.4054651 0.2027326
#> 5:     d2 cherry     1   0.5 1.0986123 0.5493061

# With counts provided
df %>%
  filter_dt(category == "A") %>%
  count_dt() %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word", n_col = "n")
#>    doc_id   word     n    tf       idf    tf_idf
#>    <char> <char> <int> <num>     <num>     <num>
#> 1:     d3  apple     2   1.0 0.4054651 0.4054651
#> 2:     d1  apple     1   0.5 0.4054651 0.2027326
#> 3:     d2 banana     1   0.5 0.4054651 0.2027326
#> 4:     d1 banana     1   0.5 0.4054651 0.2027326
#> 5:     d2 cherry     1   0.5 1.0986123 0.5493061
# With counts provided and groups
df %>%
  count_dt() %>%
  bind_tf_idf_dt(group_col = "category",
                 doc_col = "doc_id",
                 term_col = "word",
                 n_col = "n")
#>    category doc_id   word     n    tf       idf    tf_idf
#>      <char> <char> <char> <int> <num>     <num>     <num>
#> 1:        A     d3  apple     2   1.0 0.4054651 0.4054651
#> 2:        B     d2    cat     2   1.0 1.0986123 1.0986123
#> 3:        B     d3    dog     2   1.0 0.4054651 0.4054651
#> 4:        A     d1  apple     1   0.5 0.4054651 0.2027326
#> 5:        A     d2 banana     1   0.5 0.4054651 0.2027326
#> 6:        A     d1 banana     1   0.5 0.4054651 0.2027326
#> 7:        A     d2 cherry     1   0.5 1.0986123 0.5493061
#> 8:        B     d1    dog     1   0.5 0.4054651 0.2027326
#> 9:        B     d1  mouse     1   0.5 1.0986123 0.5493061