R/bind_tf_idf_dt.R
bind_tf_idf_dt.Rd
This function computes term frequency–inverse document frequency (tf–idf) for a dataset that has either one row per term occurrence or pre-counted term frequencies. It preserves the original column names and adds the following columns:
- `n`: raw count (computed or user-supplied)
- `tf`: term frequency per document
- `idf`: inverse document frequency per group (or per corpus when no group is given)
- `tf_idf`: tf × idf

If `group_col` is `NULL`, all documents are treated as a single group.
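The column definitions above can be sketched in base R. This is only an illustration, not the function's actual implementation: it assumes `tf` is the count divided by the total number of terms in the document and `idf` is the natural log of the number of documents divided by the number of documents containing the term. Those assumptions are consistent with the values printed in the examples for category "A". The variable names `docs`, `words`, `counts`, `n_docs`, and `doc_freq` are made up for this sketch.

```r
# Toy corpus: category "A" from the examples (one row per term occurrence).
docs  <- c("d1", "d2", "d3", "d1", "d2", "d3")
words <- c("apple", "banana", "apple", "banana", "cherry", "apple")

# n: occurrences of each (document, term) pair; drop pairs that never occur.
counts <- as.data.frame(table(doc_id = docs, word = words),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# tf (assumed definition): count divided by the document's total term count.
counts$tf <- counts$n / ave(counts$n, counts$doc_id, FUN = sum)

# idf (assumed definition): log(number of documents / documents with the term).
n_docs   <- length(unique(counts$doc_id))
doc_freq <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / doc_freq)

# tf_idf: product of the two.
counts$tf_idf <- counts$tf * counts$idf
counts
```

With a grouping column, the same calculation would presumably be repeated within each group, so that the document counts behind `idf` are taken per group rather than over the whole corpus.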
bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)
`.data`: A data.frame or data.table of text data.
`group_col`: Character name of grouping column, or `NULL` for no grouping.
`doc_col`: Character name of document identifier column.
`term_col`: Character name of term/word column.
`n_col`: (Optional) Character name of pre-counted term-frequency column. If `NULL` (default), counts are computed via `.N`.
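As a rough illustration of what counting via `.N` means, the sketch below uses data.table directly to turn one-row-per-occurrence data into one row per document and term. It is an assumption about the general approach, not a copy of the function's internals, and the object names `dt` and `counts` are hypothetical.

```r
library(data.table)

# One row per term occurrence, as in the category "A" example data.
dt <- data.table(
  doc_id = c("d1", "d2", "d3", "d1", "d2", "d3"),
  word   = c("apple", "banana", "apple", "banana", "cherry", "apple")
)

# .N is data.table's per-group row count, so this gives one row per
# (document, term) pair together with its raw count n.
counts <- dt[, .(n = .N), by = .(doc_id, word)]
counts
```

A table of this shape, which already carries a count column, is the kind of input for which passing `n_col = "n"` is intended.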
A data.table containing:
- Original grouping, document, and term columns
- `n`, `tf`, `idf`, and `tf_idf`
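# Assumption: the tidyfst package is attached; it is taken here to be the
# source of filter_dt(), count_dt() and the %>% pipe used below
library(tidyfst)
# With groups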
df <- data.frame(
  category = rep(c("A", "B"), each = 6),
  doc_id = rep(c("d1", "d2", "d3"), times = 4),
  word = c("apple", "banana", "apple", "banana", "cherry", "apple",
           "dog", "cat", "dog", "mouse", "cat", "dog"),
  stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result
#> category doc_id word n tf idf tf_idf
#> <char> <char> <char> <int> <num> <num> <num>
#> 1: A d1 apple 1 0.5 0.4054651 0.2027326
#> 2: A d2 banana 1 0.5 0.4054651 0.2027326
#> 3: A d3 apple 2 1.0 0.4054651 0.4054651
#> 4: A d1 banana 1 0.5 0.4054651 0.2027326
#> 5: A d2 cherry 1 0.5 1.0986123 0.5493061
#> 6: B d1 dog 1 0.5 0.4054651 0.2027326
#> 7: B d2 cat 2 1.0 1.0986123 1.0986123
#> 8: B d3 dog 2 1.0 0.4054651 0.4054651
#> 9: B d1 mouse 1 0.5 1.0986123 0.5493061
# Without groups
df %>%
  filter_dt(category == "A") %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word")
#> doc_id word n tf idf tf_idf
#> <char> <char> <int> <num> <num> <num>
#> 1: d1 apple 1 0.5 0.4054651 0.2027326
#> 2: d2 banana 1 0.5 0.4054651 0.2027326
#> 3: d3 apple 2 1.0 0.4054651 0.4054651
#> 4: d1 banana 1 0.5 0.4054651 0.2027326
#> 5: d2 cherry 1 0.5 1.0986123 0.5493061
# With counts provided
df %>%
  filter_dt(category == "A") %>%
  count_dt() %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word", n_col = "n")
#> doc_id word n tf idf tf_idf
#> <char> <char> <int> <num> <num> <num>
#> 1: d3 apple 2 1.0 0.4054651 0.4054651
#> 2: d1 apple 1 0.5 0.4054651 0.2027326
#> 3: d2 banana 1 0.5 0.4054651 0.2027326
#> 4: d1 banana 1 0.5 0.4054651 0.2027326
#> 5: d2 cherry 1 0.5 1.0986123 0.5493061
df %>%
  count_dt() %>%
  bind_tf_idf_dt(group_col = "category",
                 doc_col = "doc_id",
                 term_col = "word", n_col = "n")
#> category doc_id word n tf idf tf_idf
#> <char> <char> <char> <int> <num> <num> <num>
#> 1: A d3 apple 2 1.0 0.4054651 0.4054651
#> 2: B d2 cat 2 1.0 1.0986123 1.0986123
#> 3: B d3 dog 2 1.0 0.4054651 0.4054651
#> 4: A d1 apple 1 0.5 0.4054651 0.2027326
#> 5: A d2 banana 1 0.5 0.4054651 0.2027326
#> 6: A d1 banana 1 0.5 0.4054651 0.2027326
#> 7: A d2 cherry 1 0.5 1.0986123 0.5493061
#> 8: B d1 dog 1 0.5 0.4054651 0.2027326
#> 9: B d1 mouse 1 0.5 1.0986123 0.5493061