Merge keywords that share a common stem or lemma, and return the majority form of the word. This function receives a tidy table (data.frame) with a document ID column and a keyword column whose keywords are to be merged.

keyword_merge(
  dt,
  id = "id",
  keyword = "keyword",
  reduce_form = "lemma",
  lemmatize_dict = NULL,
  stem_lang = "porter"
)

Arguments

dt

A data.frame containing at least two columns with document ID and keyword.

id

Quoted character string specifying the column name of the document ID. Default uses "id".

keyword

Quoted character string specifying the column name of the keyword. Default uses "keyword".

reduce_form

Merge keywords with the same stem ("stem") or lemma ("lemma"). See Details. Default uses "lemma". Another, more advanced option is "partof": if a non-unigram (A) is part (a subset) of another non-unigram (B), then the longer one (B) is replaced by the shorter one (A).

lemmatize_dict

A dictionary of base terms and lemmas to use for replacement, applicable only when reduce_form takes "lemma". The first column should be the full word form in lower case, and the second column the corresponding replacement lemma. Default uses NULL, which applies the default dictionary used by the lemmatize_strings function. See the sketch after this argument list for an example of a custom dictionary.

stem_lang

The name of a recognized language. The list of supported languages can be found via getStemLanguages. Applicable when reduce_form takes "stem".
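
As an illustration of the dictionary format described under lemmatize_dict, the sketch below builds a small custom dictionary and passes it to keyword_merge. The words and column names are hypothetical, chosen only to match the two-column layout (word form, then lemma); this is a usage sketch, not package output.

library(akc)

# Hypothetical custom dictionary: first column holds full word forms in
# lower case, second column holds the replacement lemmas.
my_dict <- data.frame(
  token = c("libraries", "archives", "catalogues"),
  lemma = c("library", "archive", "catalogue"),
  stringsAsFactors = FALSE
)

bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "lemma", lemmatize_dict = my_dict)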

Value

A tbl, namely a tidy table with document ID and merged keyword columns.

Details

While keyword_clean provides a robust way to lemmatize keywords, the returned token might not be the most commonly used form. This function first gets the stem or lemma of every keyword using stem_strings or lemmatize_strings from the textstem package, then finds the most frequent form for each stem or lemma (if there is more than one, one is selected at random). Finally, every keyword is replaced by the most frequent keyword that shares the same stem or lemma.
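
For intuition, the following is a rough standalone sketch of that logic using dplyr and textstem; it is not the package's internal implementation, and the toy keywords are invented. Ties are broken by the first form here, whereas keyword_merge selects randomly.

library(dplyr)
library(textstem)

toy <- data.frame(
  id      = c(1, 2, 3, 4),
  keyword = c("social networks", "social network", "social network", "text mining"),
  stringsAsFactors = FALSE
)

toy %>%
  mutate(lemma = lemmatize_strings(keyword)) %>%          # lemma of every keyword
  group_by(lemma) %>%
  mutate(keyword = names(which.max(table(keyword)))) %>%  # most frequent surface form per lemma
  ungroup() %>%
  select(id, keyword)
# "social networks" is replaced by the more frequent form "social network"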

When reduce_form is set to "partof", then for non-unigrams in the same document, if one non-unigram is a subset of another, they are merged into the shorter one, which is considered more general (e.g. "time series" and "time series analysis" would be merged into "time series" if they co-occur in the same document). This reduces redundant information. It applies only to multi-word phrases, because applying it to single words would oversimplify the token and cause information loss (therefore, "time series" and "time" would not be merged into "time"). This is an advanced option that should be used with caution: it trades detailed information retention for generalization.
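
A usage sketch for the "partof" option, following the same pipeline as the examples below (output omitted here):

bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "partof")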

Examples

library(akc)

# \donttest{
bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "stem")
#> # A tibble: 5,372 × 2
#>       id keyword                      
#>    <int> <chr>                        
#>  1  1163 10.7202/1063788ar            
#>  2   619 18th century                 
#>  3  1154 1password                    
#>  4    81 1science                     
#>  5  1424 2016 us presidential election
#>  6    42 21st-century skills          
#>  7  1114 21st century skills          
#>  8  1051 24-hour opening              
#>  9   662 3d environment               
#> 10  1252 3d reconstruction            
#> # … with 5,362 more rows

bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "lemma")
#> # A tibble: 5,372 × 2
#>       id keyword                      
#>    <int> <chr>                        
#>  1  1163 10.7202/1063788ar            
#>  2   619 18th century                 
#>  3  1154 1password                    
#>  4    81 1science                     
#>  5   361 second-career librarianship  
#>  6   662 second life                  
#>  7  1424 2016 us presidential election
#>  8    42 21st-century skills          
#>  9  1114 21st century skills          
#> 10  1051 24-hour opening              
#> # … with 5,362 more rows
# }