When we have raw text such as abstracts or articles but no keywords, we may prefer to extract keywords first. The minimum data required is a data.frame with a document ID column and a raw text column, together with a user-defined dictionary. One could use the make_dict function to construct a dictionary from a character vector of vocabulary terms. If no dictionary is provided, the function returns all the ngram tokens without filtering (not recommended).

keyword_extract(
  dt,
  id = "id",
  text,
  dict = NULL,
  stopword = NULL,
  n_max = 4,
  n_min = 1
)

Arguments

dt

A data.frame containing at least two columns with document ID and text strings for extraction.

id

A quoted character string specifying the column name of the document ID. Default uses "id".

text

A quoted character string specifying the column name of the raw text for extraction.

dict

A data.table with two columns, namely "id" and "keyword" (with "keyword" set as the key), as exported by the make_dict function. Default uses NULL, which means the output keywords are not filtered by a dictionary (usually not recommended).

stopword

A vector containing the stop words to be used. Default uses NULL.

n_max

The maximum number of words in an n-gram. This must be an integer greater than or equal to 1. Default uses 4.

n_min

The minimum number of words in an n-gram. This must be an integer greater than or equal to 1, and less than or equal to n_max. Default uses 1.

Value

A data.frame (tibble) with two columns, namely the document ID and the extracted keyword.

Details

In the keyword extraction procedure of akc, the raw text is first split into independent clauses (namely, split at the punctuation marks [,;!?.]). Then the ngrams of each clause are extracted. Finally, only those phrases (ngrams) that appear in the dictionary created by the user (via make_dict) are retained. The user can also specify the range of n for the ngrams via n_min and n_max.
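The clause-splitting and ngram steps described above can be sketched in base R. This is an illustrative sketch only, not akc's actual data.table implementation:

```r
# Illustrative sketch of the splitting and ngram steps (not akc's real code)
text <- "Keyword extraction works well; it splits text into clauses, then ngrams."

# 1. Split into independent clauses at the punctuation [,;!?.]
clauses <- trimws(unlist(strsplit(text, "[,;!?.]")))
clauses <- clauses[nzchar(clauses)]

# 2. Extract n-grams of a given size from a vector of words
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit(tolower(clauses[2]), "\\s+")[[1]]
ngrams(words, 2)
#> [1] "it splits"    "splits text"  "text into"    "into clauses"
```

In akc itself, the retained ngrams would then be filtered against the dictionary; here only the tokenization logic is shown.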

This function can take some time if the sample size is large; it is suggested to use system.time on a subset first to estimate the running time. Nonetheless, it has been optimized with data.table code and has good performance on big data.
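A timing check along those lines might look as follows. The keyword_extract call is shown as a comment (it assumes the objects from the Examples section below); the timed expression here is just a placeholder workload demonstrating what system.time returns:

```r
# With akc loaded, you would time a small subset before the full corpus, e.g.:
#   system.time(
#     keyword_extract(head(bibli_data_table, 100),
#                     id = "id", text = "abstract", dict = my_dict)
#   )
# system.time() returns user/system/elapsed seconds for an expression:
timing <- system.time({
  x <- vapply(1:5000,
              function(i) paste(sample(letters, 4), collapse = ""),
              character(1))
})
timing["elapsed"]  # elapsed wall-clock seconds for the expression
```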

Examples


 library(akc)
 library(dplyr)

  bibli_data_table %>%
    keyword_clean(id = "id", keyword = "keyword") %>%
    pull(keyword) %>%
    make_dict() -> my_dict

  tidytext::stop_words %>%
    pull(word) %>%
    unique() -> my_stopword

 # \donttest{
  bibli_data_table %>%
    keyword_extract(id = "id", text = "abstract",
                    dict = my_dict, stopword = my_stopword)
#> Joining, by = "keyword"
#> # A tibble: 25,140 × 2
#>       id keyword                      
#>    <int> <chr>                        
#>  1   619 18th century                 
#>  2  1223 18th century                 
#>  3  1154 1password                    
#>  4    81 1science                     
#>  5   983 1science                     
#>  6    15 2016 us presidential election
#>  7   662 3d environment               
#>  8   624 55th library week            
#>  9   747 aasl standards               
#> 10   284 aboriginal                   
#> # … with 25,130 more rows
 # }