Carry out several keyword cleaning processes automatically and return a tidy table with document ID and keywords.
keyword_clean(
  df,
  id = "id",
  keyword = "keyword",
  sep = ";",
  rmParentheses = TRUE,
  rmNumber = TRUE,
  lemmatize = FALSE,
  lemmatize_dict = NULL
)
A data.frame containing at least two columns with document ID and keyword strings with separators.
Quoted characters specifying the column name of document ID. Default uses "id".
Quoted characters specifying the column name of keywords. Default uses "keyword".
Separator(s) of keywords. Default uses ";".
Remove the contents in the parentheses (including the parentheses) or not. Default uses TRUE.
Remove pure number sequences or not. Default uses TRUE.
Lemmatize the keywords or not. Lemmatization is supported by the `lemmatize_strings` function in the `textstem` package. Default uses FALSE.
A dictionary of base terms and lemmas to use for replacement. Only used when the lemmatize parameter is TRUE. The first column should be the full word form in lower case, while the second column is the corresponding replacement lemma. Default uses NULL, which applies the default dictionary used in the lemmatize_strings function.
A tbl with two columns, namely document ID and cleaned keywords.
The full cleaning process includes:
1. Split the text with separators;
2. Remove the contents in the parentheses (including the parentheses);
3. Remove white space from the start and end of each string and reduce repeated white space inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization.
Some of the procedures can be suppressed or activated by adjusting the parameters. The default setting does not use lemmatization; it is suggested to use keyword_merge to merge the keywords afterward.
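To make the order of the steps concrete, here is a rough base R sketch of steps 1 to 5. It is not the package's implementation: `clean_keywords_sketch` and its arguments `rm_parentheses` and `rm_number` are hypothetical names invented for this illustration, the separator is treated as a literal string, and lemmatization (step 6) is left out.

```r
# Hypothetical helper, NOT part of akc: mirrors the documented steps in base R.
clean_keywords_sketch <- function(df, id = "id", keyword = "keyword", sep = ";",
                                  rm_parentheses = TRUE, rm_number = TRUE) {
  # 1. Split each keyword string on the separator (assumed to be a literal string).
  parts <- strsplit(as.character(df[[keyword]]), sep, fixed = TRUE)
  out <- data.frame(
    id = rep(df[[id]], lengths(parts)),
    keyword = unlist(parts),
    stringsAsFactors = FALSE
  )
  # 2. Drop parenthesized content, parentheses included.
  if (rm_parentheses) out$keyword <- gsub("\\([^)]*\\)", "", out$keyword)
  # 3. Trim both ends, then collapse internal runs of white space.
  out$keyword <- gsub("\\s+", " ", trimws(out$keyword))
  # 4. Remove empty strings and, optionally, pure number sequences.
  keep <- nzchar(out$keyword)
  if (rm_number) keep <- keep & !grepl("^[0-9]+$", out$keyword)
  out <- out[keep, , drop = FALSE]
  # 5. Convert to lower case.
  out$keyword <- tolower(out$keyword)
  rownames(out) <- NULL
  out
}

# Made-up input for illustration only.
toy <- data.frame(
  id = c(1, 2),
  keyword = c("Public Libraries; Austerity (UK) ;2019",
              "Korea;;  Library   Legislation"),
  stringsAsFactors = FALSE
)
res <- clean_keywords_sketch(toy)
print(res)
```

With this toy input, the parenthesized "(UK)", the empty entry between the two semicolons, and the number-only entry "2019" are all dropped, leaving two keywords per document.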
library(akc)
bibli_data_table
#> # A tibble: 1,448 × 4
#>       id title                                                   keyword abstr…¹
#>    <int> <chr>                                                   <chr>   <chr>
#>  1     1 Keeping the doors open in an age of austerity? Qualita… Auster… "Engli…
#>  2     2 Comparison of Slovenian and Korean library laws         Compar… "This …
#>  3     3 Analysis of the factors affecting volunteering, satisf… Contin… "This …
#>  4     4 Redefining Library and Information Science education a… Curric… "The p…
#>  5     5 Can in-house use data of print collections shed new li… Check-… "Libra…
#>  6     6 Practices of community representatives in exploiting i… Commun… "The p…
#>  7     7 Exploring Becoming, Doing, and Relating within the inf… Librar… "Profe…
#>  8     8 Predictors of burnout in public library employees       Emotio… "Work …
#>  9     9 The Roma and documentary film: Considerations for coll… Academ… "This …
#> 10    10 Mediation effect of knowledge management on the relati… Job pe… "This …
#> # … with 1,438 more rows, and abbreviated variable name ¹abstract
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword")
#> # A tibble: 5,378 × 2
#>       id keyword
#>    <int> <chr>
#>  1     1 austerity
#>  2     1 community capacity
#>  3     1 library professional
#>  4     1 public libraries
#>  5     1 public service delivery
#>  6     1 volunteer relationship management
#>  7     1 volunteering
#>  8     2 comparative librarianship
#>  9     2 korea
#> 10     2 library legislation
#> # … with 5,368 more rows
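The optional steps can be toggled through the arguments described above. The following is a small sketch on made-up data: the data frame `toy` and its column names `doc` and `kw` are assumptions for illustration, and no output is shown because it was not run against the package. The calls are guarded so the snippet is skipped when akc is not installed.

```r
# Made-up input; the column names "doc" and "kw" are hypothetical.
toy <- data.frame(
  doc = c(1, 2),
  kw = c("Public Libraries; Austerity (UK); 2019",
         "Korea; Library Legislation"),
  stringsAsFactors = FALSE
)

if (requireNamespace("akc", quietly = TRUE)) {
  library(akc)

  # Keep parenthesized content and number-only keywords.
  print(keyword_clean(toy, id = "doc", keyword = "kw", sep = ";",
                      rmParentheses = FALSE, rmNumber = FALSE))

  # Turn on lemmatization with the default dictionary (lemmatize_dict = NULL).
  print(keyword_clean(toy, id = "doc", keyword = "kw", lemmatize = TRUE))
}
```

Passing the column names through `id` and `keyword` avoids having to rename the input columns beforehand.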