Title: | An Automated Cleaning Tool for Semantic and Linguistic Data |
---|---|
Description: | Implements several functions that automate the cleaning and spell-checking of text data. Also converges, finalizes, removes plurals and continuous strings, and puts text data in binary format for semantic network analysis. Uses the 'SemNetDictionaries' package to make the cleaning process more accurate, efficient, and reproducible. |
Authors: | Alexander P. Christensen [aut, cre] |
Maintainer: | Alexander P. Christensen <[email protected]> |
License: | GPL (>= 3.0) |
Version: | 1.3.6 |
Built: | 2024-11-02 04:13:14 UTC |
Source: | https://github.com/alexchristensen/semnetcleaner |
Implements several functions that automate the cleaning and spell-checking of text data. Also converges, finalizes, removes plurals and continuous strings, and puts text data in binary format for semantic network analysis. Uses the SemNetDictionaries package to make the cleaning process more accurate, efficient, and reproducible.
Alexander Christensen <[email protected]>
Useful links:
Report bugs at https://github.com/AlexChristensen/SemNetCleaner/issues
A wrapper function to determine whether responses are good or bad.
Bad responses are replaced with missing (NA). Good responses are returned.
bad.response(word, ...)
word | Character. A word to be tested for whether it is bad
... | Vector. Additional responses to be considered bad
If the response is bad, then NA is returned. If the response is valid, then the response is returned.
Alexander Christensen <[email protected]>
# Bad response
bad.response(word = " ")

# Good response
bad.response(word = "hello")

# Make a good response bad
bad.response(word = "hello", "hello")

# Add additional bad responses
bad.response(word = "hello", c("hello", "world"))
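The screening logic above can be sketched in a few lines. This is a minimal Python illustration, not the package's R implementation; `bad_response` here is a hypothetical stand-in that treats blank or explicitly listed responses as missing:

```python
def bad_response(word, *bad):
    """Return None (missing) for blank or listed-as-bad responses;
    otherwise return the response unchanged. Illustrative sketch only."""
    # Collect additional bad responses, flattening any vectors passed in
    extra_bad = set()
    for b in bad:
        if isinstance(b, (list, tuple, set)):
            extra_bad.update(b)
        else:
            extra_bad.add(b)
    # Blank/whitespace-only or listed responses count as "bad"
    if word is None or str(word).strip() == "" or word in extra_bad:
        return None
    return word
```

This mirrors the R examples: a whitespace-only response is replaced with missing, while a valid response passes through unless it was added to the bad list.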
A wrapper function for the best guess of a spelling mistake based on the letters, the ordering of those letters, and the potential for letters to be interchanged. The Damerau-Levenshtein distance is used to guide inferences into what word the participant was trying to spell from a dictionary (see SemNetDictionaries).
best.guess(word, full.dictionary, dictionary = NULL, tolerance = 1)
word | Character. A word to get best guess spelling options from the dictionary
full.dictionary | Character vector. The dictionary to search for best guesses in. See SemNetDictionaries
dictionary | Character. A dictionary from SemNetDictionaries
tolerance | Numeric. The distance tolerance set for automatic spell-correction purposes. Unique words (i.e., n = 1) that are within the (distance) tolerance are automatically output as best guess responses. This default is based on Damerau's (1964) proclamation that more than 80% of all human misspellings can be expressed by a single error (e.g., insertion, deletion, substitution, and transposition). If there is more than one word within or below the distance tolerance, then these will be provided as potential options. The recommended and default distance tolerance is 1
The best guess(es) of the word
Alexander Christensen <[email protected]>
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7, 171-176.
# Misspelled "bombay"
best.guess("bomba", full.dictionary = SemNetDictionaries::animals.dictionary)
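The Damerau-Levenshtein distance that guides these guesses can be sketched as a standard dynamic program. The following is an illustrative Python implementation of the restricted (optimal string alignment) variant, not the package's internal code:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    counts insertions, deletions, substitutions, and adjacent transpositions."""
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition (e.g., "teh" -> "the")
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

Under this distance, "bomba" is one edit (a single insertion) away from "bombay", which is why it falls within the default tolerance of 1.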
Converts the binary response matrix into characters for each participant
bin2resp(rmat, to.data.frame = FALSE)
rmat | Binary matrix. A binarized response matrix of verbal fluency or linguistic data
to.data.frame | Boolean. Should output be a data frame where participants are columns? Defaults to FALSE
A list containing objects for each participant and their responses
Alexander Christensen <[email protected]>
# Toy example
raw <- open.animals[c(1:10), -c(1:3)]

if(interactive()) {
  # Clean and preprocess data
  clean <- textcleaner(open.animals[,-c(1:2)], partBY = "row", dictionary = "animals")

  # Change binary response matrix to word response matrix
  charmat <- bin2resp(clean$responses$binary)
}
Converts a textcleaner object to a SNAFU GUI format (only works for fluency data)
convert2snafu(..., category)
... | Matrix or data frame. Clean response matrices
category | Character. Category of verbal fluency data
The format of the file has 7 columns:

id | Defaults to the row names of the inputted data
listnum | The list number for the fluency category. Defaults to 0. Future implementations will allow more lists
category | The verbal fluency category that is input into the category argument
item | The verbal fluency responses for every participant
RT | Response time. Currently not implemented. Defaults to 0
RTstart | Start of response time. Currently not implemented. Defaults to 0
group | Names of groups. Defaults to the names of the objects input into the function (...)
A .csv file formatted for SNAFU
Alexander Christensen <[email protected]>
# For SNAFU, see: Zemla, J. C., Cao, K., Mueller, K. D., & Austerweil, J. L. (2020). SNAFU: The Semantic Network and Fluency Utility. Behavior Research Methods, 1-19. https://doi.org/10.3758/s13428-019-01343-w
# Convert data to SNAFU
if(interactive()) {
  convert2snafu(open.clean, category = "animals")
}
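The seven-column layout described above can be sketched as a simple row builder. This is an illustrative Python sketch, not the package's R code; the function name `to_snafu_rows` and the default group label are hypothetical, and the column order follows the documentation above:

```python
def to_snafu_rows(responses, category, group="open.clean"):
    """Build SNAFU-style rows from a dict mapping participant id
    to a list of fluency responses. RT fields default to 0 as documented."""
    rows = [["id", "listnum", "category", "item", "RT", "RTstart", "group"]]
    for pid, items in responses.items():
        for item in items:
            # listnum defaults to 0; one list per category for now
            rows.append([pid, 0, category, item, 0, 0, group])
    return rows
```

Writing these rows out with a CSV writer would yield a file with one row per participant-response pair, matching the format SNAFU expects.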
A vector corresponding to the frequency of letters across 40,000 words. Retrieved from: http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
data(letter.freq)
letter.freq (26-element numeric vector)
data("letter.freq")
Raw Animals verbal fluency data (n = 516) from Christensen et al. (2018).
data(open.animals)
open.animals (matrix 516 x 38)
First column is a grouping variable ("Group") with 1 corresponding to low openness to experience and 2 to high openness to experience.

Second column is the latent variable of openness to experience with Intellect items removed (see Christensen et al., 2018 for more details).

Third column is the ID variable for each participant.

Columns 4-38 are raw fluency data.
Christensen, A. P., Kenett, Y. N., Cotter, K. N., Beaty, R. E., & Silvia, P. J. (2018). Remotely close associations: Openness to experience and semantic memory structure. European Journal of Personality, 32, 480-492.
data("open.animals")
Cleaned response matrices for the Animals verbal fluency data (n = 516) from Christensen et al. (2018).
data(open.clean)
open.clean (matrix, 516 x 35)
Christensen, A. P., Kenett, Y. N., Cotter, K. N., Beaty, R. E., & Silvia, P. J. (2018). Remotely close associations: Openness to experience and semantic memory structure. European Journal of Personality, 32, 480-492.
data("open.clean")
textcleaner Object (Openness and Verbal Fluency): Preprocessed textcleaner object for the Animals verbal fluency data (n = 516) from Christensen and Kenett (2020).
data(open.preprocess)
open.preprocess (list, length = 4)
Christensen, A. P., & Kenett, Y. N. (2020). Semantic network analysis (SemNA): A tutorial on preprocessing, estimating, and analyzing semantic networks. PsyArXiv.
data("open.preprocess")
A function to change words to their plural form. The rules for converting words to their plural forms are based on the grammar rules found here: https://www.grammarly.com/blog/plural-nouns/. This function handles most special cases and some irregular cases (see examples) but caution is necessary. If no plural form is identified, then the original word is returned.
pluralize(word)
word | Character. A word
Returns the word in plural form, unless a plural form could not be found (then the original word is returned)
Alexander Christensen <[email protected]>
# Handles any prototypical cases
pluralize("dog")     # "dogs"
pluralize("fox")     # "foxes"
pluralize("wolf")    # "wolves"
pluralize("octopus") # "octopi"
pluralize("taxon")   # "taxa"

# And most special cases:
pluralize("wife")    # "wives"
pluralize("roof")    # "roofs"
pluralize("photo")   # "photos"

# And some irregular cases:
pluralize("child")   # "children"
pluralize("tooth")   # "teeth"
pluralize("mouse")   # "mice"
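The rule-plus-exceptions approach can be sketched as follows. This is a simplified Python illustration, not the package's R implementation: the irregular lookup table is deliberately tiny, and the real function handles many more exceptions (e.g., Latin endings like "octopus" and f-ending exceptions like "roof"):

```python
# A small sample of irregular forms; the real function covers far more
IRREGULAR = {"child": "children", "tooth": "teeth", "mouse": "mice",
             "foot": "feet", "person": "people"}

def pluralize(word):
    """Apply common English pluralization rules, irregular forms first.
    Simplified sketch: falls back to adding 's' if no rule matches."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"                  # fox -> foxes
    if word.endswith("y") and word[-2] not in "aeiou":
        return word[:-1] + "ies"            # city -> cities
    if word.endswith("f"):
        return word[:-1] + "ves"            # wolf -> wolves
    if word.endswith("fe"):
        return word[:-2] + "ves"            # wife -> wives
    return word + "s"                       # dog -> dogs
```

Rule order matters: the irregular lookup must come first so "tooth" does not fall through to the default "s" rule.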
Computes QWERTY Distance for words that have the same number of characters. Distance is computed based on the number of keys a character is away from another character on a QWERTY keyboard
qwerty.dist(wordA, wordB)
wordA | Character vector. Word to be compared
wordB | Character vector. Word to be compared

Numeric value for the distance between wordA and wordB
Alexander Christensen <[email protected]>
# Identical values for Damerau-Levenshtein
stringdist::stringdist("big", "pig", method = "dl")
stringdist::stringdist("big", "bug", method = "dl")

# Different distances for QWERTY
qwerty.dist("big", "pig")
qwerty.dist("big", "bug") # Probably meant to type "bug"
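The idea of measuring key distance on the keyboard can be sketched with a small coordinate table. This is an illustrative Python sketch under stated assumptions (letters only, row stagger ignored, Chebyshev distance between keys); the package's actual key-distance weighting may differ:

```python
# Approximate QWERTY letter layout as (row, column) coordinates
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def qwerty_dist(word_a, word_b):
    """Sum of per-position key distances between two equal-length words.
    Each character pair contributes the Chebyshev distance between keys."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same number of characters")
    total = 0
    for a, b in zip(word_a.lower(), word_b.lower()):
        (ra, ca), (rb, cb) = POS[a], POS[b]
        total += max(abs(ra - rb), abs(ca - cb))
    return total
```

This reproduces the intuition in the example above: "u" sits right next to "i", so "bug" is far closer to "big" on the keyboard than "pig" is, even though both are one edit away under Damerau-Levenshtein.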
A single function to read in common data file extensions. Note that this function is specialized for reading in text data in the format necessary for functions in SemNetCleaner.
File extensions supported:
.Rdata
.rds
.csv
.xlsx
.xls
.sav
.txt
.mat
.dat
read.data(file = file.choose(), header = TRUE, sep = ",", ...)
file | Character. A path to the file to load. Defaults to interactive file selection using file.choose()
header | Boolean. A logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format. Defaults to TRUE
sep | Character. The field separator character. Values on each line of the file are separated by this character. Defaults to ","
... | Additional arguments. Allows additional arguments to be passed on to the respective read functions (see their documentation)
A data frame containing a representation of the data in the file. If file extension is ".Rdata", then data will be read to the global environment
Alexander Christensen <[email protected]>
# R Core Team
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
# readxl
Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
# R.matlab
Henrik Bengtsson (2018). R.matlab: Read and Write MAT Files and Call MATLAB from Within R. R package version 3.6.2. https://CRAN.R-project.org/package=R.matlab
# Use this example for your data
if(interactive()) {
  read.data()
}

# Example for CRAN tests
## Create test data
test1 <- c(1:5, "6,7", "8,9,10")

## Path to temporary file
tf <- tempfile()

## Create test file
writeLines(test1, tf)

## Read in data
read.data(tf)

# See documentation of respective R functions for specific examples
Converts a response matrix to a binary response matrix
resp2bin(resp)
resp | Response matrix. A response matrix of verbal fluency or linguistic data
A list containing objects for each participant and their responses
Alexander Christensen <[email protected]>
# Toy example
raw <- open.animals[c(1:10), -c(1:3)]

if(interactive()) {
  # Clean and preprocess data
  clean <- textcleaner(open.animals[,-c(1:2)], partBY = "row", dictionary = "animals")

  # Change response matrix to binary response matrix
  binmat <- resp2bin(clean$responses$corrected)
}
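The binarization step itself can be sketched in a few lines: collect the unique responses across participants, then mark which participant gave which response. This is an illustrative Python sketch of the idea, not the package's R code:

```python
def resp_to_binary(responses):
    """responses: list of per-participant response lists.
    Returns (unique_words, binary) where binary[i][j] == 1 iff
    participant i gave unique_words[j]. Empty responses are skipped."""
    # Unique responses across all participants become the columns
    unique = sorted({w for resp in responses for w in resp if w})
    index = {w: j for j, w in enumerate(unique)}
    # One row per participant, one column per unique response
    binary = [[0] * len(unique) for _ in responses]
    for i, resp in enumerate(responses):
        for w in resp:
            if w:
                binary[i][index[w]] = 1
    return unique, binary
```

bin2resp (documented above) performs the inverse mapping, recovering each participant's word list from the binary matrix.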
A function to change words to their singular form. The rules for converting words to their singular forms are based on the inverse of the grammar rules found here: https://www.grammarly.com/blog/plural-nouns/. This function handles most special cases and some irregular cases (see examples) but caution is necessary. If no singular form is identified, then the original word is returned.
singularize(word, dictionary = TRUE)
word | Character. A word
dictionary | Boolean. Should a dictionary be used to verify the word exists? Defaults to TRUE
Returns the word in singular form, unless a singular form could not be found (then the original word is returned)
Alexander Christensen <[email protected]>
# Handles any prototypical cases
singularize("dogs")    # "dog"
singularize("foxes")   # "fox"
singularize("wolves")  # "wolf"
singularize("octopi")  # "octopus"
singularize("taxa")    # "taxon"

# And most special cases:
singularize("wives")   # "wife"
singularize("fezzes")  # "fez"
singularize("roofs")   # "roof"
singularize("photos")  # "photo"

# And some irregular cases:
singularize("children") # "child"
singularize("teeth")    # "tooth"
singularize("mice")     # "mouse"
An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data
textcleaner(
  data = NULL,
  type = c("fluency", "free"),
  miss = 99,
  partBY = c("row", "col"),
  dictionary = NULL,
  spelling = c("UK", "US"),
  add.path = NULL,
  keepStrings = FALSE,
  allowPunctuations,
  allowNumbers = FALSE,
  lowercase = TRUE,
  keepLength = NULL,
  keepCue = FALSE,
  continue = NULL
)
data | Matrix or data frame. Text data to be cleaned. Defaults to NULL
type | Character vector. Type of task to be preprocessed: "fluency" or "free"
miss | Numeric or character. Value for missing data. Defaults to 99
partBY | Character. Are participants by row or column? Set to "row" or "col"
dictionary | Character vector. Can be a vector of a corpus or any text for comparison. Dictionary to be used for more efficient text cleaning. Defaults to NULL
spelling | Character vector. English spelling to be used: "UK" or "US"
add.path | Character. Path to additional dictionaries to be found. DOES NOT search recursively (through all folders in path) to avoid a time-intensive search. Defaults to NULL
keepStrings | Boolean. Should strings be retained or separated? Defaults to FALSE
allowPunctuations | Character vector. Allows punctuation characters to be included in responses
allowNumbers | Boolean. Should numbers be allowed in responses? Defaults to FALSE
lowercase | Boolean. Should words be converted to lowercase? Defaults to TRUE
keepLength | Numeric. Maximum number of words allowed in a response. Defaults to NULL
keepCue | Boolean. Should cue words be retained in the responses? Defaults to FALSE
continue | List. A previously unfinished result that still needs to be completed. Allows you to continue manually spell-checking your data after you've closed or errored out. Defaults to NULL
This function returns a list containing the following objects:

binary | A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is a 1; otherwise 0
responses | A list containing two objects
spellcheck | A list containing three objects
removed | A list containing two objects
partChanges | A list where each participant is a list index with each response that was changed. Participants are identified by their ID
Alexander Christensen <[email protected]>
Christensen, A. P., & Kenett, Y. N. (in press). Semantic network analysis (SemNA): A tutorial on preprocessing, estimating, and analyzing semantic networks. Psychological Methods.
Hornik, K., & Murdoch, D. (2010). Watch your spelling! The R Journal, 3, 22-28.
# Toy example
raw <- open.animals[c(1:10), -c(1:3)]

if(interactive()) {
  # Full test
  clean <- textcleaner(open.animals[,-c(1,2)], partBY = "row", dictionary = "animals")
}