The SemNetCleaner package houses several functions for cleaning and preprocessing semantic data. Its purpose is to make this preprocessing efficient and reproducible. Other R packages, such as hunspell (Ooms, 2018), qdap (Rinker, 2019), and tm (Feinerer, Hornik, & Meyer, 2008), perform similar functions (e.g., spell-checking, text mining). However, SemNetCleaner sets itself apart from these packages by focusing specifically on tasks commonly used in SemNA (e.g., verbal fluency), which allows for greater automation of data cleaning and preprocessing.
The SemNetCleaner package applies several steps to preprocess raw verbal fluency data so that it is ready to be used for estimating semantic networks. These steps include spell-checking, verifying the accuracy of the spell-check, and obtaining a binary response matrix for network estimation. To initialize this process, the following code must be run:
# Run 'textcleaner'
clean <- textcleaner(data = open.animals[,-c(1:2)], miss = 99,
                     partBY = "row", dictionary = "animals")
textcleaner is the main function that handles data cleaning and preprocessing in SemNetCleaner (for argument descriptions, see Table 2). For the data argument, it’s strongly recommended that the user input the full verbal fluency dataset rather than data already separated into groups. If verbal fluency responses are already separated, then each group will need to be input and preprocessed separately. It’s therefore preferable to separate the preprocessed data into groups at a later stage of the SemNA pipeline.
Table 2. textcleaner Arguments
Argument | Description |
---|---|
data | A matrix or data frame object that contains the participants’ IDs and semantic data |
miss | A number or character that corresponds to the symbol used for missing data. The default is set to 99 |
partBY | Specifies whether participants are across the rows ("row") or down the columns ("col") |
dictionary | Specifies which dictionaries from SemNetDictionaries should be used (more than one is possible). If no dictionary is chosen, then the "general" dictionary is used |
tolerance | Enables automated spell-checking using the Damerau-Levenshtein distance (defaults to 1) |
When running the above code, textcleaner will start preprocessing the data immediately. The reader may notice that a progress bar appears, which shows how many words have been processed out of the total number of words that need to be processed. The progress bar should read “10 of 269 words done”, meaning that textcleaner has already automatically processed several words. Before continuing with the tutorial, we describe how the automatic spell-check operations of textcleaner work and then continue the tutorial with the manual spell-check operation.
The first step of textcleaner is to spell-check all responses. The spell-checking algorithm of textcleaner uses automatic and manual spell-checking processes in parallel. First, missing values (e.g., NA), punctuation, digits, and extra white spaces are removed from each response in the raw verbal fluency data. From these responses, only the unique responses across participants are obtained, which are used as input into the spell-checking algorithm. Although these unique responses include responses that are misspelled, they drastically reduce the number of responses that textcleaner needs to spell-check.
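The cleaning pass described above can be sketched in base R. This is only an illustration under our own assumptions: the toy responses and the helper clean_one are hypothetical, and textcleaner’s internal code may differ.

```r
# Toy raw responses (hypothetical data, not taken from open.animals)
raw <- c("Dog ", "cat!!", "  guinea   pig", "dog", "99", NA, "hor5se", "CAT")

# Hypothetical helper that mimics the cleaning steps named in the text
clean_one <- function(x) {
  x <- tolower(x)                   # make responses lowercase
  x <- gsub("[[:punct:]]", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]", "", x)   # remove digits
  x <- gsub("\\s+", " ", x)         # collapse repeated white space
  trimws(x)                         # strip leading and trailing white space
}

cleaned <- clean_one(raw[!is.na(raw)])   # drop missing values first
cleaned <- cleaned[cleaned != ""]        # drop responses that became empty
unique_resp <- unique(cleaned)           # keep each response only once
unique_resp
```

Eight raw entries reduce to four unique responses here, which is the kind of reduction that keeps the number of spell-check decisions manageable.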
Next, these unique responses are checked against a dictionary and its associated monikers (only if it’s a dictionary from SemNetDictionaries) and replaced with a homogenized name (e.g., grizzly → grizzly bear). In this process, responses are checked against their plural and singular forms to further expedite the identification of correctly spelled responses. Responses that are matched with their plural form are converted to their singular form.
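A minimal sketch of this matching step, assuming a hypothetical four-word dictionary, a hypothetical moniker table, and a naive rule that strips a trailing “s” to form the singular (irregular plurals such as lice would need more than this):

```r
# Hypothetical mini-dictionary and moniker table (illustrative only)
dictionary <- c("cat", "dog", "grizzly bear", "horse")
monikers   <- c(grizzly = "grizzly bear", kitty = "cat")

match_response <- function(resp) {
  if (resp %in% dictionary) return(resp)                    # exact match
  if (resp %in% names(monikers)) return(monikers[[resp]])   # moniker to homogenized name
  singular <- sub("s$", "", resp)                           # naive plural stripping
  if (singular %in% dictionary) return(singular)            # plural converted to singular
  NA_character_                                             # unmatched: forwarded to spell-check
}

matched <- vapply(c("dogs", "grizzly", "horse", "catdog"),
                  match_response, character(1), USE.NAMES = FALSE)
matched
```

Only the unmatched response (catdog, returned as NA) would move on to the spell-check algorithm.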
The unique responses that have not been matched in this process are then forwarded, one-by-one, to the spell-check algorithm. The algorithm will first attempt to auto-correct the response. If it cannot be auto-corrected, then the response is passed onto the manual portion of the algorithm. This process is repeated for each unique response entered into the spell-check algorithm. We first describe how a response gets auto-corrected in the automated spell-check and then we describe the manual spell-check for a response that could not be auto-corrected.
There are two auto-correct operations in the automated portion of the algorithm. The first auto-correct operation computes the Damerau-Levenshtein (DL) distance (Damerau, 1964; Levenshtein, 1966), a method to compute the edit distance (i.e., the (dis)similarity of two words), to determine how similar a given response is to every response in the dictionary. This computation is done by counting the number of errors—insertion (i.e., adding a letter), deletion (i.e., removing a letter), substitution (i.e., exchanging one letter for an incorrect letter), and transposition (i.e., changing the position of two adjacent letters)—that are made between the target word and potential response from the dictionary.
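To make the four error types concrete, here is a base-R sketch of the restricted form of the DL distance (often called optimal string alignment). The function name dl_distance is our own, and we do not claim this is the implementation used inside SemNetCleaner.

```r
# Restricted Damerau-Levenshtein (optimal string alignment) distance
dl_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] is the distance between the first i letters of a
  # and the first j letters of b
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(a[i] != b[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1,   # deletion
        d[i + 1, j] + 1,   # insertion
        d[i, j] + cost     # substitution (or exact match)
      )
      # transposition of two adjacent letters
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
      }
    }
  }
  d[n + 1, m + 1]
}

dl_distance("giraff", "giraffe")  # 1 (one insertion)
dl_distance("fidh", "fish")       # 1 (one substitution)
dl_distance("loin", "lion")       # 1 (one transposition)
```

Without the transposition step, loin and lion would be two edits apart rather than one.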
Notably, Damerau (1964) states that the majority of spelling errors (more than 80%) are made with only one of these errors. Based on this finding, the auto-correct operation in textcleaner can be set (using the tolerance argument, see Table 2) to automatically correct an incorrect (or inappropriate) response when the DL distance is less than or equal to the given tolerance value (e.g., one). The tolerance value is used as a criterion for how close a response in the dictionary must be to the original response before it is auto-corrected. These values are integers that range anywhere from 1 to infinity. The default value is 1, following Damerau’s (1964) observation. Values greater than 1 provide a less strict criterion; however, this may increase the number of incorrect corrections made by the automated portion of the algorithm. If more than one response in the dictionary has a DL distance that is less than or equal to the tolerance value, then they are passed onto a second auto-correct operation.
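The role of the tolerance value can be illustrated with a short sketch. Two assumptions apply: the dictionary below is a hypothetical five-word stand-in, and utils::adist computes the plain Levenshtein distance (no transposition), so it only approximates the DL distance.

```r
# Hypothetical mini-dictionary; the real one comes from SemNetDictionaries
dictionary <- c("mole", "moose", "mouse", "horse", "giraffe")

# Hypothetical helper: dictionary entries within 'tolerance' edits of 'response'
candidates_within <- function(response, dictionary, tolerance = 1) {
  # adist() is plain Levenshtein distance, used as a base-R stand-in
  d <- drop(utils::adist(response, dictionary))
  dictionary[d <= tolerance]
}

candidates_within("giraff", dictionary)  # a single candidate: can be auto-corrected
candidates_within("mose", dictionary)    # several candidates: needs a tie-break
```

With a tolerance of 1, giraff has exactly one candidate, whereas mose has three and would be handed to the second auto-correct operation.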
The second auto-correct operation checks for spelling errors that may have been due to erroneous keystrokes on a QWERTY keyboard, using the so-called QWERTY distance. This distance is computed by summing the physical distance (i.e., number of keys) between the letters in the response and the letters in the responses passed on from the first auto-correct operation. The letter “f”, for example, has a distance of one from the letters “d”, “e”, “r”, “t”, “g”, “v”, and “c”. This second auto-correct operation will automatically correct an incorrect (or inappropriate) response when the distance is less than or equal to the same tolerance value as the DL distance. If no response or more than one response is less than or equal to the tolerance value, then the response is passed onto the manual spell-check.
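One way to picture the QWERTY distance is the sketch below. The key coordinates, the half-key shift of the bottom row, and the restriction to words of equal length are all our own assumptions, chosen so that the “f” example in the text holds; the package may compute the distance differently.

```r
# Key coordinates on a simplified QWERTY grid. The half-key shift of the
# bottom row is an assumption chosen to reproduce the "f" example.
rows   <- c("qwertyuiop", "asdfghjkl", "zxcvbnm")
offset <- c(0, 0, 0.5)
keys <- do.call(rbind, lapply(seq_along(rows), function(r) {
  letters_r <- strsplit(rows[r], "")[[1]]
  data.frame(key = letters_r, x = seq_along(letters_r) + offset[r], y = r,
             stringsAsFactors = FALSE)
}))

# Distance between two keys: the larger of the horizontal and vertical gaps
key_distance <- function(k1, k2) {
  p1 <- keys[keys$key == k1, ]
  p2 <- keys[keys$key == k2, ]
  max(abs(p1$x - p2$x), abs(p1$y - p2$y))
}

# Sum of letter-by-letter key distances (equal-length words only)
qwerty_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  if (length(a) != length(b)) return(NA_real_)
  sum(mapply(key_distance, a, b))
}

qwerty_distance("fidh", "fish")  # 1: "d" and "s" are neighboring keys
qwerty_distance("loin", "loon")  # 1
qwerty_distance("loin", "lion")  # 2
```

Under these assumed coordinates, loin is closer to loon than to lion, which shows how this operation can break a tie between candidates that are equally close in DL distance.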
Because the automated spell-check occurs prior to manual spell-check for each response, the user will only receive manual spell-check prompts for responses that could not be auto-corrected. The manual spell-checking operation allows the user to self-select the appropriate correction by choosing one of several response options from an interactive menu. Our tutorial will cover an example of each response option in the interactive menu. After running the above textcleaner code on the open.animals data, an interactive menu appears that allows the reader to correct an incorrectly spelled word (Figure 2; for figures, see Christensen & Kenett (2020)).
The first prompt contains a continuous string (i.e., multiple responses entered as a single response): turtle <<catdog>> elephant fish bird squiral rabbit fox deer monkey giraff. The target response that needs a decision is denoted between << and >> (in this example: catdog). Under the continuous string, the reader will find response options (denoted by Potential responses:) to manually correct the response. The first 10 response options are the responses in the dictionary that had the lowest DL distance with the target response. The next six response options are additional options that give the user greater flexibility in correcting responses. These six additional response options are defined in Table 3.
Table 3. Additional Response Options
Option | Description |
---|---|
11:ADD TO DICTIONARY | Allows user to add the response to a temporary appendix dictionary |
12:TYPE MY OWN | Allows user to type their own response if it is not provided in the potential response options (if necessary, multiple responses can be typed using spaces) |
13:GOOGLE IT | Opens the user’s default internet browser to Google’s webpage and searches for a definition of the original response using the terms: dictionary ‘RESPONSE’ |
14:BAD RESPONSE | Marks the original response as bad and makes it so the response will be missing (i.e., NA) and not included in the final results |
15:SKIP | Allows the original response to be included in the final results but does not add it to the temporary appendix dictionary |
16:CONTEXT | (Single responses only) Provides the target response in context of the participant’s other responses. Will print each participant’s responses that contain the target response |
16:BAD STRING | (Continuous strings only) Marks the entire continuous string of responses as bad and makes all responses missing (i.e., NA) and not included in the final results |
For the target response in this example (i.e., catdog), the participant likely intended to type cat and dog as separate responses. When examining the offered responses for correction (options 1-10), the reader may notice that cat and dog are listed but none of the response options separates the response into cat and dog. To do this, we can use one of the additional response options: 12:TYPE MY OWN. This response option acts as a catch-all option that enables the user to type the response that should replace the original response.
To select this response option, the reader can type 12 and press ENTER. Next, the reader will be prompted with Type response:. Here, the reader can type their correction (without quotations) of what word(s) should replace the original response. The response cat dog should be typed and the reader can press ENTER (Figure 3).
This completes the first prompt and moves the reader to the second prompt. The second prompt is another continuous string: dog cat horse <<guinea>> pig rooster bird fish mouse rat owl (Figure 4). For this continuous string, all words are actually spelled correctly; however, because textcleaner sifts through each response word-by-word, it has stopped at the word guinea, which is not in the dictionary. Based on the participant’s next word, pig, they likely meant to type guinea pig. The reader should not correct the response; instead, textcleaner will handle this when parsing the continuous string. As a general rule of thumb, the reader should always focus on the target word when making a correction. Because guinea is spelled correctly (see the footnote at the end of this section), the reader can use the 15:SKIP option, which will keep the word “as is”, by pressing 15 and then ENTER. textcleaner will remember this choice the next time it encounters the word guinea and will no longer prompt the user for a correction.
Next, the reader will be prompted when textcleaner attempts to parse the response (Figure 4). Here, the reader should decide whether guinea and pig should be combined into a single response or remain separated as two responses. The response should be combined into a single response, guinea pig, so the response 1:combined:'guinea pig' should be selected by pressing 1 and then ENTER. The next prompt is another “combine or separate” response option for "bat cat dog sheep". With this prompt, the reader can press 2 for 2:separated:'bat' 'cat' 'dog' 'sheep' and ENTER to separate the string into individual responses.
For most responses, the offered options make it fairly easy to determine the word the participant intended; however, there are instances where it’s impossible to know exactly what the participant intended. An example of this is in the next prompt: dog cat <<mose>> moose horse lion tiger bear dear doe pig cow (Figure 5). Here, the first three response options (1:mole, 2:moose, and 3:mouse) all appear plausible at first glance. On the one hand, it’s unlikely the participant intended to type mole because the “s” and “l” keys are quite distant from one another. On the other hand, it’s hard to know whether the participant intended to type moose or mouse. The next response in the string is moose, which could mean that the participant attempted to correct their initial response; however, there is no way of knowing for certain. In these instances, our recommendation is to err on the conservative side, that is, to not include the response in the final results. To do so, the user can type 14 for 14:BAD RESPONSE and press ENTER, which will remove the response from the final results.
The reader can continue through the next ten prompts using the response options that have been covered. Below is a table with the response options we selected for these prompts:
Table 4. Responses for next ten prompts
Prompt | Selection | Type My Own |
---|---|---|
creatures | 14:BAD RESPONSE | — |
catefrog | 12:TYPE MY OWN | cat frog |
criters | 14:BAD RESPONSE | — |
mario | 14:BAD RESPONSE | — |
garafi | 14:BAD RESPONSE | — |
snack | 14:BAD RESPONSE | — |
girrage | 14:BAD RESPONSE | — |
<<gieuna>> pig | 12:TYPE MY OWN | guinea |
jesus | 14:BAD RESPONSE | — |
squrill | 5:squirrel | — |
After going through these prompts, the reader will arrive at the prompt: <<your>> mom. Sometimes all responses in a prompt will be inappropriate for the category, like your, mom, and the string your mom. In these instances, the user can select 16 and press ENTER for the response option 16:BAD STRING. This response option will remove all responses in the string from the final results (Table 3). The next two prompts, <<geaniu>> pig and dinasor, can be corrected using response options we’ve already covered: 12:TYPE MY OWN (guinea) and 2:dinosaur.
After managing these two prompts, the reader comes to a prompt for bluebird. If the user is unsure whether a word is an actual category exemplar (or just sounds like one), then they can press 13 and ENTER for the response option 13:GOOGLE IT. This will open the user’s default web browser and search Google using the terms: “dictionary ‘bluebird’”. When doing so, we can see that bluebird is indeed a category exemplar. Because bluebird is not in the dictionary, the reader should add it to their temporary appendix dictionary. The reader can do so by pressing 11 and ENTER for the response option 11:ADD TO DICTIONARY.
The options ADD TO DICTIONARY and TYPE MY OWN allow the user to add the original or typed response, respectively, to a temporary appendix dictionary. For TYPE MY OWN, the user will only be prompted to add the response to the temporary appendix dictionary if the typed response is not already in the (temporary) dictionary. textcleaner will use these additional words to automate the handling of future instances of these words.
These examples cover what is necessary to fully apply textcleaner to the data. At the end of the textcleaner process, the reader will be asked whether they would like to save their appendix dictionary to their computer, which allows them to use the dictionary in the future. If the user chooses to save the dictionary to their computer, then they will be asked to provide a name for this dictionary; for the tutorial, we named it appendix. The file will then be saved as appendix.dictionary.rds in the directory the user chooses. Note that the appendix dictionary does not actually update the original pre-defined dictionary, so it’s necessary to input the name of the dictionary in the dictionary argument of textcleaner when using any appendix dictionary in the future (e.g., dictionary = c("animals", "appendix")).
textcleaner Output
There are several output objects from the textcleaner function. These objects are stored in a list object, which we designated in our example as clean. These output objects are summarized in the table below.
Table 5. textcleaner and correct.changes Output Objects
Object | Nested Object | Description |
---|---|---|
binary | — | Binary response matrix where rows are participants and columns are responses. 1’s are responses given by a participant and 0’s are responses not given by a participant |
responses | clean.resp | Spell-corrected response matrix where the ordering of the original responses is preserved. Inappropriate and duplicate responses have been removed |
responses | orig.resp | Original response matrix where uppercase letters were made to lowercase and white spaces before and after responses were removed |
spellcheck | full | List of all responses whether or not they have been spell-corrected |
spellcheck | auto | List of only unique responses that were auto-corrected and corrected by the user |
removed | rows | Vector of rows for the participants with no appropriate responses |
removed | ids | Vector of the participants’ IDs with no appropriate responses |
partChanges | ID | List of list objects labeled with each participant’s ID variable. Each participant’s list contains a data frame of the specific words that were changed for the participant |
These objects can be accessed using a dollar sign (e.g., clean$responses) and nested objects can be accessed within their parent object (e.g., clean$responses$clean.resp). Some of these outputs are useful for accessing the spell-check changes that occurred. For example, clean$spellcheck contains objects that refer to the full list of original responses regardless of whether there were spelling changes ($full) and a list of unique responses that were corrected during the spell-check algorithm ($auto). The removed object contains lists of participants who were removed because of a lack of appropriate responses, which can be identified by either the participant’s row (or column; $rows) or ID variable ($ids) in the input dataset (these will be the same if no ID variable is provided). Finally, the partChanges object contains list objects, which correspond to each participant’s unique ID and the specific correction changes that were made to their responses.
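To make the nested structure tangible without running the package, the sketch below builds a toy list that borrows only the object names binary, responses, and clean.resp from Table 5. The participant IDs and responses are made up, and the way the binary matrix is derived here is our own illustration rather than the package’s code.

```r
# Toy cleaned responses: one row per participant (made-up values)
clean.resp <- rbind(c("cat", "dog", NA), c("dog", "horse", "cat"))
rownames(clean.resp) <- c("id01", "id02")

# Participant-by-response indicator matrix built from the cleaned responses
vals   <- as.vector(clean.resp)
vocab  <- sort(unique(vals[!is.na(vals)]))
binary <- t(apply(clean.resp, 1, function(r) as.integer(vocab %in% r)))
colnames(binary) <- vocab

# Toy stand-in for a textcleaner result
clean <- list(binary = binary, responses = list(clean.resp = clean.resp))

clean$responses$clean.resp   # nested object reached through its parent
clean$binary                 # 1 = response given, 0 = response not given
```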
Although textcleaner is highly efficient and automates most of the cleaning process, it’s possible that some of the auto-correction changes are incorrect. Moreover, the user may have entered a wrong response option or misspelled a response in the TYPE MY OWN option during the process. Therefore, the user may still need to make corrections to the output provided by textcleaner. To view the changes made during the spell-checking step, the reader can enter the following code:
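Based on the object named in the next paragraph (clean$spellcheck$auto), the call is presumably the View line in the sketch below. The toy clean object is a made-up stand-in so that the sketch runs without the package, and NA marks a blank “to” entry.

```r
# Toy stand-in for clean$spellcheck$auto (the real matrix comes from textcleaner)
clean <- list(spellcheck = list(auto = cbind(
  from = c("life", "creatures"),
  to   = c("louse", NA)
)))

# View() needs an interactive session (e.g., RStudio); otherwise print instead
if (interactive()) View(clean$spellcheck$auto) else print(clean$spellcheck$auto)
```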
The View function will open a tab in R allowing the user to examine a matrix containing all of the unique changes that were made during the textcleaner process (i.e., clean$spellcheck$auto). The first column of this matrix is named “from” and contains the unique raw responses given by the sample. The next several columns are all named “to” and contain the spell-corrected responses made by textcleaner. The reader should see that the first row, for example, contains the response “life” in the “from” column and “louse” in the “to” column. At first, this may seem like an incorrect change; however, “life” was auto-corrected to “lice” during the spell-checking process, which was then changed to “louse” during the plural-to-singular form process.
Another worthwhile example is in the sixth row where a continuous string (i.e., “horse cat dog pig goat fidh deer duck swan goose bird eagle giraffe lion hippo”) was separated into individual responses. It’s important that the reader checks to make sure that (1) each response was separated correctly and (2) each response is spelled correctly. Finally, in the twenty-fourth row, “creatures” appears in the “from” column and the “to” column is blank. A blank in the “to” column means that the response in the “from” column has been removed from the preprocessed data.
The reader should inspect each response in the “from” column and verify that the response(s) in the “to” column(s) are correct. If any responses in the “to” column(s) are not correct, then they need to be corrected. To do so, the function correct.changes can be used:
# Corrected 'clean' object from 'textcleaner'
corr.clean <- correct.changes(textcleaner.obj = clean,
                              dictionary = "animals",
                              incorrect = c("house", "beasts", "god",
                                            "gunny pig", "liam", "loin",
                                            "farrot", "oh my", "lizers",
                                            "teranchilla", "manster", "lamp"))
correct.changes accepts textcleaner objects only. This means that the output we stored from our textcleaner run (i.e., clean) should be input into this function (i.e., textcleaner.obj = clean). As with textcleaner, the user can also specify one or more dictionaries from SemNetDictionaries to provide potential response options. Finally, the argument incorrect is used as the input for responses that were incorrectly changed. The responses entered here should be the original responses in the “from” column. In the code above, we’ve identified several of these responses (i.e., incorrect = c("house", "beasts", "god", "gunny pig", "liam", "loin", "farrot", "oh my", "lizers", "teranchilla", "manster", "lamp")).
Similar to textcleaner, correct.changes uses an interactive menu to correct responses. The first three response options (1:TYPE MY OWN, 2:GOOGLE IT, and 3:BAD RESPONSE) are the same as those in textcleaner (Table 3). Following these response options are the potential responses from the dictionary. If a response from the dictionary does not offer the appropriate correction, then TYPE MY OWN can be used. When using TYPE MY OWN, if the old response was intended to be multiple responses (e.g., catdog), then the user should type a comma to separate the responses (e.g., cat, dog). If no comma is added, then correct.changes will consider the TYPE MY OWN response as a continuous string. BAD RESPONSE is necessary if the changed response is not correct and the old response is an inappropriate category exemplar (this also works for continuous strings).
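The comma convention for TYPE MY OWN can be pictured with a small base-R sketch. The helper split_typed is hypothetical and only mimics the behavior described above; it is not a function from SemNetCleaner.

```r
# Hypothetical helper: split a typed correction on commas and tidy each piece
split_typed <- function(typed) {
  parts <- trimws(strsplit(typed, ",", fixed = TRUE)[[1]])
  parts[parts != ""]
}

split_typed("cat, dog")     # comma present: two separate responses
split_typed("guinea pig")   # no comma: kept together as a single string
```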
The reader should run through correct.changes and make the appropriate changes. Below is a table of the changes we applied:
Table 6. Responses for correct.changes
From | To | Selection | Type My Own |
---|---|---|---|
house | mouse | 3:BAD RESPONSE | — |
beasts | yeast | 3:BAD RESPONSE | — |
god | cod | 3:BAD RESPONSE | — |
gunny pig | 'bunny' 'pig' | 4:guinea pig | — |
liam | lion | 3:BAD RESPONSE | — |
loin | loon | 4:lion | — |
farrot | parrot | 1:TYPE MY OWN | ferret |
oh my | ox | 3:BAD RESPONSE | — |
lizers | liger | 5:lizard | — |
teranchilla | chinchilla | 5:tarantula | — |
manster | hamster | 3:BAD RESPONSE | — |
lamp | lamb | 3:BAD RESPONSE | — |
When finished, correct.changes will store its output in the object corr.clean. Once again, the user should verify that all changes are correct using View(corr.clean$spellcheck$auto). This process should be repeated until all changes are correct. Once thoroughly checked for accuracy, a final .csv file can be saved to distribute these changes to others (e.g., colleagues, peer reviewers), enhancing the transparency of the preprocessing stage of SemNA. To do so, the reader can create a .csv file:
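The export call itself is not reproduced here. A minimal sketch, assuming the changes to share are the ones stored in corr.clean$spellcheck$auto and using a hypothetical file name, could look like this (the toy corr.clean object only stands in for the real output):

```r
# Toy stand-in for corr.clean$spellcheck$auto (made-up rows)
corr.clean <- list(spellcheck = list(auto = cbind(
  from = c("loin", "farrot"),
  to   = c("lion", "ferret")
)))

# Hypothetical file name; tempdir() keeps the sketch out of the working directory
out_file <- file.path(tempdir(), "spellcheck_changes.csv")
write.csv(corr.clean$spellcheck$auto, out_file, row.names = FALSE)
```

In practice, the reader would replace out_file with a path of their choosing.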
correct.changes Output
The output of correct.changes is exactly the same as textcleaner (Table 5), except that it has been corrected for the incorrect changes in the textcleaner output. Note that this output was saved in an object with a different name: corr.clean. A couple of these objects are worth detailing further because they can be used for standard verbal fluency analyses. First, the nested object clean.resp contains the cleaned verbal fluency data for each participant in the order the participant gave the responses. These data are useful for performing standard analyses of clustering and switching (e.g., Troyer, Moscovitch, & Winocur, 1997), particularly with the advent of automated scoring procedures (e.g., Kim, Kim, Wolters, MacPherson, & Park, 2019). This can be exported using the following code:
# Save .csv of clean responses
write.csv(corr.clean$responses$clean.resp, "cleaned_verbal_fluency.csv")
Second, the binary object contains the binary response matrix where each participant received a ‘1’ for a response they provided and a ‘0’ for a response they did not. This matrix can be used to total the number of appropriate responses each participant gave. This can be done using the following code:
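The code for this step is not reproduced here. Because each row of the binary matrix is a participant, one way to obtain the totals is to sum across rows; the sketch below does so on a made-up stand-in for corr.clean$binary.

```r
# Toy stand-in for corr.clean$binary (rows = participants, columns = responses)
corr.clean <- list(binary = rbind(
  id01 = c(cat = 1, dog = 1, horse = 0),
  id02 = c(cat = 1, dog = 1, horse = 1)
))

# Total number of appropriate responses per participant
totals <- rowSums(corr.clean$binary)
totals
```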
In this section we described and demonstrated how the packages SemNetDictionaries and SemNetCleaner are used to facilitate efficient and reproducible preprocessing of verbal fluency data. In this process, the raw data have been spell-checked, duplicate and inappropriate responses have been removed, monikers have been converged into one response, and a binary response matrix has been generated. The binary response matrix (corr.clean$binary) is used in the next stage of the SemNA pipeline to estimate semantic networks.
Footnote 1. Although guinea is spelled correctly, it should not be added to the dictionary because there is no animal with only the name guinea. If added, then the auto-correct functions will begin treating guinea as an appropriate category exemplar, despite it not being an appropriate response by itself.