Estimating Semantic Networks

Vignette taken directly from Christensen & Kenett (2020)

With the binary response matrix, semantic networks can now be estimated. In the last few years, various computational approaches have been proposed to estimate semantic networks from verbal fluency data (Goñi et al., 2011; Kenett et al., 2013; Lerner, Ogrocki, & Thomas, 2009; Zemla & Austerweil, 2018). Moreover, a number of R packages are capable of estimating semantic networks [e.g., corpustools; Welbers & van Atteveldt (2018)] and networks more generally (e.g., igraph and qgraph). As described earlier, this tutorial follows the approach developed by Kenett and colleagues, which estimates semantic networks from correlations between the association profiles of verbal fluency responses across the sample (Borodkin, Kenett, Faust, & Mashal, 2016; Kenett, Beaty, Silvia, Anaki, & Faust, 2016; Kenett et al., 2013).

The SemNetCleaner, SemNeT, and NetworkToolbox packages in R will be used to execute this stage of the pipeline. The SemNetCleaner package will be used to further process the binary response matrix into a finalized format for network estimation. The SemNeT package contains several functions for the analysis of semantic networks, including a function to compute the association profiles of verbal fluency responses. The NetworkToolbox package contains functions for network analysis more generally, including functions to estimate and analyze networks. This package will be used to estimate the semantic networks from the association profile matrices.

Process

Kenett and colleagues’ approach begins by splitting the binary response matrix into groups. Next, for each group, only responses that are provided by two or more participants are retained (e.g., Borodkin et al., 2016). This is done to minimize spurious associations driven by idiosyncratic responses in the sample. Finally, the binary response matrices are “equated”: responses are matched across groups so that each group retains only the responses that were also given in every other group (Kenett et al., 2013).

This step is particularly important because some groups may have a different number of responses (i.e., nodes), which can introduce confounding factors [e.g., biased comparison of network parameters; van Wijk, Stam, & Daffertshofer (2010)]. By equating the binary response matrices, the networks can be compared using the same nodes, ruling out alternative explanations of the results (e.g., difference in network structure) that could be due to differences in the number of nodes (Borodkin et al., 2016). Once this process is complete, the networks can be estimated using a network estimation method.
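As a rough illustration of the two steps described above, the following base R sketch applies them to two small made-up response matrices. The toy data and the helper names keep_frequent_responses and equate_two are hypothetical and are not functions from SemNetCleaner or SemNeT; in practice, the packaged functions used later in this vignette should be preferred.

```r
# Toy binary response matrices (rows = participants, columns = responses).
# matrix() fills column by column, so each line of values is one response.
low_toy <- matrix(c(1, 1, 0,   # cat
                    1, 0, 0,   # dog
                    1, 1, 1,   # emu
                    0, 1, 0),  # fox
                  nrow = 3,
                  dimnames = list(NULL, c("cat", "dog", "emu", "fox")))
high_toy <- matrix(c(1, 0, 1,   # cat
                     1, 1, 0,   # dog
                     0, 1, 1,   # emu
                     1, 0, 0),  # owl
                   nrow = 3,
                   dimnames = list(NULL, c("cat", "dog", "emu", "owl")))

# Hypothetical helper: keep responses given by at least 'min_case' participants
keep_frequent_responses <- function(brm, min_case = 2) {
  brm[, colSums(brm) >= min_case, drop = FALSE]
}

# Hypothetical helper: keep only the responses present in both groups
equate_two <- function(a, b) {
  shared <- intersect(colnames(a), colnames(b))
  list(a = a[, shared, drop = FALSE], b = b[, shared, drop = FALSE])
}

low_kept <- keep_frequent_responses(low_toy)
high_kept <- keep_frequent_responses(high_toy)
eq_toy <- equate_two(low_kept, high_kept)

# Both groups now share the same set of nodes
colnames(eq_toy$a)
colnames(eq_toy$b)
```

In this toy example, "dog" survives the frequency filter only in the second group, so it is dropped when the groups are equated and both matrices end up with identical columns.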

We continue with the example dataset analyzed by Christensen, Kenett, Cotter, Beaty, & Silvia (2018), who estimated and compared the semantic networks of two groups: low and high openness to experience. While we focus on estimating and comparing two groups, the functions in our R packages are capable of handling more than two groups.

Preparation for network estimation

The binary response matrix (i.e., corr.clean$binary) from the preprocessing step contains the responses for both the low and high openness to experience groups. To continue with our pipeline, we need to separate the binary response matrix into two groups. This can be done using the Group variable with the following code:

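# Load the packages used in this stage of the pipeline
library(SemNetCleaner)
library(SemNeT)
library(NetworkToolbox)
# Attach 'Group' variable to the binary response matrix
behav <- cbind(open.animals$Group, corr.clean$binary)
# Create low and high openness to experience response matrices
# (the first column holds the group code: 1 = low, 2 = high)
low <- behav[which(behav[,1]==1),-1]
high <- behav[which(behav[,1]==2),-1]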

The resulting matrices are the binary response matrices for the low and high openness to experience groups. For users who would like to use other network estimation methods that are not included in R, these binary response matrices can be exported using the following code:

# Save binary response matrices
write.csv(low, "low_BRM.csv", row.names = TRUE)
write.csv(high, "high_BRM.csv", row.names = TRUE)
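For readers who later want to bring an exported matrix back into R, the round trip might look like the sketch below. It uses a small made-up matrix and a temporary file rather than the vignette's data; the object names are hypothetical. The check.names = FALSE argument is included on the assumption that some responses contain spaces, which read.csv would otherwise alter.

```r
# Made-up binary response matrix with participant IDs as row names
brm_export <- matrix(c(1, 0, 1,
                       0, 1, 1),
                     nrow = 3,
                     dimnames = list(c("p1", "p2", "p3"),
                                     c("cat", "polar bear")))

# Write to a temporary file, keeping the row names as the first column
csv_path <- tempfile(fileext = ".csv")
write.csv(brm_export, csv_path, row.names = TRUE)

# Read it back: the first column becomes the row names, and
# check.names = FALSE keeps "polar bear" from becoming "polar.bear"
brm_back <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
```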

Continuing with our pipeline, we aim to minimize the number of spurious associations in the network. This can be executed with the following code:

# Finalize matrices so that each response
# has been given by at least two participants
final.low <- finalize(low, minCase = 2)
final.high <- finalize(high, minCase = 2)

The finalize function removes responses (columns) that were not given by a minimum number of participants. This minimum can be set using the minCase argument, which defaults to 2, consistent with our approach; however, users may wish to set a higher minimum to further guard against spurious associations. Next, the responses are equated to control for differences in the number of nodes. To do this, the following code can be used:

# Equate the responses across the networks
eq <- equate(final.low, final.high)
equate.low <- eq$final.low
equate.high <- eq$final.high

The equate function will match the responses across any number of groups. If there are more than two groups, then they simply need to be entered (separated by commas) into the function. The output of equate is a list of binary response matrices that have been matched across groups. Each group’s matrix is nested in the output and labeled with the name of the object used as input (e.g., input = final.low and output = eq$final.low).

Now that the binary response matrix has been separated into two groups based on our behavioral measure and the responses have been equated between the two groups, the networks can be estimated.

Network estimation

The network estimation method that Kenett et al. apply produces what are called correlation-based networks (Zemla & Austerweil, 2018). They are called correlation-based networks because the network is estimated from how often responses co-occur across the group (Borodkin et al., 2016; Kenett et al., 2013). Common association measures that have been used with this approach are Pearson’s pairwise correlation (e.g., Kenett et al., 2013) and cosine similarity (e.g., Christensen, Kenett, Cotter, et al., 2018). Thus, the nodes in this network represent verbal fluency responses and the edges represent their association.

In our example of the work by Christensen, Kenett, Cotter, et al. (2018), the cosine similarity was used to compute the association profiles of the responses. We can apply this similarity measure with the following code:

# Compute cosine similarity for the 'low' and
# 'high' equated binary response matrices
cosine.low <- similarity(equate.low, method = "cosine")
cosine.high <- similarity(equate.high, method = "cosine")

The similarity function in the SemNeT package computes an association matrix from the equated response matrices. The method argument selects the association measure that is used. Here, we use the "cosine" similarity measure; however, there are a number of other similarity measures, such as Pearson’s correlation (method = "cor"), that can be applied (see ?similarity for more options). With these association matrices, a network estimation method can be applied.
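To make the association measure concrete, here is a minimal base R sketch of cosine similarity between the columns of a binary response matrix, alongside Pearson's correlation for comparison. The toy data and the function name cosine_sketch are hypothetical; this is not the SemNeT implementation, only an illustration of the standard formula (the dot product of two response columns divided by the product of their lengths).

```r
# Made-up binary response matrix: 4 participants x 3 responses.
# matrix() fills column by column, so each line of values is one response.
brm_sim <- matrix(c(1, 1, 0, 0,   # cat
                    1, 1, 1, 0,   # dog
                    0, 0, 1, 1),  # emu
                  nrow = 4,
                  dimnames = list(NULL, c("cat", "dog", "emu")))

# Hypothetical helper: cosine similarity between every pair of columns
cosine_sketch <- function(brm) {
  cross <- crossprod(brm)     # t(brm) %*% brm: co-occurrence counts
  norms <- sqrt(diag(cross))  # Euclidean length of each response column
  cross / outer(norms, norms)
}

cos_sim <- cosine_sketch(brm_sim)
pearson_sim <- cor(brm_sim)

cos_sim["cat", "emu"]      # never co-occur: cosine similarity of 0
pearson_sim["cat", "emu"]  # Pearson treats the same pair as perfectly negative
```

The sketch assumes every response was given at least once (which holds after the minimum-case filter); a column of all zeros would produce a division by zero. The contrast in the last two lines shows why the choice of measure matters: with binary data, cosine similarity is bounded between 0 and 1, whereas Pearson's correlation can be negative.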

To further minimize spurious relations, we proceed to apply a filter over each association matrix. The purpose of applying a network filtering method is to minimize spurious associations and retain the most relevant information in the network (Tumminello, Aste, Di Matteo, & Mantegna, 2005). Network estimation methods have certain criteria for retaining edges (e.g., statistical significance), which creates a more parsimonious model (Barfuss, Massara, Di Matteo, & Aste, 2016). In Kenett and colleagues’ approach, a family of network estimation methods known as Information Filtering Networks (Barfuss et al., 2016; Christensen, Kenett, Aste, Silvia, & Kwapil, 2018) has been applied.

The Information Filtering Networks methods apply various geometric constraints on the associations of the data to identify the most relevant information between nodes (i.e., edges) in a network (Christensen, Kenett, Aste, et al., 2018). Common Information Filtering Network approaches are the minimal spanning tree (Mantegna, 1999), planar maximally filtered graph (Tumminello et al., 2005), triangulated maximally filtered graph (Massara, Di Matteo, & Aste, 2016), and maximally filtered clique forest (Massara & Aste, 2019).

In Christensen, Kenett, Cotter, et al. (2018), the triangulated maximally filtered graph (TMFG) method was applied. The TMFG algorithm identifies the most important edges in a network by first connecting the four nodes that have the highest sum of edge weights (i.e., association) across all nodes. Next, the algorithm identifies and adds an additional node, which maximizes its sum of edge weights to the other connected nodes. The algorithm continues until every node is connected in the network (Massara et al., 2016).
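The steps just described can be sketched in base R as follows. This is a simplified illustration written for this explanation, not the NetworkToolbox implementation: the function name tmfg_sketch is hypothetical, the seed nodes are chosen by their plain row sums, each new node is attached to the three nodes of one existing triangle, ties are broken by taking the first maximum, and the input is assumed to be a symmetric association matrix with positive off-diagonal values.

```r
# Simplified TMFG-style filter (illustrative sketch only)
tmfg_sketch <- function(W) {
  n <- nrow(W)
  stopifnot(n >= 4, ncol(W) == n)
  diag(W) <- 0
  A <- matrix(0, n, n, dimnames = dimnames(W))

  # Step 1: fully connect the four nodes with the largest total association
  seed <- order(rowSums(W), decreasing = TRUE)[1:4]
  seed_pairs <- combn(seed, 2)
  for (k in seq_len(ncol(seed_pairs))) {
    i <- seed_pairs[1, k]
    j <- seed_pairs[2, k]
    A[i, j] <- A[j, i] <- W[i, j]
  }
  faces <- combn(seed, 3, simplify = FALSE)  # the four triangles of the seed
  remaining <- setdiff(seq_len(n), seed)

  # Step 2: repeatedly add the node with the largest summed association
  # to the three nodes of some existing triangle
  while (length(remaining) > 0) {
    gain <- sapply(faces, function(f) rowSums(W[remaining, f, drop = FALSE]))
    gain <- matrix(gain, nrow = length(remaining))  # rows: nodes, cols: faces
    best <- which(gain == max(gain), arr.ind = TRUE)[1, ]
    v <- remaining[best[1]]
    f <- faces[[best[2]]]

    # Connect the new node to the three nodes of the chosen triangle
    A[v, f] <- W[v, f]
    A[f, v] <- W[f, v]

    # The chosen triangle is replaced by the three triangles it is split into
    faces[[best[2]]] <- NULL
    faces <- c(faces, list(c(v, f[1], f[2]), c(v, f[1], f[3]), c(v, f[2], f[3])))
    remaining <- setdiff(remaining, v)
  }
  A
}

# Usage on a made-up symmetric association matrix with 8 nodes
set.seed(1)
n_toy <- 8
W_toy <- matrix(runif(n_toy^2, 0.1, 1), n_toy, n_toy)
W_toy <- (W_toy + t(W_toy)) / 2
diag(W_toy) <- 1
net_toy <- tmfg_sketch(W_toy)

# Number of edges retained by the filter
sum(net_toy[upper.tri(net_toy)] != 0)
```

In this sketch the four seed nodes contribute six edges and every node added afterwards contributes exactly three, so the number of retained edges depends only on the number of nodes and not on the association values.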

The resulting network has 3n − 6 edges (where n is the number of nodes) and is a planar network [i.e., it could be depicted on a theoretical plane without any edges crossing; Tumminello et al. (2005)]. Because the number of edges is a function of the number of nodes, networks with the same number of nodes will have the same number of edges. This is advantageous for comparing network structures because it reduces the confound of differences between networks being due to differences in the number of edges (Christensen, Kenett, Aste, et al., 2018; van Wijk et al., 2010). The TMFG method can be implemented using the NetworkToolbox package in R (see footnote 1) with the following code:

# Estimate 'low' and 'high' openness to experience networks
net.low <- TMFG(cosine.low)$A
net.high <- TMFG(cosine.high)$A

The output of these functions is a TMFG filtered semantic network for the low and high openness to experience groups. To save these networks outside of R so that other programs can be applied, the following code can be used:

# Save the networks
write.csv(net.low, "network_low.csv", row.names = FALSE)
write.csv(net.high, "network_high.csv", row.names = FALSE)
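Because a TMFG network with n nodes is expected to have 3n − 6 edges, a quick sanity check on an estimated or re-imported network is to count its edges. The sketch below is illustrative only: the helper names are hypothetical, and the toy network is simply four fully connected nodes (the smallest case, with 3 × 4 − 6 = 6 edges).

```r
# Hypothetical helpers for checking the edge count of an undirected network
count_edges <- function(A) sum(A[upper.tri(A)] != 0)
expected_tmfg_edges <- function(n) 3 * n - 6

# Toy weighted network: four nodes, all connected to one another
A_check <- matrix(0.5, nrow = 4, ncol = 4)
diag(A_check) <- 0

edges_found <- count_edges(A_check)
edges_expected <- expected_tmfg_edges(nrow(A_check))
edges_found == edges_expected
```

Applied to the vignette's objects, count_edges(net.low) and count_edges(net.high) would be expected to match expected_tmfg_edges(ncol(net.low)), assuming none of the retained edges has an association of exactly zero.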

These networks are weighted, meaning that the edges correspond to the magnitude of association between nodes. It’s common, however, for the edges to be converted to binary values [i.e., 1 = edge present and 0 = edge absent; Abbott, Austerweil, & Griffiths (2015); Kenett et al. (2013); Kenett, Anaki, & Faust (2014)]. To convert a weighted network into one that is unweighted, the binarize function can be used:

# Binarize the networks (optional)
net.low <- binarize(net.low)
net.high <- binarize(net.high)
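Conceptually, binarizing replaces every nonzero edge weight with 1 and leaves absent edges at 0. A minimal base R sketch of that idea is shown below; the function name binarize_sketch and the toy network are hypothetical, and the sketch is assumed, not verified, to match the behavior of the packaged binarize function.

```r
# Made-up weighted network with three nodes
A_weighted <- matrix(c(0,   0.8, 0,
                       0.8, 0,   0.3,
                       0,   0.3, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("cat", "dog", "emu"),
                                     c("cat", "dog", "emu")))

# Hypothetical helper: 1 where an edge is present, 0 where it is absent
binarize_sketch <- function(A) {
  (A != 0) * 1
}

A_binary <- binarize_sketch(A_weighted)
A_binary
```

Note that the vignette's code overwrites net.low and net.high with their binarized versions; assigning the result to a new object, as in the sketch, keeps the weighted network available for weighted measures.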

It’s worth noting that, despite differences in edge weights, it has been shown that weighted and unweighted semantic networks typically correspond to one another (Abbott et al., 2015). When computing network measures in SemNeT, the edges will be binarized by default, meaning the statistics are computed for unweighted measures. There are options, however, to compute the weighted measures when the networks are left as weighted; therefore, it’s often preferred to keep the networks as weighted.

Summary

In this section, we discussed and applied one approach for estimating group-based semantic networks using functions in SemNetCleaner, SemNeT, and NetworkToolbox. In this process, the binary response matrix was split into groups, idiosyncratic responses were removed, and group binary response matrices were equated (using SemNetCleaner). Then, a similarity measure was applied to these group matrices (using SemNeT) and a network estimation method was applied (using NetworkToolbox).

Notably, there are other approaches for estimating semantic networks (e.g., Zemla & Austerweil, 2018). These other approaches fit seamlessly into our SemNA pipeline. For example, the binary response matrix from the preprocessing step can be used in another network estimation procedure. The output of the network estimation step is one or more networks that are ready to be analyzed in the statistical analysis step of the pipeline. Effectively, this makes the network estimation step in the pipeline exchangeable with any other network estimation procedure.

For next steps, see the Analyzing_Networks vignette in the SemNeT package.

References

Abbott, J. T., Austerweil, J. L., & Griffiths, T. L. (2015). Random walks on semantic networks can resemble optimal foraging. Psychological Review, 122, 558–569. https://doi.org/10.1037/a0038693
Barfuss, W., Massara, G. P., Di Matteo, T., & Aste, T. (2016). Parsimonious modeling with information filtering networks. Physical Review E, 94, 062306. https://doi.org/10.1103/PhysRevE.94.062306
Borodkin, K., Kenett, Y. N., Faust, M., & Mashal, N. (2016). When pumpkin is closer to onion than to squash: The structure of the second language lexicon. Cognition, 156, 60–70. https://doi.org/10.1016/j.cognition.2016.07.014
Christensen, A. P., & Kenett, Y. N. (2020). Semantic network analysis (SemNA): A tutorial on preprocessing, estimating, and analyzing semantic networks. PsyArXiv. https://doi.org/10.31234/osf.io/eht87
Christensen, A. P., Kenett, Y. N., Aste, T., Silvia, P. J., & Kwapil, T. R. (2018). Network structure of the Wisconsin Schizotypy Scales–Short Forms: Examining psychometric network filtering approaches. Behavior Research Methods, 50, 2531–2550. https://doi.org/10.3758/s13428-018-1032-9
Christensen, A. P., Kenett, Y. N., Cotter, K. N., Beaty, R. E., & Silvia, P. J. (2018). Remotely close associations: Openness to experience and semantic memory structure. European Journal of Personality, 32, 480–492. https://doi.org/10.1002/per.2157
Goñi, J., Arrondo, G., Sepulcre, J., Martincorena, I., de Mendizábal, N. V., Corominas-Murtra, B., … Villoslada, P. (2011). The semantic organization of the animal category: Evidence from semantic verbal fluency and network theory. Cognitive Processing, 12, 183–196. https://doi.org/10.1007/s10339-010-0372-x
Kenett, Y. N., Anaki, D., & Faust, M. (2014). Investigating the structure of semantic networks in low and high creative persons. Frontiers in Human Neuroscience, 8, 407. https://doi.org/10.3389/fnhum.2014.00407
Kenett, Y. N., Beaty, R. E., Silvia, P. J., Anaki, D., & Faust, M. (2016). Structure and flexibility: Investigating the relation between the structure of the mental lexicon, fluid intelligence, and creative achievement. Psychology of Aesthetics, Creativity, and the Arts, 10, 377–388. https://doi.org/10.1037/aca0000056
Kenett, Y. N., Wechsler-Kashi, D., Kenett, D. Y., Schwartz, R. G., Ben Jacob, E., & Faust, M. (2013). Semantic organization in children with cochlear implants: Computational analysis of verbal fluency. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00543
Lerner, A. J., Ogrocki, P. K., & Thomas, P. J. (2009). Network graph analysis of category fluency testing. Cognitive and Behavioral Neurology, 22, 45–52. https://doi.org/10.1097/WNN.0b013e318192ccaf
Mantegna, R. N. (1999). Hierarchical structure in financial markets. The European Physical Journal B - Condensed Matter and Complex Systems, 11, 193–197. https://doi.org/10.1007/s100510050929
Massara, G. P., & Aste, T. (2019). Learning clique forests. arXiv. Retrieved from https://arxiv.org/abs/1905.02266
Massara, G. P., Di Matteo, T., & Aste, T. (2016). Network filtering for big data: Triangulated Maximally Filtered Graph. Journal of Complex Networks, 5, 161–178. https://doi.org/10.1093/comnet/cnw015
Tumminello, M., Aste, T., Di Matteo, T., & Mantegna, R. N. (2005). A tool for filtering information in complex systems. Proceedings of the National Academy of Sciences, 102, 10421–10426. https://doi.org/10.1073/pnas.0500298102
van Wijk, B. C. M., Stam, C. J., & Daffertshofer, A. (2010). Comparing brain networks of different size and connectivity density using graph theory. PLoS ONE, 5, e13701. https://doi.org/10.1371/journal.pone.0013701
Welbers, K., & van Atteveldt, W. (2018). corpustools: Managing, querying and analyzing tokenized text. Retrieved from https://CRAN.R-project.org/package=corpustools
Zemla, J. C., & Austerweil, J. L. (2018). Estimating semantic networks of groups and individuals from fluency data. Computational Brain & Behavior, 1, 36–58. https://doi.org/10.1007/s42113-018-0003-7

  1. Note that other filtering methods can also be applied using the NetworkToolbox package, including the minimal spanning tree, maximally filtered clique forest, and several thresholding methods.