Coals
Coals is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-word matrix in which each element represents how frequently word_i co-occurs with word_j. The matrix is then normalized by correlation: negative values are set to zero, and every other value is replaced by its square root. Optionally, the word co-occurrence matrix M is then reduced using the [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition), and the U matrix is retained as the definitive word space.
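The normalization step can be illustrated with a minimal sketch. It assumes a small dense matrix and the Pearson correlation formula from the Rohde et al. paper cited on this page; the class name `CoalsNormalizationSketch` and the method `normalize` are hypothetical and are not part of the S-Space API, whose actual implementation may differ in detail.

```java
import java.util.Arrays;

/** Illustrative sketch only; not the S-Space implementation. */
final class CoalsNormalizationSketch {

    /**
     * Converts raw co-occurrence counts to correlations, then sets negative
     * values to zero and replaces every other value with its square root.
     */
    static double[][] normalize(double[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double[][] result = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double numerator = total * counts[i][j] - rowSums[i] * colSums[j];
                double denominator = Math.sqrt(rowSums[i] * (total - rowSums[i])
                        * colSums[j] * (total - colSums[j]));
                // Assumption: treat an undefined correlation (zero denominator) as zero.
                double correlation = denominator == 0 ? 0 : numerator / denominator;
                result[i][j] = correlation > 0 ? Math.sqrt(correlation) : 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] counts = { {4, 0}, {0, 4} };
        // prints [[1.0, 0.0], [0.0, 1.0]]
        System.out.println(Arrays.deepToString(normalize(counts)));
    }
}
```

In the toy input, each word co-occurs only with itself, so the diagonal correlations are 1 and the off-diagonal correlations are -1, which the clipping step sets to zero.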
For more information on Coals, the following paper is the central resource:
- D. L. T. Rohde, L. M. Gonnerman, D. C. Plaut, "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence." Cognitive Science
The current S-Space implementation of Coals is captured in two files. `Coals.java` contains all of the algorithmic implementation and is suitable for use in other code as a library. `CoalsMain.java` is a command-line invokable version of Coals that uses the `Coals` class. This class is provided as `coals.jar` in the release packages.
Coals requires that a [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) method be installed.
Coals can be invoked either using `java edu.ucla.sspace.mains.CoalsMain` or through the jar release with `java -jar coals.jar`. Both ways are equivalent.
We provide the following options for changing the behavior of Coals. Standard options can be found here.
- `-s | --reducedDimension <int>`: Set the number of dimensions to reduce to using the Singular Value Decomposition. This is used only if `--reduce` is set.
- `-n | --dimensions <int>`: Set the number of columns to keep in the raw co-occurrence matrix.
- `-m | --maxWords <int>`: Set the maximum number of words to keep in the space, ordered by frequency.
- `-r | --reduce`: If set, the co-occurrence matrix is reduced using the Singular Value Decomposition.
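The frequency cutoff described for `--maxWords` can be sketched as follows. This is only an illustration of keeping the most frequent words: the class `FrequencyCutoffSketch`, the method `mostFrequent`, and the alphabetical tie-breaking are assumptions for the example, not the behavior or API of the `Coals` class.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative sketch only; not the S-Space implementation. */
final class FrequencyCutoffSketch {

    /**
     * Returns up to {@code limit} words, most frequent first.
     * Assumption: ties are broken alphabetically so the result is deterministic.
     */
    static List<String> mostFrequent(Map<String, Integer> counts, int limit) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For example, with counts `the=10, cat=3, dog=3, emu=1` and a limit of 3, the sketch keeps `the`, `cat`, and `dog`.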
The program will then produce a file that contains the entire semantic space. Each line in the file is formatted as follows:
word name|value-1 value-2 ... value-N
where N is the number of dimensions in the semantic space.
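A line in this format could be read back with a sketch like the one below. It assumes that the word is everything before the last `|` and that the values are separated by whitespace; the class `SpaceLineSketch` and its methods are hypothetical helpers, not part of S-Space, and the sketch makes no assumption about any other lines the file may contain.

```java
import java.util.Arrays;

/** Illustrative sketch only; parses one "word|value-1 value-2 ... value-N" line. */
final class SpaceLineSketch {

    /** Returns the word portion of the line: everything before the last '|'. */
    static String parseWord(String line) {
        int bar = line.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("no '|' separator: " + line);
        }
        return line.substring(0, bar);
    }

    /** Returns the N vector values that follow the last '|'. */
    static double[] parseVector(String line) {
        int bar = line.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("no '|' separator: " + line);
        }
        String values = line.substring(bar + 1).trim();
        if (values.isEmpty()) {
            return new double[0];
        }
        return Arrays.stream(values.split("\\s+"))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }
}
```

Splitting on the last `|` rather than the first keeps the sketch usable if a word itself contains that character, since the numeric values never do.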
- We are grateful to Doug Rohde for making the SVDLIBC program freely available.