LatentSemanticAnalysis
Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is the number of times the row's word occurs in the column's document. The [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) of the word-document matrix is then calculated, producing three matrices (UΣVᵀ): U, the word space; Σ, the singular values; and V, the document space. U is then truncated to its first few columns (typically 300), so that each row of the truncated U is the final semantic vector for one word.
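As an illustration of the first step only, here is a minimal sketch of building the word-by-document count matrix. It is not the S-Space implementation: the class `TermDocumentCounts`, its method names, and the lowercase-and-split-on-whitespace tokenizer are all assumptions made for this example, and the SVD and truncation steps are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of S-Space: builds a word-by-document count matrix. */
final class TermDocumentCounts {

    /** Assumed tokenizer: lowercase the text, then split on whitespace. */
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String token : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Maps each unique word to a row index, in order of first appearance. */
    static Map<String, Integer> indexWords(List<String> documents) {
        Map<String, Integer> rowOf = new LinkedHashMap<>();
        for (String document : documents) {
            for (String token : tokenize(document)) {
                rowOf.putIfAbsent(token, rowOf.size());
            }
        }
        return rowOf;
    }

    /** counts[row][col] = occurrences of the row's word in the column's document. */
    static int[][] countMatrix(List<String> documents, Map<String, Integer> rowOf) {
        int[][] counts = new int[rowOf.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String token : tokenize(documents.get(col))) {
                counts[rowOf.get(token)][col]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("the cat sat", "the dog sat on the cat");
        Map<String, Integer> rowOf = indexWords(documents);
        int[][] counts = countMatrix(documents, rowOf);
        // Print one line per word: the word, then its row of per-document counts.
        for (Map.Entry<String, Integer> entry : rowOf.entrySet()) {
            System.out.println(entry.getKey() + "\t" + Arrays.toString(counts[entry.getValue()]));
        }
    }
}
```

A dense `int[][]` keeps the sketch short. A real corpus yields a very sparse matrix, so a sparse representation would be the more realistic choice.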
For more information on LSA, see the [Wikipedia page](http://en.wikipedia.org/wiki/Latent_semantic_analysis) on LSA. The following papers also give a good introduction to the uses of LSA:
- T. K. Landauer and S. T. Dumais, "A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge," Psychological Review, vol. 104, pp. 211–240, 1997. Available [here](http://lsa.colorado.edu/papers/plato/plato.annote.html).
- T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis," Discourse Processes, vol. 25, pp. 259–284, 1998. Available [here](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf).
The current S-Space implementation of LSA is captured in two files. LatentSemanticAnalysis.java contains all of the algorithmic implementation and is suitable for use in other code as a library. LSAMain.java is a command-line invokable version of LSA that uses the LatentSemanticAnalysis class.
Our LSA implementation requires installation of a [Singular Value Decomposition](/fozziethebeat/S-Space/wiki/SingularValueDecomposition) method.
LSA can be invoked either using `java edu.ucla.sspace.mains.LSAMain` or through the jar release with `java -jar lsa.jar`. Both ways are equivalent.
We provide the following options for changing the behavior of LSA. For standard options, see Mains.
- LSA Options
  - `-n, --dimensions <int>`: how many dimensions to use for the LSA vectors. See LatentSemanticAnalysis for the default value.
  - `-p, --preprocess <class name>`: specifies an instance of a Transform to use in preprocessing the word-document matrix compiled by LSA prior to computing the SVD.
- Advanced Options
  - `-S, --svdAlgorithm`: manually specifies which SVD algorithm should be used internally. This option should not normally be needed, as LSA will select the fastest algorithm available. If it is needed, valid options are: SVDLIBC, SVDLIBJ, MATLAB, OCTAVE, JAMA and COLT.
The LSA program is the definitive authority on the current set of options and their configurations. If you find an option is incorrectly specified on this page, please [let us know](mailto:[email protected]). Full documentation may be found on the command line by running the lsa.jar program without any options.
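To make the idea behind the `--preprocess` option concrete, here is a minimal sketch of one common reweighting that can be applied to a word-document count matrix before the SVD: TF-IDF. The class `TfIdfSketch` and its `reweight` method are hypothetical and written only for this example. They do not implement the S-Space Transform interface, and this page does not say which Transform classes are available.

```java
import java.util.Arrays;

/** Hypothetical stand-alone illustration of reweighting counts before the SVD. */
final class TfIdfSketch {

    /**
     * Returns a matrix where weighted[w][d] = tf * idf, with
     *   tf  = counts[w][d] / (total word count of document d)
     *   idf = ln(number of documents / number of documents containing word w).
     * Cells with a zero count stay at 0.0.
     */
    static double[][] reweight(int[][] counts) {
        int words = counts.length;
        int docs = words == 0 ? 0 : counts[0].length;

        int[] docLength = new int[docs];  // total tokens per document (column sums)
        int[] docFreq = new int[words];   // number of documents containing each word
        for (int w = 0; w < words; w++) {
            for (int d = 0; d < docs; d++) {
                docLength[d] += counts[w][d];
                if (counts[w][d] > 0) {
                    docFreq[w]++;
                }
            }
        }

        double[][] weighted = new double[words][docs];
        for (int w = 0; w < words; w++) {
            for (int d = 0; d < docs; d++) {
                if (counts[w][d] == 0) {
                    continue;  // also avoids dividing by a zero docFreq or docLength
                }
                double tf = (double) counts[w][d] / docLength[d];
                double idf = Math.log((double) docs / docFreq[w]);
                weighted[w][d] = tf * idf;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        // Rows are words, columns are documents, as in the LSA word-document matrix.
        int[][] counts = { {1, 2}, {1, 0} };
        for (double[] row : reweight(counts)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In this sketch a word that occurs in every document gets an idf of zero, so its whole row is zeroed out. That is the intended effect of this kind of preprocessing: it down-weights words that do little to distinguish one document from another.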
Generates a simple .sspace file with the default 300 dimensions.
java -jar lsa.jar -d corpus.txt my-lsa-output.sspace
Has the JVM use 8GB of RAM when performing LSA (more RAM is almost always better)
java -Xmx8g -jar lsa.jar -d corpus.txt my-lsa-output.sspace
Removes stop words from the corpus while processing. (Note: LSA doesn't do this in the original papers)
java -Xmx8g -jar lsa.jar -d corpus.txt -F exclude=stopwords.txt my-lsa-output-no-stopwords.sspace
Generates an LSA space with 500 dimensions
java -Xmx8g -jar lsa.jar -d corpus.txt -n 500 my-lsa-output-500dim.sspace
Generates an LSA space with known compound words
java -Xmx8g -jar lsa.jar -d corpus.txt -C my-list-of-ngrams.txt my-lsa-output-with-ngrams.sspace
Runs LSA with SVDLIBJ specifically (Note: the algorithm choice shouldn't affect the final vector values - only the runtime of LSA)
java -Xmx8g -jar lsa.jar -d corpus.txt -S SVDLIBJ my-lsa-output.sspace
- We are grateful for the advice and assistance of Tom Landauer, Walter Kintsch and Praful Mangalath of the Latent Semantic Analysis group at the University of Colorado, Boulder.
- We are grateful to Doug Rohde for making the SVDLIBC program freely available.
- We are very grateful to Adrian Kuhn and David Erni for creating SVDLIBJ by porting SVDLIBC to Java.