-
Notifications
You must be signed in to change notification settings - Fork 106
Matrix
The matrix
package provides tools for dense and sparse matrices. We included things such as matrix factorization methods, matrix transforms, matrix multiplication, and serialization/deseriatization between various popular formats.
All code is centered around two interfaces
-
Matrix.java
, which provides the core methods for accessing and updating values in a matrix. -
SparseMatrix.java
, which provides access to sparse vectors in a matrix.
Beyond these two core interfaces, we the following tools centered around matrices:
- Matrix Factorization
- Matrix Smoothing or Scaling
- Matrix Serialization/Deserialization
-
YaleSparseMatrix
, using the [Yale Sparse Matrix Format] (http://en.wikipedia.org/wiki/Sparse_matrix). This matrix is ideal for sparse matrices that can fit in memory. -
GrowingSparseMatrix
, using the [Yale Sparse Matrix Format] (http://en.wikipedia.org/wiki/Sparse_matrix) this matrix can grow to any size based on the largest row and column value set. -
AtomicGrowingMatrix
, an atomic matrix that will grow to any size based on the largest row and column value set. -
AtomicGrowingSparseMatrix
, a sparse atomic matrix that will grow to any size based on the largest row and column value set. -
AtomicGrowingSparseHashMatrix
, a sparse atomic matrix, backed by a hash map, that will grow to any size based on the largest row and column value set. This provides faster access to each cell in the matrix at the cost of accessing rows as vectors. -
ArrayMatrix
, a dense, in memory matrix for reasonably small matrices. -
OnDiskMatrix
, which stores all values on disk, suitable for extremely large matrices. -
SparseOnDiskMatrix
, which stores all values on disk in a sparse format, suitable for extremely large matrices.
Matrices.java
provides addition utilities for handling matrices, such as:
- transposing a matrix.
- getting a
synchronized
instance of any otherMatrix
instance. - generating a matrix from a list of [SparseDoubleVector] (http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/vector/SparseDoubleVector.html)s or DoubleVectors.
- automatically selecting and instantiating the right
Matrix
class to use based on user-provided parameters and available system resources.
Many Semantic Space models require a reduced feature space. Matrix factorization serves as the main method for this reduction. We provide three powerful factorization methods:
- Singular Value Decomposition: A given matrix is split into three smaller factor matrices that best approximate the original matrix.
- Non-negative Matrix Factorization: A given matrix is split into two probabilistic matrices that approximate the original dataset.
- Locality Preserving Projections: Similar to SVD, this method projects a dataset into a smaller latent feature space using local features that capture and retain non-linear features of the dataset as a whole.
Both SVD and NMF factorize an original dataset into two disctint matrices. If the original data set is a word by document matrix, then the resulting factor matrices can be best described as:
- A term by latent factor matrix
- A latent factor by document matrix.
Both methods implement the generic Matrix Factorization interface that allows users to easily change the factorization method. For example, the following code will save the factor matrices generated by SVD and NMF:
Matrix dataset = ...
MatrixFactorization factorizor = new SingularValueDecompositionLibC();
MatrixIO.writeMatrix(factorizor.dataClasses(), new File("svd-ws.dat"), Format.DENSE_TEXT);
MatrixIO.writeMatrix(factorizor.classFeatures(), new File("svd-ds.dat"), Format.DENSE_TEXT);
factorizor = new NonNegativeMatrixFactorizationMultiplicative();
MatrixIO.writeMatrix(factorizor.dataClasses(), new File("nmf-ws.dat"), Format.DENSE_TEXT);
MatrixIO.writeMatrix(factorizor.classFeatures(), new File("nmf-ds.dat"), Format.DENSE_TEXT);
factorizor.dataClasses()
returns the term by latent factor matrix while factorizaor.classFeatures()
returns the latent factor by document matrix in this example. Both methods can similary be used on other data matrices, such as document by term matrices, term by term matrices, or other more interesting data sets.
We've included several smoothing metrics, which we refer to as transforms, for data matrices. These typically take a large sparse matrix consisting of either word co-occurrence counts or word document counts and reduce the value of non-informative features and boost the value of highly-informative features. We currently support the following transformations:
- [Log Entropy Transform] (http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix), which was introduced in the [Latent Semantic Analysis] (/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis) model.
- [Tf-Idf Transform] (http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix), which was introduced forin the [Vector Space Model] (/fozziethebeat/S-Space/wiki/VectorSpaceModel).
- Correlation Transform, which was introduced for the COALS model.
- Log Likelihood Transform.
- Point-Wise Mutual Information Transform, which was introduced for the Dependency Vector Space model.
We are also adding matrix normalization techniques such as
- Row Magnitude Transform, which normalizes each row to have a magnitude of 1.
All of the above methods implement the same Transform interface, which can be applied to both in memory matrices, and our variety of on disk matrix formats. To transform an existing matrix with one of the above methods, simply do
Matrix dataset = ...
Transform transform = ...
Matrix transformed = transform.transform(dataset);
And to Transform serialized matrix, simply do
Transform transform = ...
File dataset = ...
Format datasetFormat = ...
File transformed = transform.transform(dataset, datasetFormat)
We provide support for a variety of external data formats for matrix files. Each format has it's own set of trade offs. In particular, some formats are optimal for sparse datasets, while others are best for dense datasets. Similary, binary formats are compact in size, but a not human readable while text based formats consume more disk space.
- [SVDLIBC_SPARSE_TEXT] (http://tedlab.mit.edu/~dr/svdlibc/SVD_F_ST.html)
- [SVDLIBC_DENSE_TEXT] (http://tedlab.mit.edu/~dr/svdlibc/SVD_F_DT.html)
- [SVDLIBC_SPARSE_BINARY] (http://tedlab.mit.edu/~dr/svdlibc/SVD_F_SB.html)
- [SVDLIBC_DENSE_BINARY] (http://tedlab.mit.edu/~dr/svdlibc/SVD_F_DB.html)
- [MATLAB_SPARSE] (http://bebop.cs.berkeley.edu/smc/formats/matlab.html)
- DENSE_TEXT, Each row has it's own line, with all column values specified
- [CLUTO_DENSE_TEXT] (http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf) (same format as DENSE_TEXT)
- [CLUTO_SPARSE_TEXT] (http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf)
MatrixIO
provides serialization and deserialziation support for Matrix
instances. For example, the following code will read a SVDLIBC_SPARSE_BINARY
file from disk and load it into a new Matrix
:
File matrixFile = ...
Matrix matrix = MatrixIO.readMatrix(matrixFile, Format.SVDLIBC_SPARSE_BINARY);
Similarly, the following will write the now in memory matrix into the MATLAB_SPARSE
format:
File outputFile = ...
MatrixIO.writeMatrix(matrix, outputFile, Format.MATLAB_SPARSE);
We also provide support for iterating through values in a stored matrix through the an iterator. For example
Iterator<MatrixEntry> entryIter = MatrixIO.getMatrixFileIterator(outputFile, Format.MATLAB_SPARSE);
while (entryIter.hasNext()) {
MatrixEntry entry = entryIter.hasNext();
System.out.printf("%d %d %f\n", entry.row(), entry.column(), entry.value());
}
will iterate through every value that was written to outputFile
and print the row, column, and cell value.