DocumentRepresentations

David Jurgens edited this page Apr 6, 2015 · 3 revisions

Representing documents using Semantic Spaces

Introduction

A SemanticSpace model provides representations of terms based on their usage. Using these term representations, the contents of a document can be projected into the Semantic Space by combining the semantic representations of the terms found in the document. This document provides instructions on how to create these document representations.

Basic Steps

Projecting a document's contents into a SemanticSpace takes three basic steps:

  • Build a SemanticSpace using one of the current algorithms.
  • Load the SemanticSpace into a DocumentVectorBuilder.
  • Provide the DocumentVectorBuilder with documents which it should represent based on the SemanticSpace.

Building a SemanticSpace is covered in other documents; good introductions are [Latent Semantic Analysis](/fozziethebeat/S-Space/wiki/LatentSemanticAnalysis) and [Random Indexing](/fozziethebeat/S-Space/wiki/RandomIndexing).

Using a DocumentVectorBuilder

The DocumentVectorBuilder tokenizes a document and requests the semantic representation of each of the document's terms from a SemanticSpace. These term vectors are then combined, based on their usage in the document, to form a representation of the document. This document representation has the same dimensionality as the term vectors themselves, and can be viewed as a projection of the document into the SemanticSpace.
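The projection described above can be illustrated with a minimal, library-free sketch. It is not the S-Space implementation: the class `DocumentProjectionSketch`, the `project` method, the plain `Map` standing in for a SemanticSpace, the whitespace tokenization, and the choice to skip terms that have no vector are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the projection step; not the S-Space implementation.
class DocumentProjectionSketch {

    // Sums the vectors of every token that has an entry in termVectors.
    // Tokens without a vector are skipped (an assumption of this sketch).
    static double[] project(String document,
                            Map<String, double[]> termVectors,
                            int numDims) {
        double[] docVector = new double[numDims];
        // Naive tokenization: lower-case the text and split on whitespace.
        for (String token : document.toLowerCase().split("\\s+")) {
            double[] termVector = termVectors.get(token);
            if (termVector == null)
                continue;
            for (int i = 0; i < numDims; i++)
                docVector[i] += termVector[i];
        }
        return docVector;
    }

    public static void main(String[] args) {
        // A toy three-dimensional "semantic space" with two known terms.
        Map<String, double[]> termVectors = new HashMap<>();
        termVectors.put("cat", new double[] {1.0, 0.0, 2.0});
        termVectors.put("dog", new double[] {0.0, 1.0, 1.0});

        double[] doc = project("The cat chased the dog", termVectors, 3);
        System.out.println(Arrays.toString(doc)); // prints [1.0, 1.0, 3.0]
    }
}
```

The resulting document vector has the same number of dimensions as the term vectors, which is what allows it to be compared with other vectors from the same space.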

A DocumentVectorBuilder is created simply by providing a pre-built SemanticSpace, along with options specifying how term vectors should be combined. Typically, the SemanticSpace passed into a DocumentVectorBuilder is one which has been serialized to disk, and then loaded back from disk as a StaticSemanticSpace.

The DocumentVectorBuilder provides two options for combining term vectors:

  • Weight each term vector by the term's frequency in the given document.
  • Apply no weighting, so that each term vector is used only once if the term appears in the document.

Currently, the DocumentVectorBuilder combines term vectors simply by summing them and returning the final sum. In the future, more advanced combination methods may be added, such as a circular convolution of term vectors.
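To make the difference between the two weighting options concrete, here is a small sketch under stated assumptions. The class `WeightingSketch`, its `combine` method, and the `useTermFrequency` flag are hypothetical names chosen for this example; they are not the DocumentVectorBuilder API, and the way the library is actually configured is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the two combination modes; not the S-Space API.
class WeightingSketch {

    static double[] combine(List<String> tokens,
                            Map<String, double[]> termVectors,
                            int numDims,
                            boolean useTermFrequency) {
        double[] docVector = new double[numDims];
        if (useTermFrequency) {
            // Every occurrence contributes, so a term seen k times is
            // effectively weighted by k.
            for (String token : tokens)
                addInto(docVector, termVectors.get(token));
        } else {
            // Each distinct term contributes exactly once.
            Set<String> distinct = new LinkedHashSet<String>(tokens);
            for (String token : distinct)
                addInto(docVector, termVectors.get(token));
        }
        return docVector;
    }

    // Adds termVector into target; terms without a vector are ignored.
    private static void addInto(double[] target, double[] termVector) {
        if (termVector == null)
            return;
        for (int i = 0; i < target.length; i++)
            target[i] += termVector[i];
    }

    public static void main(String[] args) {
        Map<String, double[]> termVectors = new HashMap<>();
        termVectors.put("cat", new double[] {1.0, 0.0});
        termVectors.put("dog", new double[] {0.0, 1.0});
        List<String> tokens = Arrays.asList("cat", "cat", "dog");

        // prints [2.0, 1.0]
        System.out.println(Arrays.toString(combine(tokens, termVectors, 2, true)));
        // prints [1.0, 1.0]
        System.out.println(Arrays.toString(combine(tokens, termVectors, 2, false)));
    }
}
```

With frequency weighting, repeated terms pull the document vector further toward their own direction; without it, the document is represented only by which terms occur.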

Using the Document Representation

Once a set of document representations has been generated, they can be used to determine the similarity between any two document representations that were generated by the same DocumentVectorBuilder. Since these representations have the same type as term representations, the existing methods for computing Semantic Similarity can be used on them as well.
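As one example of such a comparison, cosine similarity between two document vectors can be sketched in plain Java. This is a standalone helper written for illustration, not the library's own similarity code; the class name `CosineSketch` and the decision to return 0 when either vector is all zeros are assumptions of this sketch.

```java
// Hypothetical standalone helper; not the library's similarity implementation.
class CosineSketch {

    // Returns the cosine of the angle between a and b, or 0.0 if either
    // vector has zero magnitude (an assumption made to avoid dividing by zero).
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException(
                "vectors must have the same length");
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0)
            return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc1 = {1.0, 1.0, 3.0};
        double[] doc2 = {2.0, 0.0, 1.0};
        System.out.println("similarity = " + cosine(doc1, doc2));
    }
}
```

Because cosine similarity depends only on direction, a long document and a short one with the same mix of terms score as highly similar even though their summed vectors differ in magnitude.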

A Sample Use Case

The following sample main method shows how one might read a serialized semantic space from a file in order to create a DocumentVectorBuilder. The program takes the semantic space file as its first command-line argument and two documents as its second and third arguments, processes both documents, and prints their semantic similarity.

import edu.ucla.sspace.common.DocumentVectorBuilder;
import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.common.SemanticSpaceIO;
import edu.ucla.sspace.common.Similarity;

import edu.ucla.sspace.vector.DenseVector;
import edu.ucla.sspace.vector.DoubleVector;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;


public class SampleMain {
    public static void main(String[] args) throws IOException {
        SemanticSpace sspace = SemanticSpaceIO.load(args[0]);

        // See how many dimensions are present in the SemanticSpace so that
        // we can initialize the document vectors correctly.
        int numDims = sspace.getVectorLength();

        // Create the DocumentVectorBuilder which will use the
        // term-to-vector mapping in the SemanticSpace to construct
        // document-level representations
        DocumentVectorBuilder builder = new DocumentVectorBuilder(sspace);

        // Create the first document's vector.  We create this manually (as
        // the caller) because we should have some sense of how big the
        // document is and whether it makes sense to have the document be
        // represented as a sparse or full vector.
        DoubleVector documentVector = new DenseVector(numDims);
        
        // Read in the first document and create a representation of it by
        // summing the vectors for all of its tokens                        
        BufferedReader br = new BufferedReader(new FileReader(args[1]));
        builder.buildVector(br, documentVector);
        br.close();

        // Read in a second document and create a representation of it by
        // summing the vectors for all of its tokens
        br = new BufferedReader(new FileReader(args[2]));
        DoubleVector documentVector2 = new DenseVector(numDims);
        builder.buildVector(br, documentVector2);
        br.close();

        // Compare the two document vectors using cosine similarity
        double sim =
            Similarity.cosineSimilarity(documentVector, documentVector2);
        System.out.printf("The similarity of %s and %s is %f%n",
                          args[1], args[2], sim);

    }
}