From 9b7a7656ae1ef0271841475cbe0d7bafb212ff3d Mon Sep 17 00:00:00 2001
From: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Thu, 16 Jan 2020 20:24:49 -0700
Subject: [PATCH] Add content to documentation homepage.

---
 docs/src/index.md | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 5214b2a..c7ecbdb 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -1,3 +1,35 @@
 # LSH.jl
 
-Documentation for the LSH.jl package.
+LSH.jl is a Julia package for performing [locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) with various similarity functions.
+
+## Introduction
+One of the simplest methods for classifying, categorizing, and grouping data is to measure how similar pairs of data points are. For instance, the classical [``k``-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) takes a similarity function
+
+```math
+s:X\times X\to\mathbb{R}
+```
+
+and a query point ``x\in X``, where ``X`` is the input space. It then computes ``s(x,y)`` for every point ``y`` in a database and keeps the ``k`` points that are most similar to ``x``.
+
+Broadly, there are two computational issues with this approach:
+
+- First, the database may be massive, far larger than could possibly fit in memory. This makes the brute-force approach of computing ``s(x,y)`` for every point ``y`` in the database far too expensive to be practical.
+- Second, the data may be so high-dimensional that computing ``s(x,y)`` is itself expensive. The similarity function may also be intrinsically difficult to compute; for instance, calculating Wasserstein distance entails solving a very high-dimensional linear program.
+
+To address these problems, researchers have developed a variety of techniques to accelerate similarity search:
+
+- [``k``-d trees](https://en.wikipedia.org/wiki/K-d_tree)
+- [Ball trees](https://en.wikipedia.org/wiki/Ball_tree)
+- Data reduction techniques
+
+## Locality-sensitive hashing
+*Locality-sensitive hashing* (LSH) is a technique for accelerating similarity search. It hashes the query point ``x`` and restricts the search to only those points in the database that experience a hash collision with ``x``. The hash functions are drawn at random from a family of *locality-sensitive hash functions*, which have the property that ``Pr[h(x) = h(y)]`` (i.e., the probability of a hash collision) increases the more similar ``x`` and ``y`` are.
+
+LSH.jl provides locality-sensitive hash functions for a variety of similarities. Currently, LSH.jl supports hash functions for
+
+- Cosine similarity (`cossim`)
+- Jaccard similarity (`jaccard`)
+- ``L^1`` (Manhattan / "taxicab") distance (`ℓ1`)
+- ``L^2`` (Euclidean) distance (`ℓ2`)
+- Inner product (`inner_prod`)
+- Function-space hashes (`L1`, `L2`, and `cossim`)
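+
+To make the collision property above concrete, here is a minimal, self-contained sketch of random-hyperplane ("SimHash"-style) hashing for cosine similarity. It is written against plain Julia rather than LSH.jl's own API, and the names `simhash` and `collision_rate` are purely illustrative:
+
+```julia
+using LinearAlgebra
+
+# Hash x with a single random hyperplane r: the hash is the sign of ⟨r, x⟩.
+# For one hyperplane, Pr[h(x) == h(y)] = 1 - θ(x, y)/π, where θ(x, y) is the
+# angle between x and y, so collisions become more likely as the cosine
+# similarity between x and y increases.
+simhash(r::AbstractVector, x::AbstractVector) = dot(r, x) ≥ 0
+
+# Empirically estimate the collision probability of x and y over many
+# independently drawn random hyperplanes.
+function collision_rate(x, y; n_hashes = 10_000)
+    hyperplanes = [randn(length(x)) for _ in 1:n_hashes]
+    count(simhash(r, x) == simhash(r, y) for r in hyperplanes) / n_hashes
+end
+
+x = randn(100)
+y = x .+ 0.1 .* randn(100)   # similar to x, so the collision rate should be near 1
+z = randn(100)               # unrelated to x, so the collision rate should be ≈ 0.5
+
+collision_rate(x, y), collision_rate(x, z)
+```
+
+In a full LSH scheme, the hash functions are generated once, every database point is placed into a bucket keyed by its hashes, and a query is only compared against the points that share one of its buckets.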