Add content to documentation homepage.
1 parent 50ddd3e commit 9b7a765
Showing 1 changed file with 33 additions and 1 deletion.
@@ -1,3 +1,35 @@
# LSH.jl

Documentation for the LSH.jl package.
LSH.jl is a Julia package for performing [locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) with various similarity functions.

## Introduction
One of the simplest methods for classifying, categorizing, and grouping data is to measure how similar pairs of data points are. For instance, the classical [``k``-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) takes a similarity function

```math
s:X\times X\to\mathbb{R}
```

and a query point ``x\in X``, where ``X`` is the input space. It then computes ``s(x,y)`` for every point ``y`` in a database, and keeps the ``k`` points that are most similar to ``x``.
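As a rough illustration of the brute-force procedure described above, the following sketch scores every database point against the query and keeps the top ``k``. The names `cosine_similarity` and `brute_force_knn` and the toy data are assumptions made for this example; they are not part of LSH.jl.

```julia
using LinearAlgebra

# Cosine similarity, used here only as one example choice of s(x, y).
cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))

# Hypothetical helper: indices of the k database points most similar to x.
# It evaluates s(x, y) for every y, so the cost is linear in the database size.
function brute_force_knn(s, x, database, k)
    scores = [s(x, y) for y in database]
    k = min(k, length(scores))
    # rev=true orders by decreasing similarity; only the top k are sorted.
    return collect(partialsortperm(scores, 1:k; rev=true))
end

database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
nearest = brute_force_knn(cosine_similarity, query, database, 2)
```

If ``s`` is a distance rather than a similarity, the same sketch applies with `rev=false`, since smaller values then mean closer points.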

Broadly, there are two computational issues with this approach:

- First, the database may be massive, much larger than could possibly fit in memory. This would make the brute-force approach of computing ``s(x,y)`` for every point ``y`` in the database far too expensive to be practical.
- Second, the data may be so high-dimensional that computing ``s(x,y)`` is itself expensive. The similarity function may also be intrinsically difficult to compute. For instance, calculating Wasserstein distance entails solving a very high-dimensional linear program.

To address these problems, researchers have developed a variety of techniques to accelerate similarity search:

- [``k``-d trees](https://en.wikipedia.org/wiki/K-d_tree)
- [Ball trees](https://en.wikipedia.org/wiki/Ball_tree)
- Data reduction techniques

## Locality-sensitive hashing
*Locality-sensitive hashing* (LSH) is a technique for accelerating similarity search. It applies a hash function to the query point ``x`` and limits the search to those points in the database that collide with ``x`` under that hash. The hash functions are randomly drawn from a family of *locality-sensitive hash functions*, which have the property that ``Pr[h(x) = h(y)]`` (i.e., the probability of a hash collision) increases as ``x`` and ``y`` become more similar.
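To make the bucketing idea concrete, here is a minimal sketch of one well-known locality-sensitive family for cosine similarity: sign random projections (often called SimHash). It is written in plain Julia and does not use or reflect LSH.jl's API; the names `random_planes`, `sign_hash`, `build_table`, and `candidates` are hypothetical. For a single random hyperplane, the collision probability is ``1 - \theta(x,y)/\pi``, where ``\theta(x,y)`` is the angle between ``x`` and ``y``, so it grows as the cosine similarity grows.

```julia
using LinearAlgebra
using Random

# Draw n_bits random hyperplane normals in dim dimensions (one per row).
random_planes(rng, n_bits, dim) = randn(rng, n_bits, dim)

# Hash x to the sign pattern of its projections onto the hyperplane normals.
sign_hash(planes, x) = (planes * x) .>= 0

# Group database indices into buckets keyed by hash value.
function build_table(planes, database)
    table = Dict{BitVector, Vector{Int}}()
    for (i, y) in enumerate(database)
        push!(get!(table, sign_hash(planes, y), Int[]), i)
    end
    return table
end

# Candidate neighbors of x: only the points whose hash collides with x's.
candidates(table, planes, x) = get(table, sign_hash(planes, x), Int[])

rng = MersenneTwister(1)
planes = random_planes(rng, 8, 3)
database = [randn(rng, 3) for _ in 1:1000]
table = build_table(planes, database)

# A positive rescaling of a database point has cosine similarity 1 with it,
# so it lands in the same bucket as that point.
query = 2.0 .* database[1]
found = candidates(table, planes, query)
```

The candidates would typically then be ranked with the exact similarity function, and several independent tables are commonly combined to lower the chance of missing a true neighbor.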

LSH.jl is a package that provides definitions of locality-sensitive hash functions for a variety of different similarities. Currently, LSH.jl supports hash functions for

- Cosine similarity (`cossim`)
- Jaccard similarity (`jaccard`)
- ``L^1`` (Manhattan / "taxicab") distance (`ℓ1`)
- ``L^2`` (Euclidean) distance (`ℓ2`)
- Inner product (`inner_prod`)
- Function-space hashes (`L1`, `L2`, and `cossim`)
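
For reference, the vector and set similarities in this list can be written in a few lines of plain Julia. These are textbook definitions under hypothetical names (`cosine_sim`, `jaccard_sim`, `l1_dist`, `l2_dist`, `inner_product`), given only to fix notation; they are not LSH.jl's own implementations, and the function-space variants are omitted.

```julia
using LinearAlgebra

# Cosine of the angle between x and y.
cosine_sim(x, y) = dot(x, y) / (norm(x) * norm(y))

# Size of the intersection over size of the union.
# Evaluates to NaN when both sets are empty.
jaccard_sim(A::Set, B::Set) = length(intersect(A, B)) / length(union(A, B))

# Manhattan and Euclidean distances between vectors.
l1_dist(x, y) = norm(x - y, 1)
l2_dist(x, y) = norm(x - y, 2)

# Plain inner product; unlike cosine similarity it is not scale-invariant.
inner_product(x, y) = dot(x, y)

u, v = [1.0, 2.0], [4.0, 6.0]
d1 = l1_dist(u, v)
d2 = l2_dist(u, v)
ip = inner_product(u, v)
jac = jaccard_sim(Set([1, 2, 3]), Set([2, 3, 4]))
```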