Merge pull request #1153 from jacobgolding/master
Landmarked Re-Trainable Parametric UMAP
lmcinnes authored Oct 19, 2024
2 parents c9dcc15 + 80d4bae commit f123b91
Showing 39 changed files with 1,321 additions and 280 deletions.
8 changes: 7 additions & 1 deletion doc/api.rst
@@ -1,14 +1,20 @@
UMAP API Guide
==============

UMAP has only a single class :class:`UMAP`.
UMAP has two classes: :class:`UMAP` and :class:`ParametricUMAP`, which inherits from it.

UMAP
----

.. autoclass:: umap.umap_.UMAP
:members:

ParametricUMAP
--------------

.. autoclass:: umap.parametric_umap.ParametricUMAP
:members:

A number of internal functions can also be accessed separately for more fine-tuned work.

Useful Functions
4 changes: 2 additions & 2 deletions doc/conf.py
@@ -20,8 +20,8 @@
import os
import sys

sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0, os.path.abspath('..'))
sys.path.insert(0, os.path.abspath("."))
sys.path.insert(0, os.path.abspath(".."))


# -- General configuration ------------------------------------------------
Binary file added doc/images/retrain_pumap_emb_x1.png
Binary file added doc/images/retrain_pumap_emb_x2.png
Binary file added doc/images/retrain_pumap_history.png
Binary file added doc/images/retrain_pumap_p_emb_x1.png
Binary file added doc/images/retrain_pumap_p_emb_x2.png
Binary file added doc/images/retrain_pumap_summary_2_removed.png
1 change: 1 addition & 0 deletions doc/index.rst
@@ -61,6 +61,7 @@ PyPI install, presuming you have numba and sklearn and all its requirements
transform
inverse_transform
parametric_umap
transform_landmarked_pumap
sparse
supervised
clustering
6 changes: 5 additions & 1 deletion doc/parametric_umap.rst
@@ -91,7 +91,7 @@ This loads both the UMAP object and the parametric networks it contains.

Plotting loss
-------------
Parametric UMAP monitors loss during training using Keras. That loss will be printed after each epoch during training. This loss is saved in :python:`embedder.history`, and can be plotted:
Parametric UMAP monitors loss during training using Keras. That loss will be printed after each epoch during training. This loss is saved in :python:`embedder._history`, and can be plotted:

.. code:: python3
@@ -103,6 +103,8 @@ Parametric UMAP monitors loss during training using Keras. That loss will be pri
.. image:: images/umap-loss.png

Much like other Keras models, if you continue training via the model's :python:`fit` method, :python:`embedder._history` will be updated with the losses from the additional training epochs.
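
For example, a minimal sketch of continued training (assuming ``embedder`` is a fitted :python:`ParametricUMAP` and ``X`` is a stand-in for your training data):

.. code:: python3

    import matplotlib.pyplot as plt

    # Continue training; new epoch losses are appended to the history.
    embedder.fit(X)

    plt.plot(embedder._history['loss'])
    plt.xlabel('Epoch')
    plt.ylabel('Loss')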

Parametric inverse_transform (reconstruction)
---------------------------------------------
To use a second neural network to learn an inverse mapping between data and embeddings, we simply need to pass ``parametric_reconstruction=True`` to ParametricUMAP.
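
A minimal sketch (``X`` here is a stand-in for your training data):

.. code:: python3

    from umap.parametric_umap import ParametricUMAP

    embedder = ParametricUMAP(parametric_reconstruction=True)
    embedding = embedder.fit_transform(X)

    # The reconstruction network maps embeddings back to data space.
    reconstruction = embedder.inverse_transform(embedding)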
@@ -205,6 +207,8 @@ Additional important parameters
* **optimizer:** The optimizer used to train the neural network. By default, Adam (:python:`tf.keras.optimizers.Adam(1e-3)`) is used. You might be able to speed up or improve training by using a different optimizer.
* **parametric_embedding:** If set to False, a non-parametric embedding is learned using the same code as the parametric embedding, which allows a direct comparison between parametric and non-parametric embeddings under the same optimizer. The parametric embeddings are performed over the entire dataset simultaneously.
* **global_correlation_loss_weight:** Whether to additionally train on the correlation of global pairwise relationships (multidimensional scaling).
* **landmark_loss_fn:** The loss function to use when re-training on landmarked data, where you have provided a desired location in the embedding space to the :python:`fit` method of the model. By default, Euclidean loss is used. For more information on re-training, landmarks, and why you might use them, see :doc:`transform_landmarked_pumap`.
* **landmark_loss_weight:** How strongly to weight the landmark loss relative to the UMAP loss; 1.0 by default. A sketch of setting these options follows below.
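
As a sketch of how these parameters are set (names as documented above; the values are illustrative only):

.. code:: python3

    import tensorflow as tf
    from umap.parametric_umap import ParametricUMAP

    embedder = ParametricUMAP(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # smaller learning rate than the 1e-3 default
        global_correlation_loss_weight=0.1,        # also train on global pairwise structure
        landmark_loss_weight=1.0,                  # the default weighting
    )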

Extending the model
-------------------
5 changes: 5 additions & 0 deletions doc/transform.rst
@@ -13,6 +13,11 @@ the latent space the classifier uses. Fortunately UMAP makes this
possible, albeit more slowly than some other transformers that allow
this.

This tutorial will step through a simple case where we expect the overall
distribution in our higher-dimensional vectors to be consistent between the
training and testing data. For more detail on how this can go wrong, and
how we can fix it using Parametric UMAP, see :doc:`transform_landmarked_pumap`.

To demonstrate this functionality we'll make use of
`scikit-learn <http://scikit-learn.org/stable/index.html>`__ and the
digits dataset contained therein (see :doc:`basic_usage` for an example
227 changes: 227 additions & 0 deletions doc/transform_landmarked_pumap.rst
@@ -0,0 +1,227 @@

Transforming New Data with Parametric UMAP
==========================================

There are many cases where one may want to take an existing UMAP model and use it to embed new data into the learned space. For a simple example where the overall distribution of the higher-dimensional training data matches that of the new data being embedded, see :doc:`transform`. We can't always be sure that this will be the case, however. To simulate a case where we have novel behaviour that we want to include in our embedding space, we will use the MNIST digits dataset (see :doc:`basic_usage` for a basic example).

To follow along with this example, see the MNIST_Landmarks notebook in the `GitHub repository <https://github.com/lmcinnes/umap/tree/master/notebooks/>`_.

.. code:: python3

    import keras
    from sklearn.model_selection import train_test_split
    from umap import UMAP, ParametricUMAP
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

We'll start by loading the dataset and splitting it into two equal parts with ``sklearn``'s ``train_test_split`` function. This gives us two partitions to work with: one to train our original embedding and another to test it. To simulate new behaviour appearing in our data, we remove one of the MNIST categories ``N`` from the ``x1`` partition. In this case we'll use ``N=2``, so our model will be trained on all of the digits other than 2.

.. code:: python3

    (X, y), (_, _) = keras.datasets.mnist.load_data()
    x1, x2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=42)

    # Reshape to 1D vectors
    x1 = x1.reshape((x1.shape[0], 28*28))
    x2 = x2.reshape((x2.shape[0], 28*28))

    # Remove one category from the train dataset.
    # In the case of MNIST digits, this will be the digit we are removing.
    N = 2
    x1 = x1[y1 != N]
    y1 = y1[y1 != N]

    print(x1.shape, x2.shape)

.. parsed-literal::

    (26995, 784) (30000, 784)

New data with UMAP
------------------

To start with, we'll identify the issues with using UMAP as-is in this case, and then we'll see how to fix them with Parametric UMAP. First off, we need to train a ``UMAP`` model on our ``x1`` partition:

.. code:: python3

    embedder = UMAP()
    emb_x1 = embedder.fit_transform(x1)

Visualising our results:

.. code:: python3

    plt.scatter(emb_x1[:,0], emb_x1[:,1], c=y1, cmap='Spectral', s=2, alpha=0.2)

.. image:: images/retrain_pumap_emb_x1.png


This is a clean and successful embedding, as we would expect from UMAP on this relatively simple example. We see the normal structure one would expect from embedding MNIST, but without any of the 2s. The ``UMAP`` class is built to be compatible with ``scikit-learn``, so embedding new data is as simple as passing it to the ``transform`` method. We'll pass through ``x2``, which contains unseen examples of the original classes, as well as samples from our holdout class ``N`` (the 2s).

To make samples from ``N`` stand out more, we'll over-plot them in black.

.. code:: python3

    emb_x2 = embedder.transform(x2)

.. code:: python3

    plt.scatter(emb_x2[:,0], emb_x2[:,1], c=y2, cmap='Spectral', s=2, alpha=0.2)
    plt.scatter(emb_x2[y2==N][:,0], emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5)

.. image:: images/retrain_pumap_emb_x2.png

While our ``UMAP`` embedder has correctly handled the classes present in ``x1``, it has treated examples from our holdout class ``N`` poorly. Many of these points are concentrated on top of existing classes, with some spread out between them. This inability to generalize is not unique to UMAP; it is a difficulty with learned embeddings more generally. It may or may not be an issue, depending on your use case.

New data with Parametric UMAP
-----------------------------

We can improve this outcome with Parametric UMAP. Parametric UMAP differs from UMAP in that it learns the relationship between the data and embedding with a neural network, instead of learning embeddings directly. This means we can incorporate new data by continuing to train the neural network, updating the weights to incorporate our new information.

.. image:: images/pumap-only.png

For more complete information on Parametric UMAP and the many options it provides, see :doc:`parametric_umap`.

We will start addressing this by training a ``ParametricUMAP`` embedding model and running the same experiment:

.. code:: python3

    p_embedder = ParametricUMAP()
    p_emb_x1 = p_embedder.fit_transform(x1)

.. code:: python3

    plt.scatter(p_emb_x1[:,0], p_emb_x1[:,1], c=y1, cmap='Spectral', s=2, alpha=0.2)

.. image:: images/retrain_pumap_p_emb_x1.png

Again, we get good results on our initial embedding of ``x1``. If we pass ``x2`` through without re-training, we get a similar problem to our ``UMAP`` model:

.. code:: python3

    p_emb_x2 = p_embedder.transform(x2)

.. code:: python3

    plt.scatter(p_emb_x2[:,0], p_emb_x2[:,1], c=y2, cmap='Spectral', s=2, alpha=0.2)
    plt.scatter(p_emb_x2[y2==N][:,0], p_emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5)

.. image:: images/retrain_pumap_p_emb_x2.png

Re-training Parametric UMAP with landmarks
------------------------------------------

To update our embedding to include the new class, we'll fine-tune our existing ``ParametricUMAP`` model. Doing this without any other changes will start from where we left off, but our embedding space's structure may drift and change. This is because the UMAP loss function is invariant to translation and rotation, as it is only concerned with the relative positions and distances between points.

In order to keep our embedding space more consistent, we'll use the landmarks option of ``ParametricUMAP``. We retrain the model on the ``x2`` partition, along with some points chosen as landmarks from ``x1``. We'll include 1% of the samples in ``x1``, along with their current positions in the embedding space, which are used by the landmark loss function.

The default ``landmark_loss_fn`` is the Euclidean distance between a point's original embedding position and its current one. The only change we'll make is to set ``landmark_loss_weight=0.01``.

.. code:: python3

    # Select landmark indices from x1 (1% of the samples).
    landmark_idx = list(
        np.random.choice(range(x1.shape[0]), int(x1.shape[0] / 100), replace=False)
    )

    # Add the landmark points to x2 for training.
    x2_lmk = np.concatenate((x2, x1[landmark_idx]))
    y2_lmk = np.concatenate((y2, y1[landmark_idx]))

    # Make our landmark positions vector, which is nan wherever
    # we have no landmark information.
    landmarks = np.stack(
        [np.array([np.nan, np.nan])] * x2.shape[0]
        + list(p_embedder.transform(x1[landmark_idx]))
    )

    # Set the landmark loss weight and continue training our Parametric UMAP model.
    p_embedder.landmark_loss_weight = 0.01
    p_embedder.fit(x2_lmk, landmark_positions=landmarks)
    p_emb2_x2 = p_embedder.transform(x2)

    # Check how x1 looks when embedded in the space retrained on x2 and landmarks.
    p_emb2_x1 = p_embedder.transform(x1)

Plotting all of the different embeddings to compare them:

.. code:: python3

    fig, axs = plt.subplots(3, 2, figsize=(16, 24), sharex=True, sharey=True)

    axs[0,0].scatter(
        emb_x1[:, 0], emb_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[0,0].set_ylabel('UMAP Embedding', fontsize=20)

    axs[0,1].scatter(
        emb_x2[:, 0], emb_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[0,1].scatter(
        emb_x2[y2==N][:,0], emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5,
    )

    axs[1,0].scatter(
        p_emb_x1[:, 0], p_emb_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[1,0].set_ylabel('Initial P-UMAP Embedding', fontsize=20)

    axs[1,1].scatter(
        p_emb_x2[:, 0], p_emb_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[1,1].scatter(
        p_emb_x2[y2==N][:,0], p_emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5,
    )

    axs[2,0].scatter(
        p_emb2_x1[:, 0], p_emb2_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[2,0].set_ylabel('Updated P-UMAP Embedding', fontsize=20)
    axs[2,0].set_xlabel(f'x1, No {N}s', fontsize=20)

    axs[2,1].scatter(
        p_emb2_x2[:, 0], p_emb2_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
    )
    axs[2,1].scatter(
        p_emb2_x2[y2==N][:,0], p_emb2_x2[y2==N][:,1], c='k', s=2, alpha=0.5,
    )
    axs[2,1].set_xlabel('x2, All Classes', fontsize=20)

    plt.tight_layout()

.. image:: images/retrain_pumap_summary_2_removed.png

Here we see that our approach has been successful: the embedding space has been kept consistent, and we now have a clear cluster for our new class, the 2s. This new cluster appears in a sensible part of the embedding space, and the rest of the structure is preserved.

It is worth double-checking here that the landmark loss is not too constraining; we still want a good UMAP structure.
To do so, we can interrogate the history of our embedder, which is retained across our re-training steps.

.. code:: python3

    plt.plot(p_embedder._history['loss'])
    plt.ylabel('Loss')
    plt.xlabel('Epoch')

.. image:: images/retrain_pumap_history.png

We can identify the spike in loss where we introduce ``x2``, and confirm that the resulting loss is comparable to the loss from our initial training on ``x1``. This tells us the model is not having to compromise much between the UMAP loss and the landmark loss. If it were, we could lower the ``landmark_loss_weight`` attribute of our embedder. There is a tradeoff between keeping the space consistent and minimizing the UMAP loss, but the key is smooth variation in the embedding space, which makes downstream tasks easier to adjust. In this case, we could probably stand to increase ``landmark_loss_weight`` to keep the space more consistent.
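
For instance, to re-run the fine-tuning step with a stronger landmark constraint (a sketch reusing ``x2_lmk`` and ``landmarks`` from above; 0.1 is an arbitrary illustrative value):

.. code:: python3

    # Re-run fine-tuning with a heavier landmark penalty.
    p_embedder.landmark_loss_weight = 0.1
    p_embedder.fit(x2_lmk, landmark_positions=landmarks)
    p_emb3_x2 = p_embedder.transform(x2)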

In addition to ``landmark_loss_weight``, there are a number of other options available for getting better results on this or other examples:

- Continuing the training with a larger portion of points from the original data, in our case ``x1``. Not all of these points need to be landmarked, but they can contribute to a consistent graph structure in higher dimensions.
- Changing the ``landmark_loss_fn``. For example, if we want to allow points to move when they must, we could truncate the default Euclidean loss, letting the metaphorical rubber band snap beyond a certain distance and prioritising a good UMAP structure once sticking to the landmark position turns out to be wrong (see the sketch after this list).
- Being more intelligent with our selection of landmark points, for example using submodular optimization with a package like `apricot-select <https://apricot-select.readthedocs.io/en/latest/>`__, or choosing points from different parts of a hierarchical clustering like `HDBSCAN <https://hdbscan.readthedocs.io/en/latest/index.html>`__.
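
As an illustration of the second idea, a hypothetical truncated landmark loss might look like the sketch below. This assumes the loss follows the Keras ``fn(y_true, y_pred)`` convention, mapping landmark positions and current embeddings to per-sample losses; check the ``ParametricUMAP`` source for the exact signature expected of ``landmark_loss_fn``.

.. code:: python3

    import tensorflow as tf

    def truncated_euclidean_loss(y_true, y_pred, threshold=2.0):
        # Euclidean distance to the landmark position, capped at `threshold`,
        # so points far from their landmark stop being pulled back towards it.
        distance = tf.norm(y_true - y_pred, axis=-1)
        return tf.minimum(distance, threshold)

    p_embedder.landmark_loss_fn = truncated_euclidean_loss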

6 changes: 3 additions & 3 deletions examples/mnist_torus_sphere_example.py
@@ -50,7 +50,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):
for i in range(x.shape[0]):
a = abs(x[i] - y[i])
if 2 * a < torus_dimensions[i]:
distance_sqr += a ** 2
distance_sqr += a**2
g[i] = x[i] - y[i]
else:
distance_sqr += (torus_dimensions[i] - a) ** 2
@@ -74,7 +74,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):
# Plot a torus
R = 2
r = 1
values = (R - np.sqrt(x ** 2 + y ** 2)) ** 2 + z ** 2 - r ** 2
values = (R - np.sqrt(x**2 + y**2)) ** 2 + z**2 - r**2
mlab.contour3d(x, y, z, values, color=(1.0, 1.0, 1.0), contours=[0])

# torus angles -> 3D
@@ -105,7 +105,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):

# Plot a sphere
r = 3
values = x ** 2 + y ** 2 + z ** 2 - r ** 2
values = x**2 + y**2 + z**2 - r**2
mlab.contour3d(x, y, z, values, color=(1.0, 1.0, 1.0), contours=[0])

# latitude, longitude -> 3D
1 change: 1 addition & 0 deletions examples/plot_algorithm_comparison.py
@@ -43,6 +43,7 @@
the equator and black to white from the south
to north pole.
"""

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
1 change: 1 addition & 0 deletions examples/plot_fashion-mnist_example.py
@@ -11,6 +11,7 @@
(as shown in this example), or by continuous variables,
or by density (as is common in datashader examples).
"""

import umap
import numpy as np
import pandas as pd
5 changes: 3 additions & 2 deletions examples/plot_feature_extraction_classification.py
@@ -20,6 +20,7 @@
used as a feature extraction technique. This small change results in a
substantial improvement compared to the model where raw data is used.
"""

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
@@ -45,7 +46,7 @@

# Classification with a linear SVM
svc = LinearSVC(dual=False, random_state=123)
params_grid = {"C": [10 ** k for k in range(-3, 4)]}
params_grid = {"C": [10**k for k in range(-3, 4)]}
clf = GridSearchCV(svc, params_grid)
clf.fit(X_train, y_train)
print(
@@ -58,7 +59,7 @@
params_grid_pipeline = {
"umap__n_neighbors": [5, 20],
"umap__n_components": [15, 25, 50],
"svc__C": [10 ** k for k in range(-3, 4)],
"svc__C": [10**k for k in range(-3, 4)],
}


1 change: 1 addition & 0 deletions examples/plot_mnist_example.py
@@ -13,6 +13,7 @@
0, and grouping triplets of 3,5,8 and 4,7,9 which can
blend into one another in some cases.
"""

import umap
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
