
Archipelago - dict transformer for vectorizing persistence diagrams #1017

Closed
wants to merge 29 commits

29 commits:
de408e9
add archipelago class
Nov 28, 2023
927c789
give default quantiser sklearn.KMeans to Atol method
Dec 19, 2023
3e95419
homology_dimensions and settlers for Archipelago class
Dec 19, 2023
3bc232a
n_init parameter for sklearn KMeans
Dec 20, 2023
431a692
archipelago island
Dec 20, 2023
aaef59f
mature version compatible with gudhi.representations.vector_methods
Jan 4, 2024
adc8668
Atol try/catch
Jan 4, 2024
4685748
fix docstrings
Jan 4, 2024
2cf60e4
typo
Jan 4, 2024
6524320
Merge branch 'GUDHI:master' into archipelago
martinroyer Jan 4, 2024
5561d09
docstring correct
Jan 4, 2024
5367ea1
refactor removing input preprocessing, instead we take raw dgm format…
Jan 4, 2024
95bd156
prints
Jan 4, 2024
b4de687
Revert try/catch optimizer fit in Atol
martinroyer Jan 5, 2024
89be488
fix set_output from sklearn so as to return pandas without importing …
Jan 5, 2024
f5dc92d
default KMeans parameter
Jan 5, 2024
6bfb164
change confusing Atol __call__ function
Jan 5, 2024
23a5e47
define get_feature_names_out for Atol
Jan 5, 2024
a59af1b
test fixes
Jan 5, 2024
9fadd61
hopefully fix atol test following `n_init="auto"` in KMeans
Jan 5, 2024
0b1c6b9
revert value changes to doc
Jan 8, 2024
aa0d3cb
updated docstring
Jan 8, 2024
4afb0ef
tentative change n_init value for test compatibility 3.7
Jan 12, 2024
2df14d9
remove try except
Jan 12, 2024
f01ac19
call vectorizer get_geature_names_out if exists
Jan 12, 2024
c43b190
more sklearn logic
Jan 12, 2024
b74dcae
atol fixes:
Jun 7, 2024
26eef83
add test for representations interface fit/transform/...
Jun 7, 2024
5d8af99
remove archipelago
Jun 7, 2024
3 changes: 2 additions & 1 deletion src/python/gudhi/representations/__init__.py
@@ -2,5 +2,6 @@
from .metrics import *
from .preprocessing import *
from .vector_methods import *
+from .archipelago import *

-__all__ = ["kernel_methods", "metrics", "preprocessing", "vector_methods"]
+__all__ = ["kernel_methods", "metrics", "preprocessing", "vector_methods", "archipelago"]
127 changes: 127 additions & 0 deletions src/python/gudhi/representations/archipelago.py
@@ -0,0 +1,127 @@
# This file is part of the Gudhi Library - https://gudhi.inria.fr/ - which is released under MIT.
# See file LICENSE or go to https://gudhi.inria.fr/licensing/ for full license details.
# Author(s): Martin Royer

import copy

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

from gudhi.representations.vector_methods import Atol, TopologicalVector, BettiCurve


class Archipelago(BaseEstimator, TransformerMixin):
"""
Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods.
Member:

(this may be more natural with the new format)
Maybe we should say that this class is helpful to avoid going through DimensionSelector+FeatureUnion or similar. It could even be nice to have an example somewhere comparing the two (we can add it later, though).

One provides persistence diagram vectorizers (by way of either `island` or `island_dict`), and the Archipelago object will
fit on and transform (i.e. vectorize) lists or series of persistence diagrams.
The object is sklearn-API consistent.

Parameters:
island: island for populating the archipelago, i.e. the object used to vectorize diagrams in each homology
dimension. Must be `copy.deepcopy`-able. *Will be ignored if island_dict is given*.
island_dict: island dict for populating the archipelago, i.e. a dictionary of objects to vectorize persistence
diagrams keyed by homology dimension.

Examples
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
Member:

I am a bit reluctant to add more code that uses the old format for persistence diagrams, when we are trying to replace it with something more convenient (see #925, https://github.com/GUDHI/gudhi-devel/blob/master/src/python/gudhi/sklearn/cubical_persistence.py or #691). Of course we could change archipelago later to accept the new format, but it may become complicated if we try to be compatible with both.

Collaborator Author:

I'm not sure what the new format is, but it seems it will be better optimized for Archipelago, so all the better. Do you mean you want me to wait for those PRs to land before processing this one?

Member:

I'm not sure what the new format is but it seems it will be better optimized for Archipelago, so all the better.

It is the one we discussed before the holidays. Instead of a list of (dim,(birth,death)), if the user asked for homology in dimensions [i,j,k], they get [dgm_i, dgm_j, dgm_k] where each dgm is a numpy array of shape (*,2).

It is better in that you already have numpy arrays for each dimension. However, because it is a list and not a dict, you do not have access to the labels [i,j,k], the user has to provide them separately if you need them for something. It also makes it ambiguous, if we refer to the 0-th diagram, whether it corresponds to homology in dimension 0, or in dimension i (the first of the list).

Do you mean you want me to wait for those PR to come online before processing this one?

I don't know what I want 😉
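[Editorial note] To make the format discussion concrete, here is a small sketch of the conversion from the old format (a list of `(dim, (birth, death))` pairs) to the new one (one `(n, 2)` numpy array per requested dimension). The helper name `split_by_dimension` is hypothetical, not a GUDHI function:

```python
import numpy as np

def split_by_dimension(pdiagram, homology_dimensions):
    # Old format: list of (dim, (birth, death)) pairs, all dimensions mixed.
    # New format: one numpy array of shape (n, 2) per requested dimension.
    # reshape(-1, 2) turns an empty selection into a (0, 2) array.
    return [
        np.array([pt for (dim, pt) in pdiagram if dim == d]).reshape(-1, 2)
        for d in homology_dimensions
    ]

pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
dgms = split_by_dimension(pdiagram1, homology_dimensions=[0, 1, 2])
# dgms[0] has shape (2, 2); dgms[1] and dgms[2] have shape (1, 2)
```

Note that, as the comment above points out, the list output carries no dimension labels: the caller must remember that `dgms[k]` corresponds to `homology_dimensions[k]`.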

>>> pdiagram2 = [(0, (0.0, 3.34)), (0, (0.0, 2.956)), (1, (0.536, 1.856)), (2, (1.202, 2.734))]
>>> pdiagram3 = [(0, (1.0, 4.34)), (0, (2.0, 3.956)), (1, (1.536, 2.856)), (2, (3.202, 4.734))]
>>> list_pdiags = [pdiagram1, pdiagram2, pdiagram3]
>>> archipelago = Archipelago(island=Atol())
>>> archipelago.fit(X=list_pdiags)
>>> archipelago.transform(X=list_pdiags)
>>> archipelago = Archipelago(island_dict={2: BettiCurve(resolution=4), 0:Atol()})
>>> import pandas as pd
>>> series_pdiags = pd.Series(list_pdiags)
>>> archipelago.set_output(transform="pandas")
>>> archipelago.fit(X=series_pdiags)
>>> archipelago.transform(X=series_pdiags)
Member:

Looking at https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html, using the new format of persistence diagrams, and ignoring pandas for now, is this equivalent to the following?

from gudhi.representations import BettiCurve, Atol, Archipelago
import numpy as np
from sklearn.compose import ColumnTransformer
Xpdiagram1 = [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])]
Xpdiagram2 = [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])]
Xpdiagram3 = [np.array([[1.0, 3.34], [2.0, 2.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
Xlist_pdiags = [Xpdiagram1, Xpdiagram2, Xpdiagram3]
ct=ColumnTransformer([("0",Atol(),0),("1",Atol(),1),("2",Atol(),2)])
print(ct.fit_transform(Xlist_pdiags))
ct=ColumnTransformer([("0",Atol(),0),("2",BettiCurve(resolution=4),2)])
print(ct.fit_transform(Xlist_pdiags))

I had to make a couple of changes in BettiCurve to let this run (I don't know exactly what `not X` is meant to test for)

@@ -381,7 +381,7 @@
         if not self.is_fitted():
             raise NotFittedError("Not fitted.")
 
-        if not X:
+        if X is None or len(X) == 0:
             X = [np.zeros((0, 2))]
         
         N = len(X)
@@ -408,7 +408,7 @@
 
         return np.array(bettis, dtype=int)[:, 0:-1]
 
-    def fit_transform(self, X):
+    def fit_transform(self, X, y=None):
         """
         The result is the same as fit(X) followed by transform(X), but potentially faster.
         """

and I get similar results, but they differ from what Archipelago returns for dimension 0 (maybe it is just related to random generators).

Collaborator Author (@martinroyer, Jan 15, 2024):

Hmm, they differ because you made a mistake in copying; it should be

Xpdiagram3 = [np.array([[1.0, 4.34], [2.0, 3.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]

and then the results are equal indeed.

Also this seems to be working:

ct.set_output(transform="pandas")
Xdf_pdiags = pd.DataFrame(Xlist_pdiags)
print(ct.fit_transform(Xdf_pdiags))

so this is possibly the end for this PR, with the new pdiags format.

Collaborator Author:

So the one thing this PR does better than your proposition is handling dimensions that fail at fit, by not registering the key into the archipelago_ dict.
That seems desirable for now?
This makes me think that a lot of methods from vector_methods are not protected against things like
X is None or len(X) == 0
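[Editorial note] The defensive check from the BettiCurve patch above can be factored into a small guard; a minimal sketch (the helper name `normalize_input` is illustrative, not from GUDHI):

```python
import numpy as np

def normalize_input(X):
    # `if not X:` raises "truth value is ambiguous" when X is a
    # multi-element numpy array; the explicit check below handles
    # None, empty lists, empty Series, and empty arrays uniformly.
    if X is None or len(X) == 0:
        X = [np.zeros((0, 2))]  # one empty diagram, as in the BettiCurve patch
    return X
```

Non-empty input passes through unchanged, so the guard can sit at the top of a `transform` without affecting the normal path.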

Collaborator Author:

After further discussion with Marc, I believe the better philosophy for gudhi is what he suggests:

  • if one asks for vectorizing of some homology dimensions (say 0, 1 and 2) we provide the structure for the dimensions (for instance with ColumnTransformers from this conversation),
  • and because there can be no homology features in some dimensions, if it turns out there is no data to vectorize, we should print a warning (e.g. "no information in dimension d, problems ahead, consider asking for different homology dimensions") but still let it crash, instead of what I was doing, which was removing the problematic dimension.

So we can definitely drop this PR now.

We should (somehow?) salvage the changes to vector_methods: default method for Atol so that it can be initialized with empty arguments, and some tweaks that can be generalized to all vector methods such as implementing get_feature_names_out so that we can use the set_output feature from sklearn.

Member:

Yes, we should still include the other changes.

For reference, the discussion was mostly about how to handle empty diagrams in some dimensions. If the user asks for persistence in dimensions 0, 1 and 2, but the diagrams in dimension 2 are all empty (especially during fit), we have several options:

  1. we can throw an exception (possibly just for the vectorizers that do need fitting, like Atol, or Landscape without an explicit sample_range);
  2. we can (print a warning and) continue; for this dimension, transform will either (preferably) return a harmless constant, or return nonsense (including NaN);
  3. we can (print a warning and) skip that dimension henceforth; transform will act as if the user had only asked for dimensions 0 and 1.

@martinroyer chose the last option in this PR. From a pure ML perspective, that option is convenient: a user can ask the pipeline to aggregate the results in several dimensions and not care about the details; if one dimension is useless and the pipeline can automatically ignore it, that user is happy. Ignoring that dimension also has the advantage over returning a constant that it avoids a useless increase in dimension. It also goes well with the dictionary-like pandas output.
From a lower-level perspective, the last option can be confusing: the output of transform does not have the expected shape (I can ask for 3 landscapes with 5 points each but get a vector of dimension 10 because one dimension was ignored).
Both options have their advantages. Note that this choice can be handled either inside each vectorizer (say Atol(skip_on_fit_error=True), for which fit([]) defines an empty list of centers, and after that transform returns a list of empty arrays, which hopefully disappear when ColumnTransformer tries to append them to the rest), or in a wrapper like ColumnTransformer.
In any case, we should check how we handle (lists of) empty diagrams.
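[Editorial note] For illustration, option 3 (the behaviour of this PR's wrapper, which drops a dimension whose fit fails) can be sketched with a stand-in vectorizer; none of the names below are GUDHI classes:

```python
import numpy as np

class MeanVectorizer:
    # Stand-in vectorizer: fitting fails when every diagram is empty.
    def fit(self, X):
        nonempty = [d for d in X if len(d)]
        if not nonempty:
            raise ValueError("no points to fit on")
        self.center_ = np.concatenate(nonempty).mean(axis=0)
        return self

def fit_skipping_empty(vectorizers, dgms_by_dim):
    # Option 3: dimensions whose fit raises are dropped from the result,
    # mirroring how this PR leaves the key out of archipelago_.
    fitted = {}
    for dim, X in dgms_by_dim.items():
        try:
            fitted[dim] = vectorizers[dim].fit(X)
        except ValueError:
            print(f"skipping dimension {dim}: fit failed")
    return fitted

dgms_by_dim = {0: [np.array([[0.0, 1.0]]), np.array([[0.0, 2.0]])],
               2: [np.zeros((0, 2)), np.zeros((0, 2))]}  # all empty in dim 2
fitted = fit_skipping_empty({0: MeanVectorizer(), 2: MeanVectorizer()}, dgms_by_dim)
# fitted keeps only dimension 0; transform would later ignore dimension 2
```

This is exactly the shape surprise described above: downstream code asking for features in dimensions 0 and 2 silently receives only dimension 0.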

Collaborator Author (@martinroyer, Feb 13, 2024):

[PS: redrafted to reflect that this is consistent with @mglisse's option number 2]

Adding to that, I believe there are two "empty diagrams" cases to discuss, that of fit and that of transform.

As for fit, if we hold the view that once someone asks for vectorizing (say in dimensions 0 and 1) we should provide the structure for doing so (Marc's option 2, "provide the structure at all costs"):
for Atol, that implies that even without data there should always be centers at the end of fit, so we could for instance do random generation:

    def fit(self, X, y=None, sample_weight=None):
        if not hasattr(self.quantiser, 'fit'):
            raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))

        # In fitting we remove infinite birth/death time points so that every center is finite
        X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X if len(dgm)]

        if len(X) < self.quantiser.n_clusters:
            # If no point to fit, let's arbitrarily put centers in [0, 1)
            X += [np.array([[np.random.random(), np.random.random()]]) for _ in range(self.quantiser.n_clusters - len(X))]

(This would ensure that the quantiser has enough points to fit n_clusters regardless of the input X.)

As for transform, one convention could be that empty diagrams are thought of as empty measures, and therefore vectorized as 0 (no mass). This ensures that no error is thrown, which seems pretty desirable at transform time.
For Atol, one way of achieving this is to replace empty diagrams with np.array([[np.inf, np.inf]]).

That looks good to me.
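[Editorial note] This convention can be checked with a numpy-only sketch of a Gaussian contrast like Atol's (`gaussian_contrast` is an illustrative stand-in, not Atol's actual method): a point at infinity lies at infinite distance from every finite center, so it contributes exactly zero mass:

```python
import numpy as np

def gaussian_contrast(measure, centers, inertias):
    # One weight per center: sum over points of exp(-||x - c||^2 / inertia).
    dists = np.linalg.norm(measure[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-dists**2 / inertias).sum(axis=0)

centers = np.array([[0.5, 0.5]])
inertias = np.array([1.0])

# Convention: replace an empty diagram by a single point at infinity...
empty_as_inf = np.array([[np.inf, np.inf]])
vec = gaussian_contrast(empty_as_inf, centers, inertias)
# ...which vectorizes to exactly zero mass, so transform never raises on empty input.
```

A finite point at a center would instead contribute exp(0) = 1, so the zero output is unambiguous "no mass", not a degenerate case.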

Collaborator (@MathieuCarriere, Feb 16, 2024):

I am globally in favor of option (2), that is, returning a constant for all data instances when the corresponding persistence diagrams are all empty. I think it matches ML practices better, where one usually uses homology dimensions without thinking too much beforehand about whether the persistence diagrams make sense or not. If they do not (and thus constants are returned), pipelines will not crash, but will contain redundant, useless constants that can be spotted a posteriori by post-analysis of the results, as is sometimes done in other contexts.

That being said, I don't know if this is the behavior of the different representation methods that we have...

    """

    _parameter_constraints = {
        "island": [object, None],
        "island_dict": [dict, None],
    }

    def __init__(self, island=None, island_dict=None):
        self.island = island
        self.island_dict = island_dict

    def fit(self, X, y=None):
        """
        Calibration step: create and fit an `island` vectorizer to each matching diagram element.

        Args:
            X (list of diagrams): input persistence diagrams to fit vectorizers on.
            y: optional labels for each diagram.

        Returns:
            self
        """
        self._validate_params()
        if self.island is None and self.island_dict is None:
            self.island = Atol()
        self.archipelago_ = {}
        self._running_transform_names = ""

        max_dimension = max(dim for pdiagram in X for (dim, _) in pdiagram)
        by_dim_list_pdiags = [[
            np.array([_ for (dim, _) in pdiagram if dim == dimension]) for dimension in range(0, max_dimension + 1)
        ] for pdiagram in X]
        for dimension in range(0, max_dimension + 1):
            this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags]
            if not len(this_dim_list_pdiags):
                continue
            if self.island_dict is not None and dimension in self.island_dict.keys():
                island = self.island_dict[dimension]
            elif self.island_dict is not None:
                continue
            else:
                island = copy.deepcopy(self.island)
            island.fit(X=this_dim_list_pdiags, y=y)
            print(f"[Archipelago] Fit of homology dimension {dimension} with object {island.__class__} succeeded.")
            self.archipelago_[dimension] = island
        return self

    def transform(self, X, y=None):
        """
        Apply measure vectorisation on a list of persistence diagrams.

        Args:
            X (list of diagrams): input persistence diagrams to vectorize.
            y: Ignored, present for API consistency by convention.

        Returns:
            vectors: array of shape (len(X), n_features) whose columns are the vectorized homology dimensions
            in increasing order.
        """
        max_dimension = max(dim for pdiagram in X for (dim, _) in pdiagram)
        by_dim_list_pdiags = [[
            np.array([_ for (dim, _) in pdiagram if dim == dimension]) for dimension in range(0, max_dimension + 1)
        ] for pdiagram in X]

        archipelago_vectorized = []
        running_transform_names = []
        for dimension in range(0, max_dimension + 1):
            if dimension not in self.archipelago_.keys():
                print(f"[Archipelago] Encountered homology dimension {dimension} that has not been fitted on. Will ignore this key.")
                continue
            this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags]
            vectorized_dgms = self.archipelago_[dimension].transform(this_dim_list_pdiags)
            if hasattr(self.archipelago_[dimension], 'get_feature_names_out') and callable(self.archipelago_[dimension].get_feature_names_out):
                this_dim_names = self.archipelago_[dimension].get_feature_names_out()
            else:
                this_dim_names = [f"Feat{i + 1}" for i in range(vectorized_dgms.shape[1])]
            running_transform_names += [f"(D{dimension}) {name}" for name in this_dim_names]
            archipelago_vectorized.append(vectorized_dgms)
        self._running_transform_names = running_transform_names
        return np.concatenate(archipelago_vectorized, axis=1)

    def get_feature_names_out(self):
        return self._running_transform_names
26 changes: 20 additions & 6 deletions src/python/gudhi/representations/vector_methods.py
@@ -15,6 +15,8 @@
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
from sklearn.metrics import pairwise
+from sklearn.cluster import KMeans

try:
# New location since 1.0
from sklearn.metrics import DistanceMetric
@@ -719,21 +721,26 @@ class Atol(BaseEstimator, TransformerMixin):
>>> a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
>>> b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
>>> c = np.array([[3, 2, -1], [1, 2, -1]])
->>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
+>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
Comment on lines -722 to +724 — Collaborator Author:
@VincentRouvreau I believe this is what you need part1

>>> atol_vectoriser.fit(X=[a, b, c]).centers
array([[ 2.6 , 2.8 , -0.4 ],
[ 2. , 0.66666667, 3.33333333]])
->>> atol_vectoriser(a)
+>>> atol_vectoriser._transform(a)
array([0.42375966, 1.18168665])
->>> atol_vectoriser(c)
+>>> atol_vectoriser._transform(c)
array([1.25157463, 0.02062512])
>>> atol_vectoriser.transform(X=[a, b, c])
array([[0.42375966, 1.18168665],
[1.06330156, 0.29861028],
[1.25157463, 0.02062512]])
"""
# Note the example above must be up to date with the one in tests called test_atol_doc
-def __init__(self, quantiser, weighting_method="cloud", contrast="gaussian"):
+def __init__(
+    self,
+    quantiser=KMeans(n_clusters=2, random_state=202312, n_init=10),
+    weighting_method="cloud",
+    contrast="gaussian"
+):
"""
Constructor for the Atol measure vectorisation class.

@@ -751,6 +758,7 @@ def __init__(self, quantiser, weighting_method="cloud", contrast="gaussian"):
self.quantiser = quantiser
self.contrast = contrast
self.weighting_method = weighting_method
+self._running_transform_names = ""

def get_contrast(self):
return {
@@ -790,7 +798,9 @@ def fit(self, X, y=None, sample_weight=None):

measures_concat = np.concatenate(X)
weights_concat = np.concatenate(sample_weight)

self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)

self.centers = self.quantiser.cluster_centers_
# Hack, but some people are unhappy if the order depends on the version of sklearn
self.centers = self.centers[np.lexsort(self.centers.T)]
@@ -805,7 +815,7 @@
self.inertias = np.min(dist_centers, axis=0)/2
return self

-def __call__(self, measure, sample_weight=None):
+def _transform(self, measure, sample_weight=None):
"""
Apply measure vectorisation on a single measure. Only available after `fit` has been called.

@@ -834,4 +844,8 @@ def transform(self, X, sample_weight=None):
"""
if sample_weight is None:
sample_weight = [self.get_weighting_method()(measure) for measure in X]
-return np.stack([self(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)])
+self._running_transform_names = [f"Atol Center {i + 1}" for i in range(self.quantiser.n_clusters)]
+return np.stack([self._transform(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)])

+def get_feature_names_out(self):
+    return self._running_transform_names
16 changes: 8 additions & 8 deletions src/python/test/test_representations.py
@@ -118,23 +118,23 @@ def test_atol_doc():
b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
c = np.array([[3, 2, -1], [1, 2, -1]])

-atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
+atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
# Atol will do
# X = np.concatenate([a,b,c])
-# kmeans = KMeans(n_clusters=2, random_state=202006).fit(X)
+# kmeans = KMeans(n_clusters=2, random_state=202006, n_init=10).fit(X)
# kmeans.labels_ will be : array([1, 0, 1, 0, 0, 1, 0, 0])
first_cluster = np.asarray([a[0], a[2], b[2]])
-second_cluster = np.asarray([a[1], b[0], b[2], c[0], c[1]])
+second_cluster = np.asarray([a[1], b[0], b[1], c[0], c[1]])
Comment on lines -121 to +127 — Collaborator Author:
@VincentRouvreau I believe this is what you need part2


# Check the center of the first_cluster and second_cluster are in Atol centers
centers = atol_vectoriser.fit(X=[a, b, c]).centers
np.isclose(centers, first_cluster.mean(axis=0)).all(1).any()
np.isclose(centers, second_cluster.mean(axis=0)).all(1).any()

vectorization = atol_vectoriser.transform(X=[a, b, c])
-assert np.allclose(vectorization[0], atol_vectoriser(a))
-assert np.allclose(vectorization[1], atol_vectoriser(b))
-assert np.allclose(vectorization[2], atol_vectoriser(c))
+assert np.allclose(vectorization[0], atol_vectoriser._transform(a))
+assert np.allclose(vectorization[1], atol_vectoriser._transform(b))
+assert np.allclose(vectorization[2], atol_vectoriser._transform(c))


def test_dummy_atol():
@@ -145,12 +145,12 @@ def test_dummy_atol():
for weighting_method in ["cloud", "iidproba"]:
for contrast in ["gaussian", "laplacian", "indicator"]:
atol_vectoriser = Atol(
-quantiser=KMeans(n_clusters=1, random_state=202006),
+quantiser=KMeans(n_clusters=1, random_state=202006, n_init=10),
Collaborator Author:
@VincentRouvreau I believe this is what you need part3/3

weighting_method=weighting_method,
contrast=contrast,
)
atol_vectoriser.fit([a, b, c])
-atol_vectoriser(a)
+atol_vectoriser._transform(a)
atol_vectoriser.transform(X=[a, b, c])

