# Archipelago - dict transformer for vectorizing persistence diagrams (#1017)
New file (@@ -0,0 +1,127 @@):

```python
# This file is part of the Gudhi Library - https://gudhi.inria.fr/ - which is released under MIT.
# See file LICENSE or go to https://gudhi.inria.fr/licensing/ for full license details.
# Author(s): Martin Royer

import copy

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

from gudhi.representations.vector_methods import Atol, TopologicalVector, BettiCurve


class Archipelago(BaseEstimator, TransformerMixin):
    """
    Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from
    gudhi.representations.vector_methods. One provides persistence diagram vectorizers (by way of
    either `island` or `island_dict`), and the Archipelago object will fit on and transform
    (= vectorize) lists or series of persistence diagrams. The object is sklearn-API consistent.

    Parameters:
        island: island for populating the archipelago, i.e. object to vectorize the target in each
            homology dimension. Must be `copy.deepcopy`-able. *Will be ignored if island_dict is given*.
        island_dict: island dict for populating the archipelago, i.e. dictionary of objects to
            vectorize persistence diagrams according to homology_dimensions.

    Examples:
        >>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
```
**Review comment:** I am a bit reluctant to add more code that uses the old format for persistence diagrams, when we are trying to replace it with something more convenient (see #925, https://github.com/GUDHI/gudhi-devel/blob/master/src/python/gudhi/sklearn/cubical_persistence.py or #691). Of course we could change Archipelago later to accept the new format, but it may become complicated if we try to be compatible with both.

**Reply:** I'm not sure what the new format is, but it seems it will be better optimized for Archipelago, so all the better. Do you mean you want me to wait for those PRs to come online before processing this one?

**Reply:** It is the one we discussed before the holidays. Instead of a list of (dim, (birth, death)), if the user asked for homology in dimensions [i, j, k], they get [dgm_i, dgm_j, dgm_k] where each dgm is a numpy array of shape (*, 2). It is better in that you already have numpy arrays for each dimension. However, because it is a list and not a dict, you do not have access to the labels [i, j, k]; the user has to provide them separately if you need them for something. It also makes it ambiguous, if we refer to the 0-th diagram, whether it corresponds to homology in dimension 0, or in dimension i (the first of the list).

**Reply:** I don't know what I want 😉
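To make the two formats in this discussion concrete, here is a small sketch of a converter. The helper name `old_to_new` is hypothetical (not part of gudhi); it turns the old list of (dim, (birth, death)) pairs into the new per-dimension list of (n, 2) numpy arrays:

```python
import numpy as np

# Hypothetical helper (not in gudhi): convert the old
# [(dim, (birth, death)), ...] format into the new
# [dgm_0, dgm_1, ...] format of (n, 2) numpy arrays.
def old_to_new(pdiagram, homology_dimensions):
    return [
        np.array([pt for dim, pt in pdiagram if dim == d]).reshape(-1, 2)
        for d in homology_dimensions
    ]

old = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
new = old_to_new(old, homology_dimensions=[0, 1, 2])
# new[0] has shape (2, 2); new[1] and new[2] have shape (1, 2).
# A dimension with no points yields a (0, 2) array.
```

Note the ambiguity mentioned above: `new` carries no dimension labels, so the caller must remember which index corresponds to which homology dimension.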
```python
        >>> pdiagram2 = [(0, (0.0, 3.34)), (0, (0.0, 2.956)), (1, (0.536, 1.856)), (2, (1.202, 2.734))]
        >>> pdiagram3 = [(0, (1.0, 4.34)), (0, (2.0, 3.956)), (1, (1.536, 2.856)), (2, (3.202, 4.734))]
        >>> list_pdiags = [pdiagram1, pdiagram2, pdiagram3]
        >>> archipelago = Archipelago(island=Atol())
        >>> archipelago.fit(X=list_pdiags)
        >>> archipelago.transform(X=list_pdiags)
        >>> archipelago = Archipelago(island_dict={2: BettiCurve(resolution=4), 0: Atol()})
        >>> import pandas as pd
        >>> series_pdiags = pd.Series(list_pdiags)
        >>> archipelago.set_output(transform="pandas")
        >>> archipelago.fit(X=series_pdiags)
        >>> archipelago.transform(X=series_pdiags)
```
**Review comment:** Looking at https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html, using the new format of persistence diagrams, and ignoring pandas for now, is this equivalent to the following?

```python
from gudhi.representations import BettiCurve, Atol, Archipelago
import numpy as np
from sklearn.compose import ColumnTransformer

Xpdiagram1 = [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])]
Xpdiagram2 = [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])]
Xpdiagram3 = [np.array([[1.0, 3.34], [2.0, 2.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
Xlist_pdiags = [Xpdiagram1, Xpdiagram2, Xpdiagram3]

ct = ColumnTransformer([("0", Atol(), 0), ("1", Atol(), 1), ("2", Atol(), 2)])
print(ct.fit_transform(Xlist_pdiags))
ct = ColumnTransformer([("0", Atol(), 0), ("2", BettiCurve(resolution=4), 2)])
print(ct.fit_transform(Xlist_pdiags))
```

I had to do a couple of changes in BettiCurve to let this run (I don't know exactly what […]):

```diff
@@ -381,7 +381,7 @@
         if not self.is_fitted():
             raise NotFittedError("Not fitted.")
-        if not X:
+        if X is None or len(X) == 0:
             X = [np.zeros((0, 2))]
         N = len(X)
@@ -408,7 +408,7 @@
         return np.array(bettis, dtype=int)[:, 0:-1]
-    def fit_transform(self, X):
+    def fit_transform(self, X, y=None):
         """
         The result is the same as fit(X) followed by transform(X), but potentially faster.
         """
```

and I get similar results, but they differ from what Archipelago returns for dimension 0 (maybe it is just related to random generators).

**Reply:** Hmm, they differ because you made a mistake in copying; it should be […], and then the results are equal indeed. Also this seems to be working: […], so this is possibly the end for this PR, with the new pdiags format.

**Reply:** So the one thing this PR does better than your proposition is handle dimensions that fail at […].

**Reply:** After further discussion with Marc, I believe the better philosophy for gudhi is what he suggests: […]. So we can definitely drop this PR now. We should (somehow?) salvage the changes to vector_methods: a default method for Atol so that it can be initialized with empty arguments, and some tweaks that can be generalized to all vector methods, such as implementing […].

**Reply:** Yes, we should still include the other changes. For reference, the discussion was mostly about how to handle empty diagrams in some dimensions. If the user asks for persistence in dimensions 0, 1 and 2, but the diagrams in dimension 2 are all empty (especially during […]). @martinroyer chose the last option in this PR. From a pure ML perspective, that option is convenient. A user can ask the pipeline to aggregate the results in several dimensions and not care about the details; if one dimension is useless and the pipeline can automatically ignore it, that user is happy. Ignoring that dimension also has the advantage over returning a constant that it avoids a useless increase in dimension. It also goes well with the dictionary-like pandas.

**Reply:** [PS: redrafted to reflect that this is consistent with @mglisse's option number 2] Adding to that, I believe there are two "empty diagrams" cases to discuss, that of […]. As for […] (this would ensure that the quantiser has enough points to fit). As for […]: that looks good to me.

**Reply:** I am globally in favor of option (2), that is, return a constant for all data instances if the corresponding persistence diagrams are all empty. I think it matches better with ML practices, where you usually use homology dimensions without thinking too much beforehand about whether the persistence diagrams make sense or not. If they do not (and thus constants are returned), pipelines will not crash, but will contain redundant, useless constants, that can be spotted a posteriori with post-analysis of the results, as is sometimes done in other contexts. That being said, I don't know if this is the behavior of the different representation methods that we have...
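For illustration, option (2) — return a constant when all diagrams in a dimension are empty — could be sketched as a small wrapper. `ConstantOnEmpty` is hypothetical (not part of gudhi or this PR), written only to make the behavior under discussion concrete:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ConstantOnEmpty(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper sketching option (2): if every diagram seen at
    fit time in this dimension is empty, skip the wrapped vectorizer and
    emit a constant column at transform time instead of crashing."""

    def __init__(self, vectorizer, fill_value=0.0):
        self.vectorizer = vectorizer
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # Remember whether this dimension carried any points at all.
        self.all_empty_ = all(len(dgm) == 0 for dgm in X)
        if not self.all_empty_:
            self.vectorizer.fit(X, y)
        return self

    def transform(self, X):
        if self.all_empty_:
            # One constant feature per sample; pipelines keep running.
            return np.full((len(X), 1), self.fill_value)
        return self.vectorizer.transform(X)
```

Compared with silently dropping the dimension (the option chosen in this PR), this keeps the output width predictable at the cost of a redundant column.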
||
""" | ||
_parameter_constraints = { | ||
"island": [object, None], | ||
"island_dict": [dict, None], | ||
} | ||
|
||
def __init__( | ||
self, | ||
island=None, | ||
island_dict=None | ||
): | ||
self.island = island | ||
self.island_dict = island_dict | ||
|
||
def fit(self, X, y=None): | ||
""" | ||
Calibration step: create and fit `island` vectorizer to each matching diagram element | ||
|
||
Args: | ||
X (list of diagrams): input persistence diagrams to fit vectorizers on. | ||
y: possibly labels for each diagram | ||
|
||
Returns: | ||
self | ||
""" | ||
self._validate_params() | ||
if self.island is None and self.island_dict is None: | ||
self.island = Atol() | ||
self.archipelago_ = {} | ||
self._running_transform_names = "" | ||
|
||
max_dimension = max(dim for pdiagram in X for (dim, _) in pdiagram) | ||
by_dim_list_pdiags = [[ | ||
np.array([_ for (dim, _) in pdiagram if dim == dimension]) for dimension in range(0, max_dimension + 1) | ||
] for pdiagram in X] | ||
for dimension in range(0, max_dimension + 1): | ||
this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags] | ||
if not len(this_dim_list_pdiags): | ||
continue | ||
if self.island_dict is not None and dimension in self.island_dict.keys(): | ||
island = self.island_dict[dimension] | ||
elif self.island_dict is not None: | ||
continue | ||
else: | ||
island = copy.deepcopy(self.island) | ||
island.fit(X=this_dim_list_pdiags, y=y) | ||
print(f"[Archipelago] Fit of homology dimension {dimension} with object {island.__class__} succeeded.") | ||
self.archipelago_[dimension] = island | ||
return self | ||
|
||
def transform(self, X, y=None): | ||
""" | ||
Apply measure vectorisation on a dictionary of list of measures. | ||
|
||
Args: | ||
X (list of diagrams): input persistence diagrams to vectorize. | ||
y: Ignored, present for API consistency by convention. | ||
|
||
Returns: | ||
vectors : array of shape (len(X), n_features) where the columns features are vectorized homology dimension | ||
in increasing order. | ||
""" | ||
|
||
max_dimension = max(dim for pdiagram in X for (dim, _) in pdiagram) | ||
by_dim_list_pdiags = [[ | ||
np.array([_ for (dim, _) in pdiagram if dim == dimension]) for dimension in range(0, max_dimension + 1) | ||
] for pdiagram in X] | ||
|
||
archipelago_vectorized = [] | ||
running_transform_names = [] | ||
for dimension in range(0, max_dimension + 1): | ||
if dimension not in self.archipelago_.keys(): | ||
print(f"[Archipelago] Encounters homology dimension {dimension} that has not been fitted on. Will ignore this key") | ||
continue | ||
this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags] | ||
vectorized_dgms = self.archipelago_[dimension].transform(this_dim_list_pdiags) | ||
if hasattr(self.archipelago_[dimension], 'get_feature_names_out') and callable(self.archipelago_[dimension].get_feature_names_out): | ||
this_dim_names = self.archipelago_[dimension].get_feature_names_out() | ||
else: | ||
this_dim_names = [f"Feat{i + 1}" for i in range(vectorized_dgms.shape[1])] | ||
running_transform_names += [f"(D{dimension}) {name}" for name in this_dim_names] | ||
archipelago_vectorized.append(vectorized_dgms) | ||
self._running_transform_names = running_transform_names | ||
return np.concatenate(archipelago_vectorized, axis=1) | ||
|
||
def get_feature_names_out(self): | ||
return self._running_transform_names |
---

Changes to `vector_methods.py`:

```diff
@@ -15,6 +15,8 @@
 from sklearn.exceptions import NotFittedError
 from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
 from sklearn.metrics import pairwise
+from sklearn.cluster import KMeans

 try:
     # New location since 1.0
     from sklearn.metrics import DistanceMetric
```
```diff
@@ -719,21 +721,26 @@ class Atol(BaseEstimator, TransformerMixin):
     >>> a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
     >>> b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
     >>> c = np.array([[3, 2, -1], [1, 2, -1]])
-    >>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
+    >>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
```
**Review comment** (on lines -722 to +724): @VincentRouvreau I believe this is what you need, part 1.
```diff
     >>> atol_vectoriser.fit(X=[a, b, c]).centers
     array([[ 2.6       ,  2.8       , -0.4       ],
            [ 2.        ,  0.66666667,  3.33333333]])
-    >>> atol_vectoriser(a)
+    >>> atol_vectoriser._transform(a)
     array([0.42375966, 1.18168665])
-    >>> atol_vectoriser(c)
+    >>> atol_vectoriser._transform(c)
     array([1.25157463, 0.02062512])
     >>> atol_vectoriser.transform(X=[a, b, c])
     array([[0.42375966, 1.18168665],
            [1.06330156, 0.29861028],
            [1.25157463, 0.02062512]])
     """
     # Note the example above must be up to date with the one in tests called test_atol_doc
-    def __init__(self, quantiser, weighting_method="cloud", contrast="gaussian"):
+    def __init__(
+            self,
+            quantiser=KMeans(n_clusters=2, random_state=202312, n_init=10),
+            weighting_method="cloud",
+            contrast="gaussian"
+    ):
         """
         Constructor for the Atol measure vectorisation class.
```
```diff
@@ -751,6 +758,7 @@ def __init__(self, quantiser, weighting_method="cloud", contrast="gaussian"):
         self.quantiser = quantiser
         self.contrast = contrast
         self.weighting_method = weighting_method
+        self._running_transform_names = ""

     def get_contrast(self):
         return {
@@ -790,7 +798,9 @@ def fit(self, X, y=None, sample_weight=None):

         measures_concat = np.concatenate(X)
         weights_concat = np.concatenate(sample_weight)

         self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)

         self.centers = self.quantiser.cluster_centers_
+        # Hack, but some people are unhappy if the order depends on the version of sklearn
+        self.centers = self.centers[np.lexsort(self.centers.T)]
@@ -805,7 +815,7 @@ def fit(self, X, y=None, sample_weight=None):
         self.inertias = np.min(dist_centers, axis=0)/2
         return self

-    def __call__(self, measure, sample_weight=None):
+    def _transform(self, measure, sample_weight=None):
         """
         Apply measure vectorisation on a single measure. Only available after `fit` has been called.
         """
@@ -834,4 +844,8 @@ def transform(self, X, sample_weight=None):
         """
         if sample_weight is None:
             sample_weight = [self.get_weighting_method()(measure) for measure in X]
-        return np.stack([self(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)])
+        self._running_transform_names = [f"Atol Center {i + 1}" for i in range(self.quantiser.n_clusters)]
+        return np.stack([self._transform(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)])
+
+    def get_feature_names_out(self):
+        return self._running_transform_names
```
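The lexsort hack above makes the order of the fitted centers deterministic across sklearn versions: `np.lexsort` treats the *last* key as primary, so with `centers.T` as the key sequence the rows are ordered by their last coordinate, with ties broken by earlier coordinates. A minimal sketch, using the two centers from the doctest above:

```python
import numpy as np

# The two KMeans centers from the Atol doctest, in the order sklearn
# might happen to return them:
centers = np.array([[2.0, 0.66666667, 3.33333333],
                    [2.6, 2.8, -0.4]])

# np.lexsort(centers.T) sorts rows by their last coordinate first
# (-0.4 < 3.33...), so the row order no longer depends on KMeans.
order = np.lexsort(centers.T)
centers = centers[order]
# centers is now [[2.6, 2.8, -0.4], [2.0, 0.66666667, 3.33333333]],
# matching the doctest output regardless of the initial order.
```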
---

Changes to the Atol tests:
```diff
@@ -118,23 +118,23 @@ def test_atol_doc():
     b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
     c = np.array([[3, 2, -1], [1, 2, -1]])

-    atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
+    atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
     # Atol will do
     # X = np.concatenate([a,b,c])
-    # kmeans = KMeans(n_clusters=2, random_state=202006).fit(X)
+    # kmeans = KMeans(n_clusters=2, random_state=202006, n_init=10).fit(X)
     # kmeans.labels_ will be : array([1, 0, 1, 0, 0, 1, 0, 0])
     first_cluster = np.asarray([a[0], a[2], b[2]])
-    second_cluster = np.asarray([a[1], b[0], b[2], c[0], c[1]])
+    second_cluster = np.asarray([a[1], b[0], b[1], c[0], c[1]])
```

**Review comment** (on lines -121 to +127): @VincentRouvreau I believe this is what you need, part 2.

```diff
     # Check the center of the first_cluster and second_cluster are in Atol centers
     centers = atol_vectoriser.fit(X=[a, b, c]).centers
     np.isclose(centers, first_cluster.mean(axis=0)).all(1).any()
     np.isclose(centers, second_cluster.mean(axis=0)).all(1).any()

     vectorization = atol_vectoriser.transform(X=[a, b, c])
-    assert np.allclose(vectorization[0], atol_vectoriser(a))
-    assert np.allclose(vectorization[1], atol_vectoriser(b))
-    assert np.allclose(vectorization[2], atol_vectoriser(c))
+    assert np.allclose(vectorization[0], atol_vectoriser._transform(a))
+    assert np.allclose(vectorization[1], atol_vectoriser._transform(b))
+    assert np.allclose(vectorization[2], atol_vectoriser._transform(c))


 def test_dummy_atol():

@@ -145,12 +145,12 @@ def test_dummy_atol():
     for weighting_method in ["cloud", "iidproba"]:
         for contrast in ["gaussian", "laplacian", "indicator"]:
             atol_vectoriser = Atol(
-                quantiser=KMeans(n_clusters=1, random_state=202006),
+                quantiser=KMeans(n_clusters=1, random_state=202006, n_init=10),
```

**Review comment:** @VincentRouvreau I believe this is what you need, part 3/3.

```diff
                 weighting_method=weighting_method,
                 contrast=contrast,
             )
             atol_vectoriser.fit([a, b, c])
-            atol_vectoriser(a)
+            atol_vectoriser._transform(a)
             atol_vectoriser.transform(X=[a, b, c])
```
**Review comment:** (This may be more natural with the new format.) Maybe we should say that this class is helpful to avoid going through DimensionSelector + FeatureUnion or similar. It could even be nice to have an example somewhere comparing the two (we can add it later, though).
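A hedged sketch of the DimensionSelector + FeatureUnion pattern mentioned in the comment above, on diagrams in the new per-dimension format. Both `select_dim` and `TotalPersistence` are stand-ins invented for this sketch (the real pipeline would use gudhi's DimensionSelector and a vectorizer such as Atol or BettiCurve), so that it runs with sklearn and numpy alone:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Stand-in for gudhi's DimensionSelector: pick one diagram per sample
# out of the new [dgm_0, dgm_1, ...] format.
def select_dim(X, index):
    return [dgms[index] for dgms in X]

# Stand-in vectorizer: one feature per diagram, the total persistence
# sum(death - birth). Replaces Atol/BettiCurve for illustration only.
class TotalPersistence(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[np.sum(d[:, 1] - d[:, 0])] for d in X])

# One branch per homology dimension, results concatenated column-wise.
union = FeatureUnion([
    (f"H{d}", make_pipeline(
        FunctionTransformer(select_dim, kw_args={"index": d}),
        TotalPersistence(),
    ))
    for d in range(3)
])

Xlist_pdiags = [
    [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])],
    [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])],
    [np.array([[1.0, 4.34], [2.0, 3.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])],
]
features = union.fit_transform(Xlist_pdiags)  # shape (3, 3): one column per dimension
```

The point of Archipelago (or ColumnTransformer) is precisely to hide this per-dimension plumbing behind a single fit/transform pair.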