\documentclass[SKL-MASTER.tex]{subfiles}
\begin{document}
\Large
\section*{Assessing cluster correctness}
We talked a little about assessing clusters when the ground truth is not known. However,
we have not yet talked about assessing KMeans when the ground truth is known. In many cases
it is not knowable; however, if there is outside annotation, we will sometimes have the ground truth,
or at least a proxy for it.
\subsection*{Getting Some Data}
So, let's assume a world where we have some outside agent supplying us with the
ground truth.
We'll create a simple dataset, evaluate the measures of correctness against the
ground truth in several ways, and then discuss them:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> import matplotlib.pyplot as plt
>>> from sklearn import datasets
>>> from sklearn import cluster
>>> blobs, ground_truth = datasets.make_blobs(1000,
...     centers=3, cluster_std=1.75)
\end{verbatim}
\end{framed}
% \subsubsection{Implementation}
\noindent Before we walk through the metrics, let's take a look at the dataset:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> colors = ['r', 'g', 'b']
>>> for i in range(3):
...     p = blobs[ground_truth == i]
...     ax.scatter(p[:,0], p[:,1], c=colors[i],
...                label="Cluster {}".format(i))
...
>>> ax.set_title("Cluster With Ground Truth")
>>> ax.legend()
>>> f.savefig("9485OS_03-16")
\end{verbatim}
\end{framed}
%========================================================%
% % Building Models with Distance Metrics
% % 94
The following is the output:
\begin{figure}[h!]
\centering
\includegraphics[width=0.7\linewidth]{Cluster1}
% \caption{}
% \label{fig:Cluster1}
\end{figure}
In order to fit a KMeans model we'll create a KMeans object from the cluster module:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> kmeans = cluster.KMeans(n_clusters=3)
>>> kmeans.fit(blobs)
KMeans(copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=1,
precompute_distances=True,
random_state=None, tol=0.0001, verbose=0)
>>>
>>> kmeans.cluster_centers_
array([[ 5.18993766, 0.35110059],
[ 0.18300097, -4.9480336 ],
[ 10.01421381, -2.26274328]])
\end{verbatim}
\end{framed}
Now that we've fit the model, let's plot the cluster centroids on top of the ground truth:
%========================================================%
% % Chapter 3
% % 95
\begin{framed}
\begin{verbatim}
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> colors = ['r', 'g', 'b']
>>> for i in range(3):
...     p = blobs[ground_truth == i]
...     ax.scatter(p[:,0], p[:,1], c=colors[i],
...                label="Cluster {}".format(i))
...
>>> ax.scatter(kmeans.cluster_centers_[:, 0],
...            kmeans.cluster_centers_[:, 1],
...            s=100,
...            color='black',
...            label='Centers')
>>> ax.set_title("Cluster With Ground Truth")
>>> ax.legend()
>>> f.savefig("9485OS_03-17")
\end{verbatim}
\end{framed}
\noindent The following is the output: \textbf{GRAPH}
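Because we have ground-truth labels, we can view the clustering performance as a classification
exercise, so the metrics that are useful for classification are also useful here. For each true
cluster, we compute the fraction of points whose assigned label matches the true label: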
%--------------------- %
{
\large
\begin{framed}
\begin{verbatim}
>>> for i in range(3):
...     matches = (kmeans.labels_ == ground_truth)[ground_truth == i]
...     print(matches.astype(int).mean())
...
0.0778443113772
0.990990990991
0.0570570570571
\end{verbatim}
\end{framed}
}
%========================================================%
% % Building Models with Distance Metrics
% % 96
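Clearly, some cluster labels are swapped relative to the ground truth: KMeans numbers its
clusters arbitrarily, so two of the three labels do not line up. Let's get this straightened
out first, and then we'll look at the accuracy: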
%--------------------- %
{
\large
\begin{framed}
\begin{verbatim}
>>> new_ground_truth = ground_truth.copy()
>>> new_ground_truth[ground_truth == 0] = 2
>>> new_ground_truth[ground_truth == 2] = 0
>>> for i in range(3):
...     matches = (kmeans.labels_ == new_ground_truth)[ground_truth == i]
...     print(matches.astype(int).mean())
...
0.919161676647
0.990990990991
0.90990990991
\end{verbatim}
\end{framed}
}
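The manual swap above can also be automated. The following is only a minimal sketch, not part of the original recipe: \texttt{align\_labels} is a hypothetical helper, and it assumes that both labelings use the integers $0, \dots, k-1$. It builds a confusion matrix and picks the one-to-one relabeling that maximizes the number of matching points.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def align_labels(true_labels, cluster_labels):
    """Relabel clusters to best match the true labels (hypothetical helper).

    Assumes both labelings use the integers 0..k-1.
    """
    # Rows are true labels, columns are cluster labels.
    counts = confusion_matrix(true_labels, cluster_labels)
    # Maximizing the matched counts is a linear assignment problem;
    # negate the counts because the solver minimizes cost.
    true_idx, cluster_idx = linear_sum_assignment(-counts)
    mapping = dict(zip(cluster_idx, true_idx))
    return np.array([mapping[c] for c in cluster_labels])


# Toy labels: cluster ids 0 and 2 are swapped relative to the truth,
# and one point of true cluster 1 is assigned to the wrong cluster.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
found = np.array([2, 2, 2, 1, 1, 0, 0, 0, 0])
aligned = align_labels(truth, found)
accuracy = (aligned == truth).mean()
print(aligned, accuracy)
```

With the fitted model from this section, the same helper could be called as \texttt{align\_labels(ground\_truth, kmeans.labels\_)}; this relabels the predictions rather than the ground truth, which leaves the per-cluster accuracies unchanged.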
So, we're correct a little over 90 percent of the time in every cluster. The second measure of
similarity we'll look at is the mutual information score:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> from sklearn import metrics
>>> metrics.normalized_mutual_info_score(ground_truth, kmeans.labels_)
0.78533737204433651
\end{verbatim}
\end{framed}
A score close to 0 suggests that the two label assignments were probably not generated through
similar processes, whereas a score close to 1 means that there is a large
amount of agreement between the two labelings.
For example, let's look at what happens when we compute the mutual information score of the ground truth with itself:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> metrics.normalized_mutual_info_score(ground_truth,
...     ground_truth)
1.0
\end{verbatim}
\end{framed}
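Unlike the raw label comparison used earlier, the mutual information score does not depend on how the clusters are numbered. The following small sketch uses made-up toy labels (not the blobs data) to illustrate this: the two arrays describe the same partition under different label names.

```python
import numpy as np
from sklearn import metrics

truth = np.array([0, 0, 1, 1, 2, 2])
# Same partition as `truth`, but with the label names permuted.
renamed = np.array([2, 2, 0, 0, 1, 1])

# Element-wise comparison is fooled by the renaming ...
raw_agreement = (truth == renamed).mean()
# ... whereas normalized mutual information only looks at the grouping.
nmi = metrics.normalized_mutual_info_score(truth, renamed)
print(raw_agreement, nmi)
```

This is why no relabeling step was needed before calling \texttt{normalized\_mutual\_info\_score} on the KMeans labels.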
Given the name, we can tell that there is probably an unnormalized \texttt{mutual\_info\_score}:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> metrics.mutual_info_score(ground_truth, kmeans.labels_)
0.78945287371677486
\end{verbatim}
\end{framed}
The two values are very close here; however, normalized mutual information is the mutual information
divided by the square root of the product of the entropies of the two labelings, that is, of the
ground-truth labels and the assigned labels.
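The normalization described above can be checked by hand. The sketch below uses made-up toy labels and assumes a scikit-learn version whose \texttt{normalized\_mutual\_info\_score} accepts the \texttt{average\_method} argument; passing \texttt{"geometric"} explicitly requests the square-root-of-product normalization, since the default normalization may differ between releases.

```python
import numpy as np
from scipy.stats import entropy
from sklearn import metrics

# Toy labelings, chosen only for illustration.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
found = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])

mi = metrics.mutual_info_score(truth, found)
# Entropy of each labeling from its label counts (natural log, as in mi).
h_truth = entropy(np.bincount(truth))
h_found = entropy(np.bincount(found))
manual_nmi = mi / np.sqrt(h_truth * h_found)

library_nmi = metrics.normalized_mutual_info_score(
    truth, found, average_method="geometric")
print(manual_nmi, library_nmi)
```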
%========================================================%
% % Chapter 3
% % 97
\subsection*{Further Remarks} % There's more...
\begin{itemize}
\item One cluster metric we haven't talked about yet, and one that does not rely on the ground truth,
is inertia.
\item It is not very well documented as a metric at the moment.
\item However, it is the metric
that \texttt{KMeans} minimizes.
\end{itemize}
\subsection*{Inertia}
Inertia is the sum of the squared distances between each point and the centroid of its assigned cluster.
A fitted \texttt{KMeans} object exposes this value directly:
%--------------------- %
\begin{framed}
\begin{verbatim}
>>> kmeans.inertia_
\end{verbatim}
\end{framed}
In scikit-learn, inertia is calculated as the sum of the squared distances from each point to its closest centroid, i.e., the centroid of its assigned cluster. So $I = \sum_{i} d(x_i, c_{r(i)})$, where $x_i$ is the $i$-th point, $c_{r(i)}$ is the centroid of the cluster $r(i)$ assigned to it, and $d$ is the squared Euclidean distance.
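The definition can be reproduced with a little NumPy. This is a minimal sketch rather than the library's own implementation: it regenerates a blobs dataset with a fixed \texttt{random\_state} (an assumption added here for reproducibility) and compares the hand-computed sum with \texttt{inertia\_}.

```python
import numpy as np
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(1000, centers=3, cluster_std=1.75,
                               random_state=0)
kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(blobs)

# Centroid assigned to each point, looked up through the fitted labels.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
# Sum of squared Euclidean distances from each point to its centroid.
manual_inertia = ((blobs - assigned_centers) ** 2).sum()
print(manual_inertia, kmeans.inertia_)
```

The two printed values should agree up to floating-point error.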
%
%Now the formula of gap statistic involves
%\[W_k = \sum_{r=1}^{k}\frac 1 {(2*n_r) }D_r\]
%where Dr is the sum of the squared distances between all points in cluster r.
%
%By introducing +c, −c in the squared distance formula (c being the centroid of cluster r coordinates), I have a term that corresponds to Inertia (as in scikit) + a term that disappears if each c is the barycentre of each cluster (which it is supposed to be in kmeans). So I guess Wk is in fact scikit Inertia.
\end{document}