This project implements the KMeans clustering algorithm in Python, with additional functionality for visualizing each step of the clustering process. It includes two initialization methods, random initialization and KMeans++, the latter providing better-spread initial centroids. The implementation supports both the Euclidean and Manhattan distance metrics.
## Features

- Customizable distance metrics: Supports both the Euclidean and Manhattan distances (see the sketch after this list).
- KMeans++: Provides better initialization of centroids for faster convergence.
- Visualization: Visualizes the clustering process and how centroids evolve with each iteration.
- Hyperparameter tuning: Allows easy tuning of parameters such as `k`, `tol`, `max_iter`, `method`, and `metric`.
- Model evaluation: Uses inertia and the Rand Index to evaluate the quality of the clustering.
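For reference, here is a minimal NumPy sketch of the two supported metrics. It illustrates the standard definitions only; the function names are not part of this repository's API:

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance: square root of the summed squared differences
    return np.sqrt(np.sum((a - b) ** 2, axis=-1))

def manhattan(a, b):
    # city-block distance: sum of absolute coordinate differences
    return np.sum(np.abs(a - b), axis=-1)
```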
## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
- [KMeans++](#kmeans)
- [Hyperparameter Tuning](#hyperparameter-tuning)
- [Visualization](#visualization)
- [Evaluation Metrics](#evaluation-metrics)
- [Examples](#examples)
- [References](#references)
## Installation

To get started, clone this repository and install the necessary dependencies:

```bash
git clone https://github.com/Sti11ness/kmeans-clustering.git
cd kmeans-clustering
```
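The usage and evaluation examples below also rely on `pandas`, `matplotlib`, and `scikit-learn`, so make sure those are available in your environment.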
## Usage

The project consists of a `KMeans` class that allows for flexible clustering based on various parameters, such as the distance metric and the initialization method. The `fit` method runs the clustering algorithm, while the `evaluate` method calculates the inertia to assess model performance.
```python
from kmeans import KMeans
import pandas as pd

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# KMeans is distance-based, so normalizing your features beforehand is strongly recommended
df_copy = df[['feature1', 'feature2']].copy(deep=True)

# Initialize the KMeans model
model = KMeans()

# Fit the model
model.fit(df_copy, k=3, method='kmeans++', metric='euclidean')

# Evaluate the model (inertia)
score = model.evaluate(df_copy)
```
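As the comment above suggests, scaling the features before calling `fit` usually improves results. One common choice is a z-score transform with plain pandas; this is a sketch of a preprocessing step, not functionality provided by the repo:

```python
# scale each feature to zero mean and unit variance before fitting
df_copy = (df_copy - df_copy.mean()) / df_copy.std()
```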
## KMeans++

KMeans++ is a centroid initialization method that aims to speed up convergence by placing the initial centroids far apart from one another. You can enable KMeans++ by specifying the method in the `fit` call:
```python
model.fit(df[['feature1', 'feature2']], k=3, method='kmeans++', metric='manhattan')
```
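To make the idea concrete, here is a minimal NumPy sketch of the KMeans++ seeding step; the function name and signature are illustrative, not this repository's API:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Choose k initial centroids from the rows of X, an (n, d) array."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(-1).min(1)
        # sample the next centroid with probability proportional to that distance,
        # so far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```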
## Hyperparameter Tuning

To tune hyperparameters like the distance metric, initialization method, and the number of iterations, you can use a simple loop or a more sophisticated grid search approach:
```python
from itertools import product

params = {
    'metric': ['euclidean', 'manhattan'],
    'method': ['random', 'kmeans++'],
    'max_iter': [50, 100, 150],
    'tol': [1e-3, 1e-4]
}

# evaluate() returns inertia, so lower is better
best_score = float('inf')
best_params = {}

# Grid search over all hyperparameter combinations
for metric, method, max_iter, tol in product(params['metric'], params['method'],
                                             params['max_iter'], params['tol']):
    model = KMeans()
    model.fit(df[['feature1', 'feature2']], k=3, method=method, metric=metric,
              max_iter=max_iter, tol=tol)
    score = model.evaluate(df[['feature1', 'feature2']])
    if score < best_score:
        best_score = score
        best_params = {'metric': metric, 'method': method,
                       'max_iter': max_iter, 'tol': tol}

print(f"Best Score: {best_score}")
print(f"Best Parameters: {best_params}")
```
## Visualization

The model saves the history of centroids and cluster assignments at each iteration, which allows for a detailed visualization of the clustering process:
```python
import matplotlib.pyplot as plt

def visualize_kmeans(X, labels, centroids, iteration):
    # points colored by cluster assignment, centroids marked with red crosses
    plt.scatter(X['feature1'], X['feature2'], c=labels)
    plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x')
    plt.title(f"Iteration {iteration}")
    plt.show()

# Requires fitting with track_history=True (see Examples below)
for i, (centroids, labels) in enumerate(zip(model.history['centroids'], model.history['clusters'])):
    visualize_kmeans(df[['feature1', 'feature2']], labels, centroids, i)
```
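If you would rather watch the iterations as a single animation instead of one figure per step, here is a sketch using `matplotlib.animation`, assuming the same `model.history` structure as above:

```python
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def animate_kmeans(X, history, interval=500):
    fig, ax = plt.subplots()

    def draw(i):
        # redraw points and centroids for iteration i
        ax.clear()
        centroids, labels = history['centroids'][i], history['clusters'][i]
        ax.scatter(X['feature1'], X['feature2'], c=labels)
        ax.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x')
        ax.set_title(f"Iteration {i}")

    return FuncAnimation(fig, draw, frames=len(history['centroids']), interval=interval)

anim = animate_kmeans(df[['feature1', 'feature2']], model.history)
plt.show()
```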
## Evaluation Metrics

- Inertia: Measures the sum of squared distances from each point to its assigned centroid; lower inertia means tighter clusters (see the formula after this list).
- Rand Index: Measures the similarity between two data clusterings. It ranges from 0 to 1, where 1 means the two clusterings agree exactly.
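For Euclidean distance, inertia is the standard within-cluster sum of squares:

$$\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2$$

where $c(i)$ is the cluster assigned to point $x_i$ and $\mu_{c(i)}$ is that cluster's centroid.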
Rand Index example:

```python
from sklearn.metrics import rand_score

# true_labels: ground-truth classes; predicted_labels: the model's cluster assignments
rand_idx = rand_score(true_labels, predicted_labels)
print(f"Rand Index: {rand_idx}")
```
## Examples

Run the KMeans model with different settings and visualize the results:
```python
# Initialize the model
model = KMeans()

# Fit with k=3, recording the per-iteration history for visualization
model.fit(df[['height', 'weight']], k=3, metric='euclidean', method='kmeans++',
          max_iter=100, track_history=True)

# Visualize the clustering process iteration by iteration
for i, (centroids, labels) in enumerate(zip(model.history['centroids'], model.history['clusters'])):
    visualize_kmeans(df[['height', 'weight']], labels, centroids, i)
```