Introduction

Clustering is a type of unsupervised learning that involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Clustering is widely used in various fields such as data mining, pattern recognition, image analysis, and bioinformatics.

Key Concepts:

  • Cluster: A collection of data points grouped together because of their similarity.
  • Centroid: The center of a cluster, typically computed as the mean of its points.
  • Intra-cluster distance: The distance between data points within the same cluster; a good clustering keeps this small.
  • Inter-cluster distance: The distance between data points (or centroids) of different clusters; a good clustering keeps this large. A short sketch computing these quantities follows this list.
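
To make these concepts concrete, here is a minimal sketch (the data and the cluster split are hypothetical, chosen only for illustration) that computes a centroid, an intra-cluster distance, and an inter-cluster distance with NumPy:

import numpy as np

# Two hypothetical clusters of 2-D points
cluster_a = np.array([[1, 2], [1, 4], [1, 0]])
cluster_b = np.array([[4, 2], [4, 4], [4, 0]])

# Centroid: the mean of each cluster's points
centroid_a = cluster_a.mean(axis=0)
centroid_b = cluster_b.mean(axis=0)

# Intra-cluster distance: here, the average distance from points to their centroid
intra_a = np.linalg.norm(cluster_a - centroid_a, axis=1).mean()

# Inter-cluster distance: here, the distance between the two centroids
inter_ab = np.linalg.norm(centroid_a - centroid_b)

print("Centroid A:", centroid_a)           # [1. 2.]
print("Intra-cluster (A):", intra_a)       # ~1.33
print("Inter-cluster (A, B):", inter_ab)   # 3.0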

Types of Clustering Algorithms

  1. Partitioning Methods: These methods divide the data into distinct non-overlapping subsets (clusters).

    • K-Means
    • K-Medoids
  2. Hierarchical Methods: These methods create a hierarchy of clusters.

    • Agglomerative Hierarchical Clustering
    • Divisive Hierarchical Clustering
  3. Density-Based Methods: These methods form clusters based on the density of data points in the data space.

    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    • OPTICS (Ordering Points To Identify the Clustering Structure)
  4. Model-Based Methods: These methods assume that the data is generated by a mixture of underlying probability distributions and fit that mixture to the data (see the sketch after this list).

    • Gaussian Mixture Models (GMM)
    • Expectation-Maximization (EM), the algorithm typically used to fit a GMM
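
As a brief illustration of the model-based family, here is a minimal sketch that fits a two-component Gaussian Mixture Model with scikit-learn, using the same toy data as the K-Means example below; GaussianMixture runs the EM algorithm internally:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data (same points as the K-Means example below)
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Fit a mixture of two Gaussians; the fitting procedure is Expectation-Maximization
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Hard cluster assignments and the fitted component means
print("Labels:", gmm.predict(X))
print("Means:", gmm.means_)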

K-Means Clustering

Algorithm Steps:

  1. Initialization: Choose k initial centroids randomly.
  2. Assignment: Assign each data point to the nearest centroid.
  3. Update: Calculate the new centroids as the mean of the assigned data points.
  4. Repeat: Repeat the assignment and update steps until convergence (i.e., the centroids no longer change).

Example:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# K-Means clustering (n_init set explicitly for reproducibility across scikit-learn versions)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Cluster centers
print("Cluster centers:", kmeans.cluster_centers_)

# Labels of each point
print("Labels:", kmeans.labels_)

# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

Explanation:

  • Initialization: The KMeans object is initialized with n_clusters=2.
  • Assignment: Each point is assigned to the nearest centroid.
  • Update: The centroids are updated based on the mean of the points in each cluster.
  • Repeat: The process repeats until the centroids stabilize.
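
Once fitted, the model can also assign previously unseen points to the learned clusters. A short follow-up sketch, reusing the kmeans object from the example above:

# Assign new, unseen points to the nearest learned centroid
new_points = np.array([[0, 0], [5, 5]])
print("Predicted clusters:", kmeans.predict(new_points))

# Inertia: sum of squared distances of samples to their closest centroid
print("Inertia:", kmeans.inertia_)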

DBSCAN Clustering

Algorithm Steps:

  1. Core Points: Identify core points, i.e., points that have at least min_samples points (including themselves) within eps distance.
  2. Directly Density-Reachable: Points within eps distance of a core point are directly density-reachable from it.
  3. Density-Reachable: Points reachable from core points through a chain of directly density-reachable points.
  4. Cluster Formation: Form clusters from connected core points and their density-reachable points.

Example:

from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# DBSCAN clustering
dbscan = DBSCAN(eps=3, min_samples=2).fit(X)

# Labels of each point
print("Labels:", dbscan.labels_)

# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='viridis')
plt.show()

Explanation:

  • Core Points: Points with at least min_samples neighbors (including the point itself) within eps distance.
  • Directly Density-Reachable: Points within eps distance of a core point.
  • Density-Reachable: Points reachable through a chain of directly density-reachable points.
  • Cluster Formation: Clusters are formed from connected core points and their density-reachable points.
  • Noise: Points not density-reachable from any core point are labeled as noise (label -1 in scikit-learn); the outlier [25, 80] above receives this label (see the sketch below).
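
To inspect these roles programmatically, here is a small follow-up sketch that reuses the dbscan object fitted above; core_sample_indices_ is scikit-learn's record of which points were identified as core points:

labels = dbscan.labels_

# Number of clusters found, ignoring the noise label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)

# Indices of the core points identified during fitting
print("Core point indices:", dbscan.core_sample_indices_)

# Points labeled -1 are noise
print("Noise point indices:", np.where(labels == -1)[0])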

Practical Exercises

Exercise 1: Implement K-Means Clustering

Task: Implement the K-Means clustering algorithm from scratch and apply it to a sample dataset.

Solution:

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6):
    # Randomly initialize centroids by sampling k distinct data points
    centroids = X[np.random.choice(X.shape[0], k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)

        # Update step: move each centroid to the mean of its assigned points;
        # if a cluster is empty, keep its previous centroid to avoid a NaN mean
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])

        # Convergence check: stop when centroids (almost) stop moving;
        # np.allclose avoids exact float equality comparisons
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids

    return centroids, labels

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Apply K-Means
centroids, labels = kmeans(X, k=2)

print("Cluster centers:", centroids)
print("Labels:", labels)

Exercise 2: Apply DBSCAN Clustering

Task: Apply the DBSCAN clustering algorithm to a sample dataset and visualize the results.

Solution:

from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# Apply DBSCAN
dbscan = DBSCAN(eps=3, min_samples=2).fit(X)

# Labels of each point
labels = dbscan.labels_

# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()

Summary

In this section, we explored clustering algorithms, focusing on K-Means and DBSCAN. We discussed their key concepts and algorithm steps, and worked through practical examples and exercises. Clustering is a powerful tool for discovering patterns and structure in data, and understanding these algorithms is essential for advanced data analysis and machine learning tasks.

Next, we will delve into the final module, where we will explore case studies and applications of these algorithms in real-world scenarios.
