Introduction to K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It partitions a dataset into K distinct, non-overlapping subsets (clusters), with each data point belonging to the cluster whose center is nearest. The goal is to group similar data points together by making the points within each cluster as close to that cluster's center as possible.

Key Concepts

  1. Centroid: The center of a cluster, computed as the mean of the data points assigned to it.
  2. Inertia: The sum of squared distances between each data point and its nearest centroid. K-means tries to make this value as small as possible.
  3. Iteration: One pass of reassigning data points to the nearest centroid and then updating the centroids. Iterations repeat until convergence.
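
To make the definition of inertia concrete, here is a minimal NumPy sketch that computes it by hand. The function name inertia and the four sample points are made up for illustration; they are not part of scikit-learn.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Keep only the distance to the nearest centroid for each point
    return sq_dists.min(axis=1).sum()

# Four hypothetical 2-D points forming two tight pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Each point lies exactly 1 unit from its nearest centroid, so the sum is 4.0
print(inertia(points, centroids))
```

A fitted scikit-learn model exposes the same quantity as kmeans.inertia_.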

Steps of K-means Algorithm

  1. Initialization: Randomly select K initial centroids from the dataset.
  2. Assignment: Assign each data point to the nearest centroid.
  3. Update: Calculate the new centroids as the mean of all data points assigned to each centroid.
  4. Repeat: Repeat the assignment and update steps until the centroids no longer change or a maximum number of iterations is reached.
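
The four steps above can be sketched in plain NumPy. This is an illustrative implementation under simple assumptions (random initialization from the data, Euclidean distance, an empty cluster keeps its previous centroid), not scikit-learn's actual implementation. The names kmeans_sketch and assign_labels and the synthetic data are hypothetical.

```python
import numpy as np

def assign_labels(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point goes to its nearest centroid
        labels = assign_labels(X, centroids)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment so the labels match the returned centroids
    return centroids, assign_labels(X, centroids)

# Synthetic data: two groups of 50 points around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X, k=2)
print(np.round(centroids, 2))
print(np.bincount(labels, minlength=2))
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is why library implementations usually run the algorithm several times and keep the best result.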

Example

Let's illustrate K-means clustering with a simple example using Python and the scikit-learn library (imported as sklearn).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

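# Generate sample data: 100 points drawn uniformly from the unit square
np.random.seed(42)
X = np.random.rand(100, 2)

# Apply K-means clustering
# (n_init is set explicitly so the number of restarts does not depend on the scikit-learn version)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)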

# Get cluster centers and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot the data points and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Explanation

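  1. Data Generation: We generate 100 data points with two features, drawn uniformly at random from the unit square. Uniform data has no natural groups, so the clusters here simply split the square into three regions.
  2. K-means Application: We apply K-means clustering with 3 clusters, fixing random_state so the result is reproducible.
  3. Plotting: We plot the data points, colored by cluster label, and mark the centroids with red crosses to visualize the clustering.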
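
For data that does contain distinct groups, scikit-learn's make_blobs helper can be used instead of uniform random numbers. The following is a small sketch; the parameter values (150 samples, 3 centers, cluster_std=0.6) and the variable names are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 150 points scattered around 3 centers (values chosen for illustration)
X_blobs, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Fit K-means and get the cluster label of every point in one call
kmeans_blobs = KMeans(n_clusters=3, n_init=10, random_state=42)
blob_labels = kmeans_blobs.fit_predict(X_blobs)

print(kmeans_blobs.cluster_centers_.shape)  # one 2-D centroid per cluster: (3, 2)

# A fitted model can also assign points to the nearest learned centroid
print(kmeans_blobs.predict(X_blobs[:5]))
```

The same plotting code as in the example can be reused by passing X_blobs and blob_labels in place of X and labels.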

Practical Exercises

Exercise 1: Implement K-means Clustering

Task: Implement K-means clustering on a given dataset and visualize the results.

Dataset: Use the Iris dataset from sklearn.datasets.

Solution:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset (150 samples, 4 features)
iris = load_iris()
X = iris.data

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Convert to DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['Cluster'] = labels

# Display the first few rows of the DataFrame
print(df.head())
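
One way to inspect the result further is to compare the cluster labels with the known species stored in iris.target. The sketch below builds a small count table with NumPy; the variable name table is an illustrative choice. Cluster numbers are arbitrary, so cluster 0 does not have to correspond to species 0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# Count how many samples of each species fall into each cluster
# Rows are cluster ids, columns are species (in the order of iris.target_names)
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (kmeans.labels_, iris.target), 1)

print(iris.target_names)
print(table)
```

A row dominated by a single column indicates a cluster that mostly matches one species.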

Exercise 2: Determine Optimal Number of Clusters

Task: Use the Elbow Method to determine the optimal number of clusters for the Iris dataset.

Solution:

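# Calculate inertia for different values of K (X is the Iris data from Exercise 1)
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve; the "elbow" is the K after which inertia decreases only slowly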
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

Common Mistakes and Tips

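  • Initialization Sensitivity: K-means can converge to different results depending on the initial placement of centroids. The k-means++ initialization method (the default in scikit-learn) and running the algorithm several times with different starting points help mitigate this issue.
  • Scaling Data: K-means relies on distances, so features with larger numeric ranges dominate the result. Scale your data (for example, by standardizing each feature) before applying K-means.
  • Choosing K: K must be chosen before running the algorithm. Use methods like the Elbow Method or the Silhouette Score to guide the choice.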
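
The three tips can be combined in one short sketch: standardize the features, use k-means++ initialization with several restarts, and compare silhouette scores across candidate values of K. The range of K (2 to 6) and the variable names are arbitrary choices for illustration; the silhouette score requires at least 2 clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Scaling: give every feature zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    # Initialization: k-means++ seeding, repeated 10 times; the best run is kept
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    cluster_labels = model.fit_predict(X_scaled)
    # Choosing K: silhouette scores range from -1 to 1, higher is better
    scores[k] = silhouette_score(X_scaled, cluster_labels)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette score at K =", best_k)
```

The silhouette score and the Elbow Method do not always agree, so it is worth looking at both and at the clusters themselves before settling on a value of K.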

Conclusion

K-means clustering is a powerful and widely used algorithm for partitioning data into meaningful clusters. By understanding its key concepts, steps, and practical applications, you can effectively use K-means to analyze and interpret your data. In the next section, we will explore another popular clustering technique: Hierarchical Clustering.
