Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. Unlike partitioning methods such as k-means, it does not require the number of clusters to be specified in advance, which makes it particularly useful for exploring the underlying structure of the data.

Key Concepts

Types of Hierarchical Clustering

  1. Agglomerative (Bottom-Up) Clustering:

    • Starts with each data point as a single cluster.
    • Iteratively merges the closest pair of clusters until only one cluster remains or a stopping criterion is met.
  2. Divisive (Top-Down) Clustering:

    • Starts with all data points in a single cluster.
    • Iteratively splits the clusters into smaller clusters until each cluster contains a single data point or a stopping criterion is met.
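
The bottom-up procedure can be sketched in a few lines of NumPy. This is only an illustrative, unoptimized sketch: the function name naive_agglomerative is hypothetical, single linkage (closest pair of points) is assumed as the distance between clusters, and the stopping criterion is assumed to be a target number of clusters.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Hypothetical helper: merge the two closest clusters until n_clusters remain."""
    points = np.asarray(points, dtype=float)
    # Start with each data point (by index) as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Search every pair of clusters for the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

clusters, merges = naive_agglomerative(
    [[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]], n_clusters=2
)
print(clusters)
```

Library implementations such as scipy.cluster.hierarchy.linkage avoid this repeated pairwise search, so the sketch is meant for understanding rather than for real datasets.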

Dendrogram

  • A dendrogram is a tree-like diagram that records the sequence of merges or splits.
  • The y-axis represents the distance or dissimilarity at which clusters are merged (or split).
  • The x-axis represents the individual data points, which form the leaves of the tree.
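
To make the two axes concrete, the sketch below asks SciPy's dendrogram function for its layout without drawing anything (no_plot=True). The five points and the labels A to E are illustrative assumptions; the 'ivl' entry of the returned dictionary holds the leaf labels in x-axis order, and the top of each link in 'dcoord' is a merge height on the y-axis.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Illustrative data: five 2D points labelled A to E (an assumption for this sketch).
data = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])
Z = linkage(data, method='ward')

# no_plot=True returns the layout as a dictionary instead of drawing it.
tree = dendrogram(Z, no_plot=True, labels=['A', 'B', 'C', 'D', 'E'])

leaf_order = tree['ivl']  # x-axis: data points in left-to-right order
merge_heights = sorted(max(link) for link in tree['dcoord'])  # y-axis: height of each merge

print("Leaf order:", leaf_order)
print("Merge heights:", np.round(merge_heights, 3))
```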

Linkage Criteria

  • Single Linkage: Distance between the two closest points, one from each cluster.
  • Complete Linkage: Distance between the two farthest points, one from each cluster.
  • Average Linkage: Average distance over all pairs of points, taking one point from each cluster.
  • Ward's Method: Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
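
The criteria above can be compared on two small clusters. In this sketch the cluster contents are made up for illustration, and the Ward quantity is computed with the standard identity for the increase in within-cluster sum of squares caused by a merge, |A||B| / (|A| + |B|) times the squared distance between the centroids.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters (assumed for this sketch).
cluster_a = np.array([[1.0, 2.0], [2.0, 3.0]])
cluster_b = np.array([[5.0, 6.0], [8.0, 8.0]])

# All pairwise Euclidean distances between a point in A and a point in B.
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B were merged.
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
ward_increase = (n_a * n_b) / (n_a + n_b) * np.dot(centroid_gap, centroid_gap)

print(f"single={single:.3f} complete={complete:.3f} "
      f"average={average:.3f} ward_increase={ward_increase:.3f}")
```

Single linkage never exceeds average linkage, which in turn never exceeds complete linkage, so the choice of criterion changes which clusters look "closest" at each step.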

Steps in Hierarchical Clustering

  1. Calculate the Distance Matrix:

    • Compute the pairwise distance between all data points using a distance metric (e.g., Euclidean distance).
  2. Choose a Linkage Criterion:

    • Decide on the method to measure the distance between clusters (e.g., single, complete, average, Ward's).
  3. Build the Dendrogram:

    • Start with each data point as its own cluster.
    • Merge the closest clusters based on the chosen linkage criterion.
    • Repeat until all points are merged into a single cluster.
  4. Cut the Dendrogram:

    • Cut the dendrogram at the desired height; each branch that crosses the cut becomes one cluster, so the height of the cut determines the number of clusters.
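
The four steps map fairly directly onto SciPy calls. The sketch below is one possible arrangement, assuming Euclidean distance, average linkage, and a cut that yields two clusters; the five points are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

points = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]], dtype=float)

# Step 1: pairwise distances in condensed form (one entry per pair of points).
condensed = pdist(points, metric='euclidean')
square = squareform(condensed)  # full 5 x 5 matrix, handy for inspection

# Steps 2 and 3: choose a linkage criterion and build the merge hierarchy.
Z = linkage(condensed, method='average')

# Step 4: cut the hierarchy so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')

print(np.round(square, 2))
print("Cluster labels:", labels)
```

Passing the condensed distances to linkage keeps step 1 explicit; passing the raw points instead lets linkage compute the distances itself.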

Practical Example

Example Data

Let's consider a simple dataset with five points in 2D space:

Point   X   Y
A       1   2
B       2   3
C       3   4
D       5   6
E       8   8

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])

# Perform agglomerative clustering with Ward's method
Z = linkage(data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()

Explanation

  1. Data Preparation:

    • We create a numpy array with our sample data points.
  2. Linkage Calculation:

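    • We use the linkage function from scipy.cluster.hierarchy to perform hierarchical clustering. Here, we use Ward's method. The function returns a linkage matrix Z with one row per merge: the indices of the two merged clusters, the distance between them, and the number of points in the new cluster.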
  3. Dendrogram Plotting:

    • We plot the dendrogram using the dendrogram function. The labels correspond to our data points.
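
The linkage matrix behind the plot can also be read directly. The sketch below, which assumes the same five points and labels, turns each row of Z into a readable merge record; indices 0 to 4 refer to the original points, and each new cluster receives the next free index.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

labels = ['A', 'B', 'C', 'D', 'E']
data = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])
Z = linkage(data, method='ward')

# names[k] is the label of cluster k; merged clusters are appended as they form.
names = list(labels)
merge_log = []
for left, right, height, size in Z:
    left_name, right_name = names[int(left)], names[int(right)]
    merged = left_name + right_name
    names.append(merged)
    merge_log.append((merged, float(height), int(size)))
    print(f"{left_name} + {right_name} merged at height {height:.3f} "
          f"({int(size)} points)")
```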

Practical Exercise

Exercise

Given the following dataset, perform hierarchical clustering using complete linkage and plot the dendrogram.

Point   X   Y
F       1   1
G       2   1
H       4   3
I       5   4
J       7   5

Solution

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data
data = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [7, 5]])

# Perform agglomerative clustering with complete linkage
Z = linkage(data, method='complete')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=['F', 'G', 'H', 'I', 'J'])
plt.title('Dendrogram for Hierarchical Clustering (Complete Linkage)')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()

Explanation

  1. Data Preparation:

    • We create a numpy array with our sample data points.
  2. Linkage Calculation:

    • We use the linkage function from scipy.cluster.hierarchy to perform hierarchical clustering with complete linkage.
  3. Dendrogram Plotting:

    • We plot the dendrogram using the dendrogram function. The labels correspond to our data points.
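
As an optional follow-up to the exercise, the tree can be cut at a fixed height rather than at a fixed number of clusters. The threshold of 4.0 below is an assumption chosen for illustration; with complete linkage, every cluster produced by the cut has a maximum internal distance of at most that threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

names = ['F', 'G', 'H', 'I', 'J']
data = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [7, 5]])
Z = linkage(data, method='complete')

# Cut the tree at height 4.0 (an illustrative threshold).
cluster_ids = fcluster(Z, t=4.0, criterion='distance')

for name, cluster_id in zip(names, cluster_ids):
    print(f"{name}: cluster {cluster_id}")
```

For this dataset the cut separates F and G from H, I, and J, because the final merge of those two groups happens well above the threshold.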

Summary

Hierarchical clustering is a versatile clustering technique that builds a hierarchy of clusters without requiring the number of clusters to be specified in advance. By understanding the different types of hierarchical clustering, linkage criteria, and how to interpret dendrograms, you can effectively apply this method to various datasets.

© Copyright 2024. All rights reserved