Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Unlike partition-based techniques such as k-means, hierarchical clustering does not require the number of clusters to be specified in advance, which makes it particularly useful for exploring the underlying structure of a dataset.
Key Concepts
Types of Hierarchical Clustering
- Agglomerative (Bottom-Up) Clustering:
  - Starts with each data point as its own cluster.
  - Iteratively merges the closest pairs of clusters until only one cluster remains or a stopping criterion is met (illustrated in the sketch after this list).
- Divisive (Top-Down) Clustering:
  - Starts with all data points in a single cluster.
  - Iteratively splits clusters into smaller clusters until each cluster contains a single data point or a stopping criterion is met.
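Agglomerative clustering is generally the more common of the two in practice. As a minimal sketch of the bottom-up approach, the following uses scikit-learn's `AgglomerativeClustering` on a small made-up 2D dataset (scikit-learn is an assumption here; the main example later in this lesson uses SciPy instead):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2D points for illustration
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])

# Bottom-up: each point starts as its own cluster; the closest
# clusters are merged until only n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)
print(labels)  # one cluster label per point
```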
Dendrogram
- A dendrogram is a tree-like diagram that records the sequences of merges or splits.
- The y-axis represents the distance or dissimilarity between clusters.
- The x-axis represents the individual data points.
Linkage Criteria
- Single Linkage: Distance between the closest points of the clusters.
- Complete Linkage: Distance between the farthest points of the clusters.
- Average Linkage: Average distance between all pairs of points in the clusters.
- Ward's Method: Minimizes the increase in total within-cluster variance at each merge. (The first three criteria are computed by hand in the sketch below.)
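To make the first three criteria concrete, here is a minimal sketch (the two clusters are made-up example point sets) that computes the single, complete, and average linkage distances directly with NumPy and SciPy. Ward's method is not a simple pairwise distance and is handled internally by SciPy's `linkage` function:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2D points
cluster_a = np.array([[1.0, 2.0], [2.0, 3.0]])
cluster_b = np.array([[5.0, 6.0], [8.0, 8.0]])

# All pairwise Euclidean distances between points of A and points of B
d = cdist(cluster_a, cluster_b)

print('single  :', d.min())   # distance between the closest pair
print('complete:', d.max())   # distance between the farthest pair
print('average :', d.mean())  # mean distance over all pairs
```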
Steps in Hierarchical Clustering
1. Calculate the Distance Matrix:
   - Compute the pairwise distance between all data points using a distance metric (e.g., Euclidean distance).
2. Choose a Linkage Criterion:
   - Decide on the method for measuring the distance between clusters (e.g., single, complete, average, Ward's).
3. Build the Dendrogram:
   - Start with each data point as its own cluster.
   - Merge the closest clusters based on the chosen linkage criterion.
   - Repeat until all points are merged into a single cluster.
4. Cut the Dendrogram:
   - Determine the number of clusters by cutting the dendrogram at the desired level (see the sketch after this list).
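A minimal end-to-end sketch of these four steps, using SciPy's `pdist`/`squareform` for the distance matrix and `fcluster` to cut the hierarchy (the data array anticipates the five example points introduced in the next section):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])

# Step 1: pairwise Euclidean distances (condensed vector; squareform
# turns it into the full symmetric distance matrix for inspection)
condensed = pdist(data, metric='euclidean')
print(squareform(condensed))

# Steps 2-3: choose a linkage criterion and build the hierarchy
Z = linkage(condensed, method='ward')

# Step 4: cut the dendrogram into a desired number of flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # one cluster label per point
```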
Practical Example
Example Data
Let's consider a simple dataset with 5 points in a 2D space:
| Point | X | Y |
|-------|---|---|
| A     | 1 | 2 |
| B     | 2 | 3 |
| C     | 3 | 4 |
| D     | 5 | 6 |
| E     | 8 | 8 |
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [8, 8]])

# Perform hierarchical/agglomerative clustering
Z = linkage(data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
```
Explanation
- Data Preparation:
  - We create a NumPy array with our sample data points.
- Linkage Calculation:
  - We use the `linkage` function from `scipy.cluster.hierarchy` to perform hierarchical clustering. Here, we use Ward's method.
- Dendrogram Plotting:
  - We plot the dendrogram using the `dendrogram` function. The labels correspond to our data points.
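To turn the plotted hierarchy into concrete cluster assignments, one option is SciPy's `fcluster`, which cuts the dendrogram at a height on its y-axis. A minimal sketch reusing the `Z` computed above (the threshold 5.0 is an arbitrary example height, not a value prescribed by this lesson):

```python
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at distance 5.0 on the y-axis; every merge
# above that height is undone, leaving the flat clusters below it.
labels = fcluster(Z, t=5.0, criterion='distance')
print(labels)  # one cluster label per point
```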
Practical Exercise
Exercise
Given the following dataset, perform hierarchical clustering using complete linkage and plot the dendrogram.
| Point | X | Y |
|-------|---|---|
| F     | 1 | 1 |
| G     | 2 | 1 |
| H     | 4 | 3 |
| I     | 5 | 4 |
| J     | 7 | 5 |
Solution
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data
data = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [7, 5]])

# Perform hierarchical/agglomerative clustering
Z = linkage(data, method='complete')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=['F', 'G', 'H', 'I', 'J'])
plt.title('Dendrogram for Hierarchical Clustering (Complete Linkage)')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
```
Explanation
- Data Preparation:
  - We create a NumPy array with our sample data points.
- Linkage Calculation:
  - We use the `linkage` function from `scipy.cluster.hierarchy` to perform hierarchical clustering with complete linkage.
- Dendrogram Plotting:
  - We plot the dendrogram using the `dendrogram` function. The labels correspond to our data points.
Summary
Hierarchical clustering is a versatile clustering technique that builds a hierarchy of clusters without requiring the number of clusters to be specified in advance. By understanding the different types of hierarchical clustering, linkage criteria, and how to interpret dendrograms, you can effectively apply this method to various datasets.