Introduction to K-means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It is used to partition a dataset into K distinct, non-overlapping subsets (clusters). The goal is to group similar data points together while ensuring that data points in different clusters are as dissimilar as possible.
Key Concepts
- Centroids: The center of each cluster, computed as the mean of the data points assigned to it.
- Inertia: The sum of squared distances between each data point and its nearest centroid.
- Iterations: The process of updating centroids and reassigning data points to the nearest centroid until convergence.
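To make the definition of inertia concrete, here is a minimal sketch that computes it by hand with NumPy. The function name `inertia` and the toy arrays `X_toy` and `centroids_toy` are illustrative assumptions, not part of the original example.

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centroids)
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes only its distance to the closest centroid
    return sq_dists.min(axis=1).sum()

# Four points arranged as two vertical pairs (hypothetical toy data)
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids_toy = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point lies at distance 1 from its nearest centroid: 4 * 1**2
print(inertia(X_toy, centroids_toy))  # 4.0
```

Lower inertia means tighter clusters, but inertia never increases as K grows, so it cannot be compared across different values of K without care.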
Steps of K-means Algorithm
- Initialization: Randomly select K initial centroids from the dataset.
- Assignment: Assign each data point to the nearest centroid.
- Update: Calculate the new centroids as the mean of all data points assigned to each centroid.
- Repeat: Repeat the assignment and update steps until the centroids no longer change or a maximum number of iterations is reached.
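The four steps above can be sketched directly in NumPy. This is a simplified illustration under stated assumptions, not how scikit-learn implements the algorithm: the function name `kmeans_sketch`, its parameters, and the demo data `X_demo` are hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-means with random initialization (illustrative sketch only)."""
    rng = np.random.default_rng(seed)

    def assign(centroids):
        # Squared distance from every point to every centroid
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return sq_dists.argmin(axis=1)

    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid
        labels = assign(centroids)
        # Update: each centroid becomes the mean of its assigned points
        # (assumption: an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final labels consistent with the returned centroids
    return centroids, assign(centroids)

# Demo on two well-separated blobs (hypothetical data)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

On data this well separated, the two centroids should end up near the blob centers regardless of which points are picked at initialization. On harder data, the result can depend on the starting centroids.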
Example
Let's illustrate K-means clustering with a simple example using Python and the sklearn library.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 2)

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot the data points and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
Explanation
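- Data Generation: We generate 100 random data points with two features, drawn uniformly from [0, 1). This data has no natural cluster structure, so K-means simply partitions the unit square into three regions.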
- K-means Application: We apply K-means clustering with 3 clusters.
- Plotting: We plot the data points and centroids to visualize the clustering.
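Beyond plotting, the fitted model can be inspected numerically. The sketch below rebuilds the same data and model so that it runs on its own. The explicit `n_init=10` and the `new_points` array are assumptions added for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same sample data as in the example
np.random.seed(42)
X = np.random.rand(100, 2)

# n_init=10 is set explicitly (an assumption) so behavior does not
# depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Number of points assigned to each cluster
sizes = np.bincount(kmeans.labels_, minlength=3)
print("Cluster sizes:", sizes)

# Inertia of the final clustering
print("Inertia:", kmeans.inertia_)

# Assign previously unseen points to the nearest learned centroid
new_points = np.array([[0.1, 0.1], [0.9, 0.9]])  # hypothetical points
predicted = kmeans.predict(new_points)
print("Predicted clusters:", predicted)
```

Note that the cluster numbers themselves are arbitrary: they identify groups but carry no ordering or meaning.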
Practical Exercises
Exercise 1: Implement K-means Clustering
Task: Implement K-means clustering on a given dataset and visualize the results.
Dataset: Use the Iris dataset from sklearn.datasets.
Solution:
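```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Convert to DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['Cluster'] = labels

# Display the first few rows of the DataFrame
print(df.head())

# Plot the first two features, colored by assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-means Clustering on Iris')
plt.show()
```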
Exercise 2: Determine Optimal Number of Clusters
Task: Use the Elbow Method to determine the optimal number of clusters for the Iris dataset.
Solution:
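```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris dataset
X = load_iris().data

# Calculate inertia for different values of K
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve; the "elbow" is the K after which inertia
# decreases only slowly
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
```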
Common Mistakes and Tips
- Initialization Sensitivity: K-means can be sensitive to the initial placement of centroids. Using the k-means++ initialization method, which is the default in sklearn's KMeans, can help mitigate this issue.
- Scaling Data: Ensure that your data is scaled properly before applying K-means. The algorithm relies on Euclidean distances, so features with larger numeric ranges dominate the clustering.
- Choosing K: Use methods like the Elbow Method or Silhouette Score to determine the optimal number of clusters.
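The tips above can be combined in one short sketch: standardize the features, request k-means++ initialization explicitly, and compare silhouette scores across several values of K. The range of K tested and the `n_init=10` setting are assumptions chosen for illustration. The silhouette score requires at least two clusters, so the loop starts at K=2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Scale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    # k-means++ spreads the initial centroids apart; n_init=10 reruns the
    # algorithm from 10 initializations and keeps the lowest-inertia result
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X_scaled)
    # Mean silhouette coefficient over all samples, in [-1, 1]
    scores[k] = silhouette_score(X_scaled, cluster_labels)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette score at K =", best_k)
```

A higher silhouette score indicates better-separated clusters, but the K it favors may differ from the one suggested by the Elbow Method, so it is worth checking both alongside domain knowledge.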
Conclusion
K-means clustering is a powerful and widely used algorithm for partitioning data into meaningful clusters. By understanding its key concepts, steps, and practical applications, you can effectively use K-means to analyze and interpret your data. In the next section, we will explore another popular clustering technique: Hierarchical Clustering.