ML Case Study: K-Means Clustering in NumPy
Building a real machine learning algorithm from scratch with NumPy teaches how vectorization, broadcasting, and optimization techniques work together. K-Means clustering—partitioning data into K groups by iteratively updating cluster centers—is an ideal case study: it's simple enough to implement cleanly, complex enough to showcase vectorization challenges, and widely used in practice. This article walks through the implementation, optimizations, and comparisons with library implementations.
Algorithm Overview: K-Means in Plain English
K-Means clusters N data points into K groups:
- Initialize: randomly pick K cluster centers (centroids).
- Assign: compute distances from each point to each centroid; assign each point to the nearest centroid.
- Update: move each centroid to the mean of its assigned points.
- Repeat: iterate steps 2-3 until convergence (centroids stop moving).
The algorithm minimizes within-cluster variance (sum of squared distances from points to their assigned centroid).
Implementation 1: Explicit Loops (Slow Baseline)
Let's start with a straightforward loop-based implementation to establish a baseline:
import numpy as np
def kmeans_loop(X, K, max_iterations=100, random_state=42):
"""K-Means with explicit loops (slow baseline)."""
np.random.seed(random_state)
n_samples, n_features = X.shape
# Initialize centroids: randomly select K points from X
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].copy()
for iteration in range(max_iterations):
# Step 1: Assign points to nearest centroid (explicit loop)
assignments = np.empty(n_samples, dtype=int)
for i in range(n_samples):
distances = np.empty(K)
for k in range(K):
# Euclidean distance
distances[k] = np.sum((X[i] - centroids[k])**2)
assignments[i] = np.argmin(distances)
# Step 2: Update centroids (explicit loop)
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)
# Check convergence
if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break
return centroids, assignments
# Test on sample data
np.random.seed(42)
X = np.random.randn(1000, 2) # 1000 points in 2D
centroids, assignments = kmeans_loop(X, K=3)
print(f"Centroids shape: {centroids.shape}")
print(f"Unique clusters: {np.unique(assignments)}")
This explicit-loop version is readable but slow due to Python interpreter overhead.
Implementation 2: Vectorized (Fast)
Now, replace loops with NumPy vectorized operations using broadcasting:
import numpy as np
import timeit
def kmeans_vectorized(X, K, max_iterations=100, random_state=42):
"""K-Means with full vectorization."""
np.random.seed(random_state)
n_samples, n_features = X.shape
# Initialize centroids
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].copy()
for iteration in range(max_iterations):
# Step 1: Compute distances (vectorized)
# Shape: (n_samples, 1, n_features) - (1, K, n_features) = (n_samples, K, n_features)
diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.sum(diff**2, axis=2) # shape (n_samples, K)
assignments = np.argmin(distances, axis=1) # shape (n_samples,)
# Step 2: Update centroids (vectorized)
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)
# Check convergence
if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break
return centroids, assignments
# Benchmark
np.random.seed(42)
X = np.random.randn(1000, 2)
loop_time = timeit.timeit(lambda: kmeans_loop(X, K=3, max_iterations=20), number=5)
vec_time = timeit.timeit(lambda: kmeans_vectorized(X, K=3, max_iterations=20), number=5)
print(f"Loop version: {loop_time:.4f}s per run")
print(f"Vectorized: {vec_time:.4f}s per run")
print(f"Speedup: {loop_time / vec_time:.1f}x")
# Expected: 20-50x speedup
The key optimization: computing all distances at once with broadcasting instead of nested loops.
Implementation 3: Further Optimizations
Pre-compute squared norms and use the identity ||x - c||^2 = ||x||^2 - 2x·c + ||c||^2:
import numpy as np
def kmeans_optimized(X, K, max_iterations=100, random_state=42):
"""K-Means with mathematical optimization."""
np.random.seed(random_state)
n_samples, n_features = X.shape
# Initialize
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].astype(np.float32).copy() # Use float32 to save memory
# Pre-compute squared norms of X
X_squared = np.sum(X**2, axis=1, keepdims=True) # shape (n_samples, 1)
for iteration in range(max_iterations):
# Compute distances using identity: ||x - c||^2 = ||x||^2 - 2x·c + ||c||^2
C_squared = np.sum(centroids**2, axis=1, keepdims=True) # shape (1, K)
cross_term = X @ centroids.T # shape (n_samples, K)
# ||X - C||^2 = ||X||^2 + ||C||^2 - 2 X·C
distances = X_squared + C_squared.T - 2 * cross_term
assignments = np.argmin(distances, axis=1)
# Update centroids
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)
if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break
return centroids.astype(X.dtype), assignments
# Benchmark optimized version
X = np.random.randn(10000, 50) # Larger dataset
opt_time = timeit.timeit(lambda: kmeans_optimized(X, K=10, max_iterations=20), number=5)
print(f"Optimized version: {opt_time:.4f}s per run")
Using the algebraic identity avoids computing squared distances naively, reducing operations from O(nkd) to O(nk + n·d + k·d).
Full Implementation with Features
import numpy as np
class KMeans:
"""K-Means clustering with NumPy."""
def __init__(self, K=3, max_iterations=100, tol=1e-4, random_state=42):
self.K = K
self.max_iterations = max_iterations
self.tol = tol
self.random_state = random_state
self.centroids = None
self.assignments = None
self.history = {'inertia': [], 'iterations': 0}
def fit(self, X):
"""Fit K-Means to data."""
np.random.seed(self.random_state)
n_samples = X.shape[0]
# Initialize centroids
indices = np.random.choice(n_samples, self.K, replace=False)
self.centroids = X[indices].astype(np.float32).copy()
for iteration in range(self.max_iterations):
# Assign
distances = self._compute_distances(X)
assignments = np.argmin(distances, axis=1)
# Compute inertia (sum of squared distances)
inertia = np.sum(np.min(distances, axis=1))
self.history['inertia'].append(inertia)
# Update
old_centroids = self.centroids.copy()
for k in range(self.K):
mask = assignments == k
if np.sum(mask) > 0:
self.centroids[k] = X[mask].mean(axis=0)
# Check convergence
if np.max(np.abs(old_centroids - self.centroids)) < self.tol:
self.history['iterations'] = iteration
print(f"Converged at iteration {iteration}, inertia: {inertia:.4f}")
break
self.assignments = assignments
return self
def _compute_distances(self, X):
"""Compute squared distances to all centroids."""
X_sq = np.sum(X**2, axis=1, keepdims=True)
C_sq = np.sum(self.centroids**2, axis=1, keepdims=True)
cross = X @ self.centroids.T
distances = X_sq + C_sq.T - 2 * cross
return np.maximum(distances, 0) # Numerical stability
def predict(self, X):
"""Assign points to nearest centroid."""
distances = self._compute_distances(X)
return np.argmin(distances, axis=1)
def inertia(self, X):
"""Compute inertia (within-cluster sum of squares)."""
distances = self._compute_distances(X)
return np.sum(np.min(distances, axis=1))
# Usage
np.random.seed(42)
X = np.random.randn(1000, 10)
kmeans = KMeans(K=5, max_iterations=50)
kmeans.fit(X)
print(f"Final inertia: {kmeans.history['inertia'][-1]:.4f}")
# Predict on new data
X_new = np.random.randn(100, 10)
predictions = kmeans.predict(X_new)
print(f"Assignments shape: {predictions.shape}")
print(f"Unique clusters: {np.unique(predictions)}")
Comparison with Scikit-Learn
Verify our implementation matches a library:
from sklearn.cluster import KMeans as SklearnKMeans
import numpy as np
np.random.seed(42)
X = np.random.randn(1000, 10)
# Our implementation
kmeans_ours = KMeans(K=5, max_iterations=50, random_state=42)
kmeans_ours.fit(X)
inertia_ours = kmeans_ours.inertia(X)
# Scikit-learn
kmeans_sklearn = SklearnKMeans(n_clusters=5, max_iter=50, random_state=42, n_init=1)
kmeans_sklearn.fit(X)
inertia_sklearn = kmeans_sklearn.inertia_
print(f"Our inertia: {inertia_ours:.4f}")
print(f"Scikit-learn inertia: {inertia_sklearn:.4f}")
print(f"Difference: {abs(inertia_ours - inertia_sklearn):.4f}")
# Should be very close (differences due to initialization order)
Key Vectorization Lessons
- Broadcasting: Use
X[:, np.newaxis, :]to align dimensions for pairwise operations. - Algebraic identity:
||x - c||^2 = ||x||^2 - 2x·c + ||c||^2reduces computation. - Avoid inner loops: Pre-compute values and use vectorized updates where possible.
- Memory vs speed: float32 saves memory; float64 is more accurate. Choose based on data scale.
- Convergence: Monitor inertia or centroid changes to detect convergence early.
Key Takeaways
- Vectorization provides 20–50x speedup over explicit loops even for complex algorithms.
- Broadcasting enables computing pairwise distances without nested loops.
- Algebraic tricks (e.g., distance formula identity) reduce algorithmic complexity.
- Implementing ML algorithms from scratch deepens understanding; compare with libraries to validate.
- Inertia (within-cluster sum of squares) is the objective K-Means minimizes; use it to evaluate convergence and cluster quality.
Frequently Asked Questions
Why does K-Means sometimes get stuck in local minima?
K-Means is not guaranteed to find the global optimum; it converges to a local minimum. Use multiple random initializations (random_state) and choose the result with lowest inertia, or try K-Means++ initialization.
How do I choose the number of clusters K?
Use the elbow method: fit K-Means for K = 1, 2, ..., 10 and plot inertia vs K. Look for an "elbow" where inertia decrease slows. Alternatively, use silhouette score or gap statistic.
Is K-Means suitable for high-dimensional data?
K-Means works but less effectively in high dimensions (curse of dimensionality). Consider dimensionality reduction (PCA) first, or use Gaussian Mixture Models (GMM) with covariance structure.
Can I use K-Means for categorical data?
No, K-Means requires continuous features and Euclidean distance. For categorical data, use K-Modes or hierarchical clustering.
How do I handle initialization sensitivity?
Use K-Means++ initialization (scikit-learn's default) instead of random, or run multiple times and select the best. K-Means++ gives better initial centroids, leading to faster convergence.