Skip to main content

ML Case Study: K-Means Clustering in NumPy

Building a real machine learning algorithm from scratch with NumPy teaches how vectorization, broadcasting, and optimization techniques work together. K-Means clustering—partitioning data into K groups by iteratively updating cluster centers—is an ideal case study: it's simple enough to implement cleanly, complex enough to showcase vectorization challenges, and widely used in practice. This article walks through the implementation, optimizations, and comparisons with library implementations.

Algorithm Overview: K-Means in Plain English

K-Means clusters N data points into K groups:

  1. Initialize: randomly pick K cluster centers (centroids).
  2. Assign: compute distances from each point to each centroid; assign each point to the nearest centroid.
  3. Update: move each centroid to the mean of its assigned points.
  4. Repeat: iterate steps 2-3 until convergence (centroids stop moving).

The algorithm minimizes within-cluster variance (sum of squared distances from points to their assigned centroid).

Implementation 1: Explicit Loops (Slow Baseline)

Let's start with a straightforward loop-based implementation to establish a baseline:

import numpy as np

def kmeans_loop(X, K, max_iterations=100, random_state=42):
"""K-Means with explicit loops (slow baseline)."""
np.random.seed(random_state)
n_samples, n_features = X.shape

# Initialize centroids: randomly select K points from X
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].copy()

for iteration in range(max_iterations):
# Step 1: Assign points to nearest centroid (explicit loop)
assignments = np.empty(n_samples, dtype=int)
for i in range(n_samples):
distances = np.empty(K)
for k in range(K):
# Euclidean distance
distances[k] = np.sum((X[i] - centroids[k])**2)
assignments[i] = np.argmin(distances)

# Step 2: Update centroids (explicit loop)
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)

# Check convergence
if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break

return centroids, assignments

# Test on sample data
np.random.seed(42)
X = np.random.randn(1000, 2) # 1000 points in 2D
centroids, assignments = kmeans_loop(X, K=3)
print(f"Centroids shape: {centroids.shape}")
print(f"Unique clusters: {np.unique(assignments)}")

This explicit-loop version is readable but slow due to Python interpreter overhead.

Implementation 2: Vectorized (Fast)

Now, replace loops with NumPy vectorized operations using broadcasting:

import numpy as np
import timeit

def kmeans_vectorized(X, K, max_iterations=100, random_state=42):
"""K-Means with full vectorization."""
np.random.seed(random_state)
n_samples, n_features = X.shape

# Initialize centroids
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].copy()

for iteration in range(max_iterations):
# Step 1: Compute distances (vectorized)
# Shape: (n_samples, 1, n_features) - (1, K, n_features) = (n_samples, K, n_features)
diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.sum(diff**2, axis=2) # shape (n_samples, K)
assignments = np.argmin(distances, axis=1) # shape (n_samples,)

# Step 2: Update centroids (vectorized)
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)

# Check convergence
if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break

return centroids, assignments

# Benchmark
np.random.seed(42)
X = np.random.randn(1000, 2)

loop_time = timeit.timeit(lambda: kmeans_loop(X, K=3, max_iterations=20), number=5)
vec_time = timeit.timeit(lambda: kmeans_vectorized(X, K=3, max_iterations=20), number=5)

print(f"Loop version: {loop_time:.4f}s per run")
print(f"Vectorized: {vec_time:.4f}s per run")
print(f"Speedup: {loop_time / vec_time:.1f}x")
# Expected: 20-50x speedup

The key optimization: computing all distances at once with broadcasting instead of nested loops.

Implementation 3: Further Optimizations

Pre-compute squared norms and use the identity ||x - c||^2 = ||x||^2 - 2x·c + ||c||^2:

import numpy as np

def kmeans_optimized(X, K, max_iterations=100, random_state=42):
"""K-Means with mathematical optimization."""
np.random.seed(random_state)
n_samples, n_features = X.shape

# Initialize
indices = np.random.choice(n_samples, K, replace=False)
centroids = X[indices].astype(np.float32).copy() # Use float32 to save memory

# Pre-compute squared norms of X
X_squared = np.sum(X**2, axis=1, keepdims=True) # shape (n_samples, 1)

for iteration in range(max_iterations):
# Compute distances using identity: ||x - c||^2 = ||x||^2 - 2x·c + ||c||^2
C_squared = np.sum(centroids**2, axis=1, keepdims=True) # shape (1, K)
cross_term = X @ centroids.T # shape (n_samples, K)

# ||X - C||^2 = ||X||^2 + ||C||^2 - 2 X·C
distances = X_squared + C_squared.T - 2 * cross_term
assignments = np.argmin(distances, axis=1)

# Update centroids
old_centroids = centroids.copy()
for k in range(K):
mask = assignments == k
if np.sum(mask) > 0:
centroids[k] = X[mask].mean(axis=0)

if np.allclose(old_centroids, centroids):
print(f"Converged at iteration {iteration}")
break

return centroids.astype(X.dtype), assignments

# Benchmark optimized version
X = np.random.randn(10000, 50) # Larger dataset
opt_time = timeit.timeit(lambda: kmeans_optimized(X, K=10, max_iterations=20), number=5)
print(f"Optimized version: {opt_time:.4f}s per run")

Using the algebraic identity avoids computing squared distances naively, reducing operations from O(nkd) to O(nk + n·d + k·d).

Full Implementation with Features

import numpy as np

class KMeans:
"""K-Means clustering with NumPy."""

def __init__(self, K=3, max_iterations=100, tol=1e-4, random_state=42):
self.K = K
self.max_iterations = max_iterations
self.tol = tol
self.random_state = random_state
self.centroids = None
self.assignments = None
self.history = {'inertia': [], 'iterations': 0}

def fit(self, X):
"""Fit K-Means to data."""
np.random.seed(self.random_state)
n_samples = X.shape[0]

# Initialize centroids
indices = np.random.choice(n_samples, self.K, replace=False)
self.centroids = X[indices].astype(np.float32).copy()

for iteration in range(self.max_iterations):
# Assign
distances = self._compute_distances(X)
assignments = np.argmin(distances, axis=1)

# Compute inertia (sum of squared distances)
inertia = np.sum(np.min(distances, axis=1))
self.history['inertia'].append(inertia)

# Update
old_centroids = self.centroids.copy()
for k in range(self.K):
mask = assignments == k
if np.sum(mask) > 0:
self.centroids[k] = X[mask].mean(axis=0)

# Check convergence
if np.max(np.abs(old_centroids - self.centroids)) < self.tol:
self.history['iterations'] = iteration
print(f"Converged at iteration {iteration}, inertia: {inertia:.4f}")
break

self.assignments = assignments
return self

def _compute_distances(self, X):
"""Compute squared distances to all centroids."""
X_sq = np.sum(X**2, axis=1, keepdims=True)
C_sq = np.sum(self.centroids**2, axis=1, keepdims=True)
cross = X @ self.centroids.T
distances = X_sq + C_sq.T - 2 * cross
return np.maximum(distances, 0) # Numerical stability

def predict(self, X):
"""Assign points to nearest centroid."""
distances = self._compute_distances(X)
return np.argmin(distances, axis=1)

def inertia(self, X):
"""Compute inertia (within-cluster sum of squares)."""
distances = self._compute_distances(X)
return np.sum(np.min(distances, axis=1))

# Usage
np.random.seed(42)
X = np.random.randn(1000, 10)
kmeans = KMeans(K=5, max_iterations=50)
kmeans.fit(X)
print(f"Final inertia: {kmeans.history['inertia'][-1]:.4f}")

# Predict on new data
X_new = np.random.randn(100, 10)
predictions = kmeans.predict(X_new)
print(f"Assignments shape: {predictions.shape}")
print(f"Unique clusters: {np.unique(predictions)}")

Comparison with Scikit-Learn

Verify our implementation matches a library:

from sklearn.cluster import KMeans as SklearnKMeans
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 10)

# Our implementation
kmeans_ours = KMeans(K=5, max_iterations=50, random_state=42)
kmeans_ours.fit(X)
inertia_ours = kmeans_ours.inertia(X)

# Scikit-learn
kmeans_sklearn = SklearnKMeans(n_clusters=5, max_iter=50, random_state=42, n_init=1)
kmeans_sklearn.fit(X)
inertia_sklearn = kmeans_sklearn.inertia_

print(f"Our inertia: {inertia_ours:.4f}")
print(f"Scikit-learn inertia: {inertia_sklearn:.4f}")
print(f"Difference: {abs(inertia_ours - inertia_sklearn):.4f}")
# Should be very close (differences due to initialization order)

Key Vectorization Lessons

  1. Broadcasting: Use X[:, np.newaxis, :] to align dimensions for pairwise operations.
  2. Algebraic identity: ||x - c||^2 = ||x||^2 - 2x·c + ||c||^2 reduces computation.
  3. Avoid inner loops: Pre-compute values and use vectorized updates where possible.
  4. Memory vs speed: float32 saves memory; float64 is more accurate. Choose based on data scale.
  5. Convergence: Monitor inertia or centroid changes to detect convergence early.

Key Takeaways

  • Vectorization provides 20–50x speedup over explicit loops even for complex algorithms.
  • Broadcasting enables computing pairwise distances without nested loops.
  • Algebraic tricks (e.g., distance formula identity) reduce algorithmic complexity.
  • Implementing ML algorithms from scratch deepens understanding; compare with libraries to validate.
  • Inertia (within-cluster sum of squares) is the objective K-Means minimizes; use it to evaluate convergence and cluster quality.

Frequently Asked Questions

Why does K-Means sometimes get stuck in local minima?

K-Means is not guaranteed to find the global optimum; it converges to a local minimum. Use multiple random initializations (random_state) and choose the result with lowest inertia, or try K-Means++ initialization.

How do I choose the number of clusters K?

Use the elbow method: fit K-Means for K = 1, 2, ..., 10 and plot inertia vs K. Look for an "elbow" where inertia decrease slows. Alternatively, use silhouette score or gap statistic.

Is K-Means suitable for high-dimensional data?

K-Means works but less effectively in high dimensions (curse of dimensionality). Consider dimensionality reduction (PCA) first, or use Gaussian Mixture Models (GMM) with covariance structure.

Can I use K-Means for categorical data?

No, K-Means requires continuous features and Euclidean distance. For categorical data, use K-Modes or hierarchical clustering.

How do I handle initialization sensitivity?

Use K-Means++ initialization (scikit-learn's default) instead of random, or run multiple times and select the best. K-Means++ gives better initial centroids, leading to faster convergence.

Further Reading