Classification Metrics: Evaluate Your Models Right
Accuracy alone is a dangerous metric for classification. If 99% of your examples are negative, as is common in spam detection, a model that always predicts negative reaches 99% accuracy while catching zero spam. Precision, recall, F1, and ROC-AUC each measure a different aspect of model performance. The right metric depends on your task: precision for spam filtering (false positives are costly), recall for disease detection (false negatives are costly), and F1 when you care about both equally. The confusion matrix reveals exactly where your model fails.
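To make the accuracy trap concrete, here is a minimal sketch in plain Python. The inbox size (1,000 emails, 10 of them spam) is a made-up assumption chosen to match the 99% negative ratio, and the "model" is just a constant prediction.

```python
# Hypothetical inbox: 1,000 emails, of which 10 are spam (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts not-spam (label 0).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the spam class: fraction of actual spam that was flagged.
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
spam_recall = spam_caught / sum(y_true)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Spam recall: {spam_recall:.3f}")
```

The accuracy comes out at 0.990 while spam recall is 0.000, which is exactly the gap the other metrics are designed to expose.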
The Confusion Matrix: Foundation of All Metrics
The confusion matrix tabulates predictions vs. ground truth for binary classification:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load data: binary classification. scikit-learn encodes 0 = malignant, 1 = benign,
# so the default positive class (label 1) in the metrics below is "benign".
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Train a classifier
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix: rows = true labels, columns = predicted labels
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f" True Negatives (TN): {cm[0, 0]}")
print(f" False Positives (FP): {cm[0, 1]}")
print(f" False Negatives (FN): {cm[1, 0]}")
print(f" True Positives (TP): {cm[1, 1]}")
# Visualize
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.show()
The confusion matrix reveals:
- TN (top-left): Correctly predicted negative
- FP (top-right): Incorrectly predicted positive (false alarm)
- FN (bottom-left): Incorrectly predicted negative (missed case)
- TP (bottom-right): Correctly predicted positive
All other metrics derive from these four values.
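As a rough illustration of that claim, the sketch below derives the common metrics from the four cells alone. The function name `metrics_from_counts` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than the only possible one.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive standard binary metrics from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

# Illustrative counts, not taken from the breast cancer model above.
example = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.700, precision 0.800, recall 0.667, F1 0.727, and the false positive rate 0.250.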
Precision and Recall: Complementary Metrics
Precision and recall capture different aspects of classifier quality:
from sklearn.metrics import precision_score, recall_score, f1_score
# Precision: Of all positive predictions, how many were correct?
# TP / (TP + FP): high precision means few false alarms among the predicted positives
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")
# Recall: Of all actual positives, how many did we catch?
# TP / (TP + FN): high recall means few missed positives (false negatives)
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.3f}")
# F1: Harmonic mean of precision and recall
# 2 * (precision * recall) / (precision + recall)
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
| Metric | Formula | Answers | Best For |
|---|---|---|---|
| Precision | TP / (TP + FP) | "Of my positive predictions, how many are correct?" | Spam detection (false positives are costly) |
| Recall | TP / (TP + FN) | "Of all actual positives, how many did I find?" | Disease detection (false negatives are costly) |
| F1 | 2 × precision × recall / (precision + recall) | "How well do I balance precision and recall?" (harmonic mean) | Tasks where precision and recall matter equally |
High precision, low recall: the model is conservative, predicting positive rarely, but accurately. Low precision, high recall: the model is aggressive, predicting positive often, catching most cases but with false alarms.
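The two behaviours can be sketched on a toy example. The ten labels and both prediction lists below are invented for illustration, and `precision_recall` is a hypothetical helper written from the formulas in the table rather than a library function.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, label 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Ten made-up examples: the first four are actual positives.
y_true       = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # flags one case, and it is right
aggressive   = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # flags seven cases, three wrongly

p_cons, r_cons = precision_recall(y_true, conservative)
p_aggr, r_aggr = precision_recall(y_true, aggressive)
print(f"Conservative: precision={p_cons:.2f}, recall={r_cons:.2f}")
print(f"Aggressive:   precision={p_aggr:.2f}, recall={r_aggr:.2f}")
```

The conservative predictor scores precision 1.00 with recall 0.25, while the aggressive one scores precision 0.57 with recall 1.00.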
Trade-Off: Precision vs. Recall
Many classifiers have a decision threshold (e.g., probability 0.5). Lowering the threshold increases recall (you catch more positives) but usually decreases precision (you also raise more false alarms):
from sklearn.metrics import precision_recall_curve
import numpy as np
# Get predicted probabilities (not binary predictions)
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Compute precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
# Plot the curve
plt.figure(figsize=(10, 5))
plt.plot(recalls, precisions, label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid()
plt.show()
# Choose threshold based on task
# For disease detection (minimize false negatives): choose high recall (e.g., 0.9)
# For spam detection (minimize false positives): choose high precision (e.g., 0.95)
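# Find the highest threshold that still reaches 90% recall.
# recalls has one more entry than thresholds and never increases as the
# threshold rises, so take the last index (ignoring the final point) that meets the target.
idx_90_recall = np.where(recalls[:-1] >= 0.90)[0][-1]
threshold_90_recall = thresholds[idx_90_recall]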
print(f"Threshold for 90% recall: {threshold_90_recall:.3f}")
# Apply custom threshold
y_pred_custom = (y_pred_proba >= threshold_90_recall).astype(int)
precision_custom = precision_score(y_test, y_pred_custom)
recall_custom = recall_score(y_test, y_pred_custom)
print(f"Custom threshold - Precision: {precision_custom:.3f}, Recall: {recall_custom:.3f}")
Adjust the threshold to match your task's cost structure.
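One way to turn a cost structure into a threshold is to put a number on each kind of error and pick the threshold with the lowest total cost. The sketch below does this in plain Python; the scores, labels, and both cost values are assumptions made up for illustration, and in practice the scores would come from something like `model.predict_proba(...)[:, 1]` on a validation set.

```python
# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]

COST_FP = 1.0  # assumed cost of one false alarm
COST_FN = 5.0  # assumed cost of one missed positive

def total_cost(threshold):
    """Total cost of predicting positive whenever score >= threshold."""
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, scores))
    return COST_FP * fp + COST_FN * fn

# Only the observed scores can change the predictions, so test each one.
candidates = sorted(set(scores))
best_threshold = min(candidates, key=total_cost)

print(f"Best threshold: {best_threshold:.2f}")
print(f"Total cost:     {total_cost(best_threshold):.1f}")
```

Because a missed positive is assumed to cost five times as much as a false alarm, the search settles on a low threshold (0.40 here) that catches every positive at the price of two false alarms.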
ROC Curve and AUC: Threshold-Agnostic Evaluation
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs. False Positive Rate across all thresholds. AUC (Area Under Curve) is a single number summarizing overall discriminative ability:
from sklearn.metrics import roc_curve, auc, roc_auc_score
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# Compute AUC (area under curve)
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC: {roc_auc:.3f}")
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid()
plt.show()
How to read ROC-AUC values:
- 0.5 = random guessing
- 1.0 = perfect classification
- Above 0.5 = better than random
- Below 0.5 = worse than random (the ranking is inverted)
ROC-AUC is threshold-agnostic: it evaluates the model across all decision boundaries, not just the default 0.5.
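To see what "across all decision boundaries" means, the sketch below builds the ROC points by sweeping every distinct score as a threshold and then integrates them with the trapezoid rule. It is a from-scratch illustration on made-up data with hypothetical helper names, not a description of how scikit-learn implements `roc_curve`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping thresholds from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= thr for t, s in zip(y_true, scores))
        fp = sum(t == 0 and s >= thr for t, s in zip(y_true, scores))
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and scores: three negatives, three positives.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(y_true, scores)
toy_auc = trapezoid_auc(points)
print(f"ROC points: {points}")
print(f"AUC: {toy_auc:.3f}")
```

No single threshold is singled out: every threshold contributes one point, and the area (0.889 for this toy data) summarizes them all.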
Metrics for Imbalanced Classification
When classes are imbalanced (e.g., 95% negative, 5% positive), the averaging mode you pass to average= determines how much weight the minority class gets:
from sklearn.metrics import precision_score, recall_score, f1_score
# Macro-average: unweighted mean across classes (gives equal weight to each class)
f1_macro = f1_score(y_test, y_pred, average='macro')
# Weighted-average: weighted by support (class frequency)
f1_weighted = f1_score(y_test, y_pred, average='weighted')
# Micro-average: aggregate TP, FP, FN, compute metrics globally
f1_micro = f1_score(y_test, y_pred, average='micro')
print(f"F1 Macro: {f1_macro:.3f}")
print(f"F1 Weighted: {f1_weighted:.3f}")
print(f"F1 Micro: {f1_micro:.3f}")
# In single-label problems micro-F1 equals accuracy, so it adds little on imbalanced data
# Macro gives the minority class equal weight (exposes weak minority-class performance)
# Weighted follows class frequency (reflects the real-world distribution, but the majority class dominates)
For imbalanced data:
- Macro-average: Gives equal weight to each class (good for balancing minority class performance)
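- Weighted-average: Weights by class frequency (reflects real-world imbalance, so the majority class dominates the score)
- ROC-AUC: Does not depend on class proportions, but it can look optimistic when positives are very rare; average precision is a useful companion in that case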
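The difference between the averaging modes is easiest to see when they are computed by hand. The sketch below uses made-up labels (18 negatives, 2 positives) and hypothetical helper names; it follows the definitions of macro, weighted, and micro averaging rather than calling scikit-learn.

```python
def counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up imbalanced data: 18 negatives followed by 2 positives.
y_true = [0] * 18 + [1] * 2
# The model raises two false alarms and finds one of the two positives.
y_pred = [0] * 16 + [1, 1] + [1, 0]

labels = sorted(set(y_true))
per_class = {c: counts(y_true, y_pred, c) for c in labels}
f1_by_class = {c: f1_from_counts(*per_class[c]) for c in labels}
support = {c: y_true.count(c) for c in labels}

# Macro: plain mean of per-class F1.
f1_macro = sum(f1_by_class.values()) / len(labels)
# Weighted: per-class F1 weighted by how often each class occurs.
f1_weighted = sum(f1_by_class[c] * support[c] for c in labels) / len(y_true)
# Micro: pool TP, FP, FN over all classes, then compute one F1.
pooled = [sum(per_class[c][i] for c in labels) for i in range(3)]
f1_micro = f1_from_counts(*pooled)

print(f"Per-class F1: {f1_by_class}")
print(f"Macro: {f1_macro:.3f}  Weighted: {f1_weighted:.3f}  Micro: {f1_micro:.3f}")
```

The minority class has an F1 of only 0.400. Macro (0.657) makes that weakness visible, while weighted (0.863) and micro (0.850, the same as accuracy) stay close to the majority-class score.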
Multi-Class Classification Metrics
For multi-class problems (3+ classes), the same metrics are computed per class and then averaged:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Classification report: precision, recall, F1 per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Confusion matrix for 3 classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# For multi-class ROC-AUC, use one-vs-rest
y_pred_proba = model.predict_proba(X_test)
roc_auc_ovr = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
print(f"ROC-AUC (One-vs-Rest): {roc_auc_ovr:.3f}")
The classification_report shows per-class metrics, revealing if your model struggles on particular classes.
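The per-class numbers in such a report can be read straight off a multi-class confusion matrix: the diagonal holds the correct predictions, each column sum is everything predicted as that class, and each row sum is everything that truly belongs to it. The matrix below is a hypothetical example, not the actual output of the iris model above.

```python
# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class).
class_names = ["setosa", "versicolor", "virginica"]
cm = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]

per_class = {}
for k, name in enumerate(class_names):
    tp = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    precision = tp / predicted_as_k if predicted_as_k else 0.0
    recall = tp / actually_k if actually_k else 0.0
    per_class[name] = (precision, recall)
    print(f"{name:<11} precision={precision:.3f} recall={recall:.3f}")
```

In this made-up matrix the model never confuses setosa with anything else, while versicolor and virginica are occasionally mistaken for each other, which shows up as lower precision and recall for exactly those two classes.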
Choosing the Right Metric for Your Task
| Task | Primary Metric | Why |
|---|---|---|
| Spam detection | Precision | False positives (blocking legit emails) are costly |
| Disease screening | Recall | False negatives (missing disease) are costly |
| Fraud detection | ROC-AUC or Average Precision | Balances false alarms and missed fraud across thresholds; average precision is more sensitive when fraud is rare |
| Balanced problem | F1 or Accuracy | No special cost structure |
| Imbalanced problem | Macro F1, ROC-AUC, or Average Precision | Accuracy and weighted averages are dominated by the majority class |
| Ranking system | Average Precision | Cares about ordering, not just binary correct/incorrect |
Key Takeaways
- Never use accuracy alone for imbalanced datasets; it hides poor minority class performance.
- Precision and recall are complementary: adjust threshold based on task (FP vs. FN cost).
- F1 balances precision and recall; use when both matter equally.
- ROC-AUC evaluates ranking quality across all thresholds; when positives are very rare, check average precision as well.
- Confusion matrix reveals exactly where your model fails (FP vs. FN patterns).
Frequently Asked Questions
Which metric is "best" overall?
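It depends on your task. For most real-world problems, start with ROC-AUC (threshold-agnostic and independent of class proportions), then choose precision or recall based on the cost of false positives vs. false negatives. When positives are very rare, also look at average precision, because ROC-AUC can look optimistic.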
Can I use precision and recall for regression?
No. Precision and recall are classification-only. For regression, use MAE, RMSE, or R² (covered in the next article).
What is macro vs. weighted average?
Macro treats each class equally (good for balancing minority class). Weighted accounts for class frequency. For imbalanced data, macro often reveals class-specific performance better than weighted.
Should I always aim for high F1?
F1 is useful when precision and recall are equally important. If your task has asymmetric costs (disease detection cares more about recall), optimize for that specific metric instead.
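One common way to encode asymmetric costs in a single number is the F-beta score, which generalizes F1: beta above 1 weights recall more heavily, beta below 1 weights precision more heavily. It is not covered elsewhere in this article, so treat the sketch below as an optional extension; the precision and recall values are made up. scikit-learn provides this metric as `fbeta_score`.

```python
def fbeta(precision, recall, beta):
    """F-beta: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up operating point: many false alarms, few missed cases.
precision, recall = 0.5, 0.9

f_half = fbeta(precision, recall, beta=0.5)  # leans toward precision
f_one = fbeta(precision, recall, beta=1.0)   # plain F1
f_two = fbeta(precision, recall, beta=2.0)   # leans toward recall

print(f"F0.5: {f_half:.3f}")
print(f"F1:   {f_one:.3f}")
print(f"F2:   {f_two:.3f}")
```

For this recall-heavy operating point, F2 (0.776) rewards the model more than F1 (0.643), while F0.5 (0.549) penalizes its low precision.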
How do I interpret ROC-AUC of 0.85?
An AUC of 0.85 means: if you pick a random positive sample and a random negative sample, the model ranks the positive sample higher 85% of the time. As a rough rule of thumb, that is a strong discriminator but not a perfect one, and 0.90+ is often considered excellent; what counts as good enough depends on the domain.
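This ranking interpretation can be checked directly by counting pairs. The sketch below uses made-up scores chosen so the result comes out at 0.85; `pairwise_auc` is a hypothetical helper, and counting ties as half a win is the usual convention assumed here.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie counts as half a win (assumed convention).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Made-up data: four positives and five negatives give 20 pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.45, 0.7, 0.5, 0.4, 0.2, 0.1]

ranking_auc = pairwise_auc(y_true, scores)
print(f"Pairwise AUC: {ranking_auc:.2f}")
```

The positive example outranks the negative one in 17 of the 20 pairs, so the AUC is 0.85.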