Classification Metrics: Evaluate Your Models Right

Accuracy alone is a dangerous metric for classification. If your dataset is 99% negative (spam detection), a model that predicts all negative achieves 99% accuracy while catching zero spam. Classification metrics—precision, recall, F1, and ROC-AUC—measure different aspects of model performance. Choosing the right metric depends on your task: precision for spam (false positives are costly), recall for disease detection (false negatives are costly), F1 when you care about both equally. The confusion matrix reveals exactly where your model fails.
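The accuracy trap described above is easy to reproduce. The sketch below uses a made-up dataset of 1,000 emails with 10 spam messages (the sizes are assumptions chosen for round numbers) and a "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical dataset: 1,000 emails, 10 spam (label 1), 990 legitimate (label 0)
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts the majority (negative) class
y_always_negative = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_always_negative)
recall = recall_score(y_true, y_always_negative, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # 0.99
print(f"Recall:   {recall:.2f}")    # 0.00
```

The accuracy looks excellent, yet the recall shows that not a single spam message was caught.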

The Confusion Matrix: Foundation of All Metrics

The confusion matrix tabulates predictions vs. ground truth for binary classification:

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

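# Load data: binary classification (scikit-learn encodes 0 = malignant, 1 = benign)
# Note: label 1 (benign) is therefore the "positive" class in the metrics below
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)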

# Train a classifier
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: rows = true labels, columns = predicted labels
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f" True Negatives (TN): {cm[0, 0]}")
print(f" False Positives (FP): {cm[0, 1]}")
print(f" False Negatives (FN): {cm[1, 0]}")
print(f" True Positives (TP): {cm[1, 1]}")

# Visualize
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.show()

The confusion matrix reveals:

  • TN (top-left): Correctly predicted negative
  • FP (top-right): Incorrectly predicted positive (false alarm)
  • FN (bottom-left): Incorrectly predicted negative (missed case)
  • TP (bottom-right): Correctly predicted positive

All other metrics derive from these four values.
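To make that derivation concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand from the four counts. The toy labels are invented for illustration and are unrelated to the breast cancer data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to keep the arithmetic easy to follow
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy:  {accuracy_manual:.2f}")   # 0.80
print(f"Precision: {precision_manual:.2f}")  # 0.75
print(f"Recall:    {recall_manual:.2f}")     # 0.75
```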

Precision and Recall: Complementary Metrics

Precision and recall capture different aspects of classifier quality:

from sklearn.metrics import precision_score, recall_score, f1_score

# Precision: Of all positive predictions, how many were correct?
# TP / (TP + FP): rises as the number of false positives falls
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")

# Recall: Of all actual positives, how many did we catch?
# TP / (TP + FN): rises as the number of false negatives falls
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.3f}")

# F1: Harmonic mean of precision and recall
# 2 * (precision * recall) / (precision + recall)
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")

| Metric | Formula | Answers | Best For |
| --- | --- | --- | --- |
| Precision | TP / (TP + FP) | "Of my positive predictions, how many are correct?" | Spam detection (false positives are costly) |
| Recall | TP / (TP + FN) | "Of all actual positives, how many did I find?" | Disease detection (false negatives are costly) |
| F1 | 2 × precision × recall / (precision + recall) | Harmonic mean of precision and recall | Balanced precision-recall tradeoff |

High precision, low recall: the model is conservative, predicting positive rarely, but accurately. Low precision, high recall: the model is aggressive, predicting positive often, catching most cases but with false alarms.
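One way to see why F1 uses the harmonic mean rather than a plain average is to compare the two on a lopsided classifier. The helper `f1_from` and the two operating points below are hypothetical, introduced only for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical operating points with the same arithmetic mean (0.55)
operating_points = {
    "conservative (P=1.00, R=0.10)": (1.00, 0.10),
    "balanced     (P=0.55, R=0.55)": (0.55, 0.55),
}

for name, (p, r) in operating_points.items():
    arithmetic = (p + r) / 2
    print(f"{name}: arithmetic mean = {arithmetic:.3f}, F1 = {f1_from(p, r):.3f}")
```

Both points average to 0.55, but the conservative one scores an F1 of roughly 0.18 because the harmonic mean is pulled toward the weaker of the two values.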

Trade-Off: Precision vs. Recall

Many classifiers have a decision threshold (e.g., probability 0.5). Lowering the threshold increases recall (catch more positives) but decreases precision (more false positives):

from sklearn.metrics import precision_recall_curve
import numpy as np

# Get predicted probabilities (not binary predictions)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Compute precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)

# Plot the curve
plt.figure(figsize=(10, 5))
plt.plot(recalls, precisions, label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid()
plt.show()

# Choose threshold based on task
# For disease detection (minimize false negatives): choose high recall (e.g., 0.9)
# For spam detection (minimize false positives): choose high precision (e.g., 0.95)

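# Find the highest threshold that still achieves at least 90% recall.
# recalls is non-increasing as the threshold rises and has one more entry
# than thresholds, so drop its final element before indexing.
idx_90_recall = np.where(recalls[:-1] >= 0.90)[0][-1]
threshold_90_recall = thresholds[idx_90_recall]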
print(f"Threshold for 90% recall: {threshold_90_recall:.3f}")

# Apply custom threshold
y_pred_custom = (y_pred_proba >= threshold_90_recall).astype(int)
precision_custom = precision_score(y_test, y_pred_custom)
recall_custom = recall_score(y_test, y_pred_custom)
print(f"Custom threshold - Precision: {precision_custom:.3f}, Recall: {recall_custom:.3f}")

Adjust the threshold to match your task's cost structure.
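If the costs of the two error types can be estimated, the threshold can be chosen by minimizing total cost directly. The sketch below assumes a missed positive is ten times as costly as a false alarm. `COST_FP`, `COST_FN`, the helper `total_cost`, and the toy scores are all hypothetical.

```python
import numpy as np

# Hypothetical costs: a missed positive is assumed 10x worse than a false alarm
COST_FP = 1.0
COST_FN = 10.0

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.95])

def total_cost(threshold: float) -> float:
    """Total misclassification cost when predicting positive at score >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    return COST_FP * fp + COST_FN * fn

# Every distinct score is a candidate threshold
candidates = np.unique(y_score)
costs = np.array([total_cost(t) for t in candidates])
best_threshold = candidates[np.argmin(costs)]

print(f"Best threshold: {best_threshold:.2f} (total cost {costs.min():.1f})")
```

With these numbers the cheapest threshold is 0.30: it accepts three false alarms in order to miss no positives. In practice the threshold should be tuned on a validation set rather than on the test set.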

ROC Curve and AUC: Threshold-Agnostic Evaluation

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs. False Positive Rate across all thresholds. AUC (Area Under Curve) is a single number summarizing overall discriminative ability:

from sklearn.metrics import roc_curve, auc, roc_auc_score

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Compute AUC (area under curve)
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC: {roc_auc:.3f}")

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid()
plt.show()

ROC-AUC does not depend on the class ratio, and its values are read on a fixed scale:

  • 0.5 = random guessing
  • 1.0 = perfect classification
  • Above 0.5 = better than random

ROC-AUC is threshold-agnostic: it evaluates the model across all decision boundaries, not just the default 0.5.
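A small sketch of what threshold-agnostic means in practice: ROC-AUC depends only on how the scores are ordered, so a strictly increasing transformation of the scores leaves it unchanged, while accuracy at a fixed 0.5 cutoff can move. The toy scores and the cubing transformation are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and scores (illustrative only)
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.70, 0.90])

# Cubing is strictly increasing on [0, 1]: values change, their order does not
squashed = scores ** 3

auc_original = roc_auc_score(y_true, scores)
auc_squashed = roc_auc_score(y_true, squashed)

# Accuracy at a fixed 0.5 cutoff is sensitive to the transformation
acc_original = accuracy_score(y_true, (scores >= 0.5).astype(int))
acc_squashed = accuracy_score(y_true, (squashed >= 0.5).astype(int))

print(f"AUC:      {auc_original:.3f} -> {auc_squashed:.3f}")
print(f"Accuracy: {acc_original:.3f} -> {acc_squashed:.3f}")
```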

Metrics for Imbalanced Classification

When classes are imbalanced (e.g., 95% negative, 5% positive), the averaging strategy determines how much each class counts toward the final score:

from sklearn.metrics import precision_score, recall_score, f1_score

# Macro-average: unweighted mean across classes (gives equal weight to each class)
f1_macro = f1_score(y_test, y_pred, average='macro')

# Weighted-average: weighted by support (class frequency)
f1_weighted = f1_score(y_test, y_pred, average='weighted')

# Micro-average: aggregate TP, FP, FN, compute metrics globally
f1_micro = f1_score(y_test, y_pred, average='micro')

print(f"F1 Macro: {f1_macro:.3f}")
print(f"F1 Weighted: {f1_weighted:.3f}")
print(f"F1 Micro: {f1_micro:.3f}")

# For imbalanced data, use macro or weighted; micro is equivalent to accuracy
# Macro treats minority class with equal weight (good for balancing)
# Weighted accounts for class frequency (reflects real-world distribution)

For imbalanced data:

  • Macro-average: Gives equal weight to each class (good for balancing minority class performance)
  • Weighted-average: Weights by class frequency (reflects real-world imbalance)
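  • ROC-AUC: Insensitive to the class ratio, although with very rare positives it can look optimistic; the precision-recall curve is a useful cross-check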
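The three averaging modes can be reproduced by hand from the per-class F1 scores, which makes the difference between them easier to see. The toy labels below (18 negatives, 2 positives) are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 18 negatives, 2 positives
y_true = np.array([0] * 18 + [1] * 2)
# The classifier raises two false alarms and finds one of the two positives
y_pred = np.array([0] * 16 + [1, 1] + [1, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # [F1 of class 0, F1 of class 1]
support = np.bincount(y_true)                          # samples per class: [18, 2]

macro_manual = per_class_f1.mean()                           # every class counts equally
weighted_manual = np.average(per_class_f1, weights=support)  # weighted by class frequency
micro = f1_score(y_true, y_pred, average="micro")            # pooled TP, FP, FN

print(f"Per-class F1: {per_class_f1.round(3)}")
print(f"Macro:    {macro_manual:.3f}")
print(f"Weighted: {weighted_manual:.3f}")
print(f"Micro:    {micro:.3f} (accuracy: {accuracy_score(y_true, y_pred):.3f})")
```

The minority class has an F1 of only 0.4. The macro average (about 0.66) exposes that weakness, while the weighted average (about 0.86) is dominated by the majority class.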

Multi-Class Classification Metrics

For multi-class problems (3+ classes), metrics extend naturally:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification report: precision, recall, F1 per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix for 3 classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# For multi-class ROC-AUC, use one-vs-rest
y_pred_proba = model.predict_proba(X_test)
roc_auc_ovr = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
print(f"ROC-AUC (One-vs-Rest): {roc_auc_ovr:.3f}")

The classification_report shows per-class metrics, revealing if your model struggles on particular classes.
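The per-class numbers in such a report can be read straight off a multi-class confusion matrix: the diagonal holds the correct predictions, column sums give the predicted counts, and row sums give the actual counts. The 3×3 matrix below is a made-up example, not the output of the iris model.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
cm_toy = np.array([
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
])

correct = np.diag(cm_toy)                          # correct predictions per class
precision_per_class = correct / cm_toy.sum(axis=0)  # divide by predicted counts (columns)
recall_per_class = correct / cm_toy.sum(axis=1)     # divide by actual counts (rows)

for k in range(cm_toy.shape[0]):
    print(f"class {k}: precision = {precision_per_class[k]:.3f}, "
          f"recall = {recall_per_class[k]:.3f}")
```

Off-diagonal entries show which pairs of classes get confused. Here classes 1 and 2 are occasionally mistaken for each other, while class 0 is never confused with either.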

Choosing the Right Metric for Your Task

| Task | Primary Metric | Why |
| --- | --- | --- |
| Spam detection | Precision | False positives (blocking legit emails) are costly |
| Disease screening | Recall | False negatives (missing disease) are costly |
| Fraud detection | ROC-AUC | Balance false alarms and missed fraud; handles imbalance |
| Balanced problem | F1 or Accuracy | No special cost structure |
| Imbalanced problem | Weighted F1 or ROC-AUC | Accounts for class imbalance |
| Ranking system | Average Precision | Cares about ordering, not just binary correct/incorrect |
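Average precision, listed above for ranking systems, rewards placing positives near the top of the ranked list. As a rough sketch on toy data with no tied scores, it can be computed by averaging the precision at each rank where a positive appears, then checked against scikit-learn's `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels and scores (illustrative only, no tied scores)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80])

# Rank items from highest to lowest score
order = np.argsort(-scores)
ranked_labels = y_true[order]                      # [1, 0, 1, 0]

# Precision after each position in the ranking
hits_so_far = np.cumsum(ranked_labels)
precision_at_k = hits_so_far / np.arange(1, len(ranked_labels) + 1)

# Average the precision values at the ranks where a positive occurs
ap_manual = precision_at_k[ranked_labels == 1].mean()
ap_sklearn = average_precision_score(y_true, scores)

print(f"Manual AP:  {ap_manual:.3f}")
print(f"sklearn AP: {ap_sklearn:.3f}")
```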

Key Takeaways

  • Never use accuracy alone for imbalanced datasets; it hides poor minority class performance.
  • Precision and recall are complementary: adjust threshold based on task (FP vs. FN cost).
  • F1 balances precision and recall; use when both matter equally.
  • ROC-AUC evaluates across all thresholds; useful for comparing models and for ranking tasks, including on imbalanced data.
  • Confusion matrix reveals exactly where your model fails (FP vs. FN patterns).

Frequently Asked Questions

Which metric is "best" overall?

It depends on your task. For most real-world problems, start with ROC-AUC (threshold-agnostic, handles imbalance well), then choose precision/recall based on the cost of false positives vs. false negatives.

Can I use precision and recall for regression?

No. Precision and recall are classification-only. For regression, use MAE, RMSE, or R² (covered in the next article).

What is macro vs. weighted average?

Macro treats each class equally (good for balancing minority class). Weighted accounts for class frequency. For imbalanced data, macro often reveals class-specific performance better than weighted.

Should I always aim for high F1?

F1 is useful when precision and recall are equally important. If your task has asymmetric costs (disease detection cares more about recall), optimize for that specific metric instead.
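One common way to express asymmetric costs in a single score is the F-beta score, which this article does not otherwise cover. A beta above 1 weights recall more heavily and a beta below 1 favors precision. The sketch uses scikit-learn's `fbeta_score` on invented labels for a classifier with perfect recall and mediocre precision.

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical predictions: all 4 positives found, at the price of 3 false alarms
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

f_half = fbeta_score(y_true, y_pred, beta=0.5)  # precision-leaning
f_one = f1_score(y_true, y_pred)                # balanced
f_two = fbeta_score(y_true, y_pred, beta=2)     # recall-leaning

print(f"F0.5: {f_half:.3f}")  # 0.625
print(f"F1:   {f_one:.3f}")   # 0.727
print(f"F2:   {f_two:.3f}")   # 0.870
```

The same predictions score higher as beta grows, because this classifier's strength is recall.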

How do I interpret ROC-AUC of 0.85?

An AUC of 0.85 means that if you pick a random positive sample and a random negative sample, the model scores the positive one higher 85% of the time. That makes it a strong discriminator, though not a perfect one; a common rule of thumb treats 0.90 and above as excellent.
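This ranking interpretation can be checked by brute force: compare every positive with every negative and count how often the positive receives the higher score. The toy labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative only)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.80, 0.90])

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]

# Score difference for every (positive, negative) pair; ties count as half a win
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:     {roc_auc_score(y_true, scores):.4f}")
```

Of the 16 positive-negative pairs, 15 are ordered correctly, so both numbers come out to 0.9375.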

Further Reading