MLflow Experiment Tracking: Log Parameters & Metrics
MLflow is an open-source platform for managing the ML lifecycle, and its experiment tracking component is where most teams start their MLOps journey. Experiment tracking means logging every detail of a model training run—hyperparameters, metrics, code version, and model artifacts—so you can compare runs, reproduce the best one, and never lose track of what worked. In this article, you will learn how to instrument your training code with MLflow and use the web UI to analyze thousands of experiments at a glance.
How Experiment Tracking Solves the Notebook Problem
Data scientists often train models in Jupyter notebooks, manually tweaking hyperparameters and jotting down results in a spreadsheet or notebook cell. This approach breaks down fast:
- You forget which learning rate gave 93% accuracy and which gave 91%.
- A colleague runs the same experiment and gets slightly different results (different random seed, data order, library versions).
- You need to retrain the best model but realize you forgot to save the exact hyperparameters.
Experiment tracking solves this. Instead of manual notes, your training code logs every run: parameters, metrics, code version, environment, and model. The MLflow UI lets you sort, filter, and compare runs. You can ask "What was the learning rate of the run with the highest accuracy?" and get the answer in seconds.
Setting Up MLflow (Local)
MLflow is lightweight and requires no infrastructure setup for local development. Install it via pip:
pip install mlflow scikit-learn pandas numpy
By default, MLflow stores experiments and runs in an mlruns/ directory in your current working directory. For a team, you would run an MLflow server (see article 8), but for now, local is fine.
Start the MLflow UI:
mlflow ui
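This launches a web server on http://localhost:5000. Run the command from the same directory where you will run your training scripts, so the UI reads the same mlruns/ directory that your code writes to. Open the address in your browser. You will see a list of experiments with no runs yet. Now let's log a run.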
Logging Your First Experiment
Here's a complete example: train a decision tree classifier and log the parameters and metrics to MLflow.
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Start a new MLflow run
mlflow.start_run()
# Log parameters (hyperparameters)
max_depth = 5
min_samples_split = 2
mlflow.log_param("max_depth", max_depth)
mlflow.log_param("min_samples_split", min_samples_split)
# Train the model
model = DecisionTreeClassifier(
    max_depth=max_depth,
    min_samples_split=min_samples_split,
    random_state=42,
)
model.fit(X_train, y_train)
# Evaluate and log metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("precision", precision)
# Log the model itself
mlflow.sklearn.log_model(model, "model")
# End the run
mlflow.end_run()
print(f"Run logged! Accuracy: {accuracy:.3f}")
Run this script, then check the MLflow UI at http://localhost:5000. Because the script never calls mlflow.set_experiment(), the run appears under the experiment named "Default". Click on the run and inspect the parameters, metrics, and logged model. This is the power of experiment tracking in action: all details in one place.
Organizing Experiments with Tags and Names
When you run dozens of experiments, you need organization. MLflow lets you tag runs, set custom run names, and group runs into experiments.
import mlflow
from datetime import datetime
# Create or get a specific experiment
experiment_name = "Iris Classification - Decision Tree Tuning"
mlflow.set_experiment(experiment_name)
# Hyperparameter under test (same value as in the first example)
max_depth = 5
# Start a run with a custom name
run_name = f"max_depth_{max_depth}__{datetime.now().strftime('%Y%m%d_%H%M%S')}"
mlflow.start_run(run_name=run_name)
# Add tags for filtering and organization
mlflow.set_tag("model_type", "decision_tree")
mlflow.set_tag("dataset", "iris")
mlflow.set_tag("author", "alice")
mlflow.set_tag("phase", "hyperparameter_tuning")
# Log params and metrics as before...
# End the run so that a later start_run() call does not fail
mlflow.end_run()
Tags are freeform key-value pairs that help you organize and filter runs. In the MLflow UI, you can filter by tags to see only runs matching your criteria (e.g., all runs by "alice" on the "iris" dataset). Run names make it easy to identify a run at a glance.
Comparing Multiple Runs
The real power of experiment tracking is comparing runs side-by-side. Let's train multiple models with different hyperparameters and log them all.
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Setup
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
mlflow.set_experiment("Model Comparison - Iris")
# Experiment 1: Decision Tree with max_depth=3
mlflow.start_run(run_name="DecisionTree_depth3")
mlflow.log_param("model", "DecisionTree")
mlflow.log_param("max_depth", 3)
dt_model = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_model.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt_model.predict(X_test))
mlflow.log_metric("accuracy", dt_acc)
mlflow.sklearn.log_model(dt_model, "model")
mlflow.end_run()
# Experiment 2: Decision Tree with max_depth=7
mlflow.start_run(run_name="DecisionTree_depth7")
mlflow.log_param("model", "DecisionTree")
mlflow.log_param("max_depth", 7)
dt_model = DecisionTreeClassifier(max_depth=7, random_state=42)
dt_model.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt_model.predict(X_test))
mlflow.log_metric("accuracy", dt_acc)
mlflow.sklearn.log_model(dt_model, "model")
mlflow.end_run()
# Experiment 3: Random Forest
mlflow.start_run(run_name="RandomForest_100trees")
mlflow.log_param("model", "RandomForest")
mlflow.log_param("n_estimators", 100)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf_model.predict(X_test))
mlflow.log_metric("accuracy", rf_acc)
mlflow.sklearn.log_model(rf_model, "model")
mlflow.end_run()
print("Three runs logged. Visit MLflow UI to compare!")
In the MLflow UI, click on the experiment and select the three runs. Choose "Compare" to see a side-by-side table of parameters and metrics. This makes it obvious which model performed best and which hyperparameters were most effective.
Logging Artifacts: Code, Data, and Plots
Beyond parameters and metrics, you can log artifacts: files like code snapshots, plots, data samples, or model explanations.
import mlflow
import matplotlib.pyplot as plt
import json
mlflow.start_run(run_name="artifact_example")
# Log a hyperparameter file (JSON)
params = {
    "model": "RandomForest",
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42,
}
with open("/tmp/params.json", "w") as f:
    json.dump(params, f)
mlflow.log_artifact("/tmp/params.json", artifact_path="config")
# Log a feature importance plot
# (illustrative values for the four Iris features, not computed from a fitted model)
importances = [0.35, 0.25, 0.20, 0.20]
feature_names = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
plt.figure(figsize=(8, 4))
plt.barh(feature_names, importances)
plt.xlabel("Importance")
plt.title("Feature Importance")
plt.tight_layout()
plt.savefig("/tmp/feature_importance.png")
plt.close()
mlflow.log_artifact("/tmp/feature_importance.png", artifact_path="plots")
# Log model metadata as text
mlflow.log_text("Model trained on Iris dataset. 80/20 train/test split.", "model_notes.txt")
mlflow.end_run()
Artifacts are stored in the run's artifact store (for example, the local filesystem or S3) and are linked to the run. In the UI, you can download artifacts or view plots directly.
Querying Runs Programmatically
Sometimes you need to fetch runs and metrics programmatically, not through the UI. MLflow's MlflowClient allows this.
import mlflow.pyfunc
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Get all runs in the current experiment
experiment = client.get_experiment_by_name("Model Comparison - Iris")
runs = client.search_runs(experiment_ids=[experiment.experiment_id])
# Find the run with the highest accuracy
best_run = max(runs, key=lambda r: r.data.metrics.get("accuracy", 0))
print(f"Best run: {best_run.info.run_name}")
print(f"Accuracy: {best_run.data.metrics['accuracy']:.3f}")
print(f"Parameters: {best_run.data.params}")
# Load the best model and make predictions
model_uri = f"runs:/{best_run.info.run_id}/model"
best_model = mlflow.pyfunc.load_model(model_uri)
print(f"Model loaded from {model_uri}")
This is how you automate the "find the best model" step. In a production pipeline, you would do this to automatically select the top-performing model for deployment.
Key Takeaways
- Experiment tracking logs every training run's parameters, metrics, and artifacts in a centralized place.
- MLflow is lightweight, installable via pip, and starts with a local filesystem backend.
- Use mlflow.log_param() for hyperparameters, mlflow.log_metric() for results, and mlflow.log_artifact() for files.
- Tags and custom run names help organize hundreds of experiments for easy filtering and sorting.
- The MLflow UI's compare feature makes it simple to identify the best models and best hyperparameters.
- The MlflowClient API lets you query runs programmatically, enabling automation.
Frequently Asked Questions
How do I change the MLflow backend from local files to a server?
MLflow stores run data locally in mlruns/ by default. To use a remote server (article 8 covers this in depth), set the tracking URI: mlflow.set_tracking_uri("http://localhost:5000") (if running mlflow server locally) or point to a remote MLflow server with mlflow.set_tracking_uri("https://mlflow.example.com").
Can I log images and plots directly to MLflow?
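Yes. Use mlflow.log_figure() for figure objects such as matplotlib figures, and mlflow.log_image() for in-memory images (NumPy arrays or PIL images). For an image file that is already on disk, use mlflow.log_artifact(). Logged images can be previewed in the UI.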
What's the difference between log_metric() and log_param()?
Parameters are hyperparameters (fixed before training): learning rate, batch size, model type. Metrics are outputs: accuracy, loss, precision. Parameters do not change during training; metrics might (e.g., loss per epoch). Log each appropriately for clarity.
Can I log metrics multiple times during training (e.g., loss per epoch)?
Yes. Use mlflow.log_metric("loss", loss_value, step=epoch) to log the metric with a step number. MLflow will plot the metric over steps in the UI, showing training progress.
Is there overhead to logging experiments?
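Usually small, but not zero. Logging calls are synchronous by default, so the cost depends on how often you log and on the latency to your tracking store. Logging a handful of parameters and metrics per run is negligible. Logging every training step to a remote server can add up; in that case, batch values with mlflow.log_metrics() or log less often. For most projects, the benefit of reproducibility far outweighs the cost.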
Further Reading
- MLflow Tracking Documentation — Official guide to logging and querying runs.
- MLflow Model Registry Documentation — Next step after experiment tracking.
- "Experimenting with ML" (Sculley et al., 2019) — Academic paper on best practices for ML experimentation.
- Weights & Biases Experiment Tracking Comparison — Alternative tool with similar functionality.