Data Versioning with DVC: Version Datasets Like Code
Data Version Control (DVC) is an open-source tool for versioning datasets and ML pipelines, much like Git for data. While Git handles code (text files), DVC handles large binary files (datasets, models, artifacts) and tracks the dependencies between them. Combined with MLflow and Git, DVC gives you a complete, reproducible ML system where you can ask "What data trained this model?" and answer with certainty.
Why Data Versioning Matters
Consider this scenario: you train a model on sales data from January 2026 and achieve 89% accuracy. In April 2026, someone reruns the same training code and gets 84% accuracy. What changed? The data changed. Without data versioning, you cannot answer this question. With DVC, every dataset is tracked: you know exactly which version of the data trained which model, and you can reproduce it.
Data versioning also helps guard against accidental data loss and supports compliance: regulated industries must be able to answer "which data was used for this model?" during an audit.
Installing and Initializing DVC
Install DVC via pip:
pip install dvc
Initialize DVC in your project (alongside Git):
git init
dvc init
This creates a .dvc/ directory (analogous to .git/) containing a config file and a .gitignore, plus a .dvcignore file in the project root. Commit these to Git:
git add .dvc/ .dvcignore
git commit -m "Initialize DVC"
Versioning Data Files
Track large data files with DVC:
dvc add data/training_data.csv
DVC creates a data/training_data.csv.dvc file (small, text-based metadata) and adds the CSV to data/.gitignore so that Git ignores it. The .dvc file is committed to Git; the CSV itself is stored in the local DVC cache (.dvc/cache) and referenced by its hash.
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Add training data version v1"
The .dvc file contains:
# data/training_data.csv.dvc
outs:
- path: training_data.csv
  md5: a1b2c3d4e5f6...
  size: 50000000
The md5 hash uniquely identifies this version of the data. If the data changes, the hash changes, and Git tracks the difference.
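The idea that a content hash identifies a dataset version can be sketched in a few lines of Python. This is an illustration only, not DVC's implementation: the helper name file_md5 and the sample rows are hypothetical, and the file lives in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "training_data.csv"

    # Hypothetical sample data, version 1
    csv_path.write_text("age,income,target\n34,52000,1\n")
    hash_v1 = file_md5(csv_path)

    # Appending a single row yields a different digest, i.e. a new version
    with open(csv_path, "a") as f:
        f.write("51,61000,0\n")
    hash_v2 = file_md5(csv_path)

print(hash_v1 != hash_v2)
```

Reading in chunks keeps memory use flat even for multi-gigabyte files, which is the typical case for datasets tracked this way.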
To retrieve the data later:
dvc pull
This downloads the data from the remote storage (configured next).
Configuring Remote Storage
Local .dvc/ storage is fine for a solo project, but for teams, use remote storage (S3, Google Cloud Storage, Azure Blob, or a simple HTTP server). Configure it:
dvc remote add -d myremote s3://my-bucket/dvc-storage
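# --local writes the credentials to .dvc/config.local, which Git ignores,
# so the keys are not committed along with .dvc/config
dvc remote modify --local myremote access_key_id <YOUR_KEY>
dvc remote modify --local myremote secret_access_key <YOUR_SECRET>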
Push data to the remote:
dvc push
Teammates can pull the data:
dvc pull
This is how teams share large datasets without committing them to Git.
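To see why only hashes need to travel through Git, here is a toy content-addressed store with a push step. It is a sketch under stated assumptions, not how DVC is implemented: the function names, the two-level directory layout, and the sample bytes are all chosen for illustration, and both "local" and "remote" are plain temporary directories.

```python
import hashlib
import tempfile
from pathlib import Path


def object_path(store, md5):
    """Assumed two-level layout: first two hash characters, then the rest."""
    return Path(store) / md5[:2] / md5[2:]


def add_to_store(store, data):
    """Store bytes under their MD5 and return the hash (no-op if present)."""
    md5 = hashlib.md5(data).hexdigest()
    target = object_path(store, md5)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return md5


def push(local, remote, md5):
    """Copy one object to the remote store unless it is already there."""
    src, dst = object_path(local, md5), object_path(remote, md5)
    if dst.exists():
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(src.read_bytes())
    return True


with tempfile.TemporaryDirectory() as tmp:
    local, remote = Path(tmp) / "cache", Path(tmp) / "remote"
    md5 = add_to_store(local, b"age,income,target\n34,52000,1\n")
    first_push = push(local, remote, md5)
    second_push = push(local, remote, md5)  # nothing new to upload
    restored = object_path(remote, md5).read_bytes()

print(first_push, second_push)
```

Because objects are keyed by content, pushing the same data twice uploads nothing the second time, and a teammate who knows the hash can locate the exact bytes.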
Versioning with Git Tags
When your data is stable, tag it in Git:
git tag -a data_v1 -m "Training data version 1: Jan 2026 sales data"
git push origin data_v1
Teammates can check out this exact version:
git checkout data_v1
dvc pull # Retrieve the corresponding data
Creating Reproducible Pipelines with DVC
DVC excels at defining and executing ML pipelines. A pipeline is a sequence of steps (fetch data, preprocess, train, evaluate) where each step depends on others. DVC ensures reproducibility: if data changes, it reruns all downstream steps.
Create a dvc.yaml file in your project root:
# dvc.yaml
stages:
  prepare:
    cmd: python scripts/prepare_data.py
    deps:
      - scripts/prepare_data.py
      - data/raw.csv
    outs:
      - data/processed.csv
    metrics:
      - metrics.json:
          cache: false
  train:
    cmd: python scripts/train_model.py
    deps:
      - scripts/train_model.py
      - data/processed.csv
    outs:
      - model.pkl
    metrics:
      - train_metrics.json:
          cache: false
  evaluate:
    cmd: python scripts/evaluate_model.py
    deps:
      - scripts/evaluate_model.py
      - model.pkl
      - data/processed.csv
    metrics:
      - eval_metrics.json:
          cache: false
Each stage has:
- cmd: Command to run.
- deps: Input files (data or code).
- outs: Output files (data or models).
- metrics: Metrics files written by the stage. With cache: false they stay out of the DVC cache, so they can be committed to Git directly.
Run the pipeline:
dvc repro
DVC executes the stages in order, checking dependencies. If you change data/raw.csv, DVC reruns prepare, train, and evaluate. If you only change a hyperparameter in scripts/train_model.py, it reruns only train and evaluate (since prepare is unchanged).
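The rerun logic described above can be approximated with a small dependency walk. This is a simplified model, not DVC's algorithm: it assumes that every rerun stage changes all of its outputs (DVC compares hashes, so it can skip downstream stages when an output is unchanged), and it relies on the stages being listed in execution order. The STAGES dictionary is a hand-written copy of the pipeline structure, and stages_to_rerun is a hypothetical helper.

```python
# Hand-written, simplified copy of the pipeline structure from dvc.yaml
STAGES = {
    "prepare": {
        "deps": ["scripts/prepare_data.py", "data/raw.csv"],
        "outs": ["data/processed.csv"],
    },
    "train": {
        "deps": ["scripts/train_model.py", "data/processed.csv"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["scripts/evaluate_model.py", "model.pkl", "data/processed.csv"],
        "outs": [],
    },
}


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by the changed files, in definition order."""
    dirty_files = set(changed_files)
    rerun = set()
    progress = True
    while progress:
        progress = False
        for name, stage in stages.items():
            if name not in rerun and dirty_files & set(stage["deps"]):
                rerun.add(name)
                # Assume every output of a rerun stage changes
                dirty_files.update(stage["outs"])
                progress = True
    return [name for name in stages if name in rerun]


print(stages_to_rerun(STAGES, ["data/raw.csv"]))
print(stages_to_rerun(STAGES, ["scripts/train_model.py"]))
```

Changing data/raw.csv marks all three stages, while changing only scripts/train_model.py leaves prepare untouched, matching the behavior described in the text.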
Python Scripts in a DVC Pipeline
Here's a concrete example. Create scripts/prepare_data.py:
# scripts/prepare_data.py
import json

import pandas as pd

# Load raw data
df = pd.read_csv("data/raw.csv")
rows_before = len(df)  # count rows before any are dropped

# Preprocessing: drop incomplete rows, then standardize age
df = df.dropna()
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Save features and label together, label last, because the
# downstream scripts read the label from the last column
processed = df[["age_scaled", "income", "target"]]
processed.to_csv("data/processed.csv", index=False)

# Log metrics
metrics = {
    "rows_before": rows_before,
    "rows_after": len(processed),
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
print("Data prepared and saved.")
Create scripts/train_model.py:
# scripts/train_model.py
import json
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load processed data (features first, label in the last column)
data = pd.read_csv("data/processed.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Hold out 20% for evaluation; evaluate_model.py repeats this exact split
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Log metrics
metrics = {"train_accuracy": float(model.score(X_train, y_train))}
with open("train_metrics.json", "w") as f:
    json.dump(metrics, f)
print(f"Model trained. Accuracy: {metrics['train_accuracy']:.3f}")
Create scripts/evaluate_model.py:
# scripts/evaluate_model.py
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Load data and model
data = pd.read_csv("data/processed.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Recreate the held-out 20% (same test_size and random_state as in training)
_, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate on rows the model has not seen
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
metrics = {
    "accuracy": float(accuracy),
    "precision": float(precision),
}
with open("eval_metrics.json", "w") as f:
    json.dump(metrics, f)
print(f"Evaluation complete. Accuracy: {accuracy:.3f}")
Run the pipeline:
dvc repro
DVC executes all three scripts in order, handling dependencies automatically. View the results:
dvc metrics show
This displays metrics from eval_metrics.json and other metrics files.
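Since the metrics files are plain JSON, they are also easy to read from your own scripts. The following sketch gathers them into one dictionary; collect_metrics is a hypothetical helper, and the numbers written to the temporary directory are placeholders, not real results.

```python
import json
import tempfile
from pathlib import Path


def collect_metrics(root, filenames):
    """Load each existing JSON metrics file under root into {filename: metrics}."""
    collected = {}
    for name in filenames:
        path = Path(root) / name
        if path.exists():
            collected[name] = json.loads(path.read_text())
    return collected


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder values for illustration only
    Path(tmp, "train_metrics.json").write_text(json.dumps({"train_accuracy": 0.5}))
    Path(tmp, "eval_metrics.json").write_text(
        json.dumps({"accuracy": 0.25, "precision": 0.75})
    )

    # metrics.json is deliberately absent here and is skipped
    table = collect_metrics(
        tmp, ["metrics.json", "train_metrics.json", "eval_metrics.json"]
    )

for filename, values in table.items():
    for key, value in values.items():
        print(f"{filename}\t{key}\t{value}")
```

In a real project you would pass the project root instead of a temporary directory.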
Comparing Pipeline Experiments
Because the metrics files are committed to Git, DVC can compare them across commits and branches. Compare two branches:
dvc metrics diff main feature/new_preprocessing
This shows how metrics changed between the main branch and your feature branch, helping you decide if the change is worth merging.
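Conceptually, such a comparison is a per-metric difference between two sets of values. A minimal sketch, assuming the metrics of each branch have already been loaded into dictionaries; diff_metrics is a hypothetical helper and the values are placeholders, not real results.

```python
def diff_metrics(old, new):
    """Return {metric: (old, new, change)} for every metric in either dict."""
    rows = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        # A change is only defined when the metric exists on both sides
        if before is not None and after is not None:
            change = after - before
        else:
            change = None
        rows[key] = (before, after, change)
    return rows


# Placeholder values for illustration only
main_metrics = {"accuracy": 0.5, "precision": 0.5}
feature_metrics = {"accuracy": 0.75, "precision": 0.25, "recall": 0.5}

for name, (before, after, change) in diff_metrics(main_metrics, feature_metrics).items():
    print(name, before, after, change)
```

A metric that exists on only one branch (recall in this example) is reported without a change value instead of being dropped, so newly added metrics stay visible in the comparison.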
Integrating DVC with MLflow
Combine MLflow (for experiment tracking) with DVC (for data and pipeline versioning):
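import mlflow
import mlflow.sklearn
import dvc.api

# Resolve where DVC stores this file on the default remote
# (requires a configured remote)
data_url = dvc.api.get_url("data/processed.csv", repo=".", rev="HEAD")
# Storage paths end in <first 2 hash chars>/<remaining chars>,
# so join the last two segments to recover the full hash
data_hash = "".join(data_url.split("/")[-2:])

with mlflow.start_run():
    mlflow.log_param("data_version", data_hash)
    # ... training code that produces `model` ...
    mlflow.sklearn.log_model(model, "model")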
Now every MLflow experiment logs the exact data version it used. When investigating an old model, you can retrieve that data version and retrain locally.
Key Takeaways
- DVC versions datasets like Git versions code, using content hashes to uniquely identify each version.
- Data is stored locally or on remote storage (S3, GCS, etc.); only small .dvc metadata files are committed to Git.
- DVC pipelines define ML workflows as DAGs (directed acyclic graphs) of steps with dependencies.
- Running dvc repro executes only the steps whose inputs changed, saving time and ensuring reproducibility.
- Combined with Git tags and MLflow, DVC gives complete reproducibility: data version, code version, and model all tracked together.
Frequently Asked Questions
Can I use DVC with existing large files already in my repository?
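Yes. First stop tracking them in Git with git rm --cached <file> (the file stays on disk, and it remains in earlier Git history unless you rewrite that history), then run dvc add <file> to begin tracking it with DVC. To rename or relocate a tracked file afterwards, use dvc move.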
Does DVC require a remote backend?
No, it works locally out of the box. Remote storage is optional and recommended for teams. For a solo project, local storage in .dvc/ is sufficient.
How does DVC handle branching and merging in Git?
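After switching Git branches, run dvc checkout to sync the workspace with that branch's .dvc files, or dvc pull if the data is not yet in your local cache. If two branches modify the same data file, the .dvc files conflict like any other text file: resolve the conflict by choosing which MD5 hash to keep (or regenerate the data and run dvc add again), then run dvc checkout so the workspace matches the result.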
Can I version models with DVC?
Yes. Add model files with dvc add model.pkl or define them as outputs in dvc.yaml pipelines. This is less common than versioning data, since MLflow handles model versioning, but it is possible.
What if my dataset is too large to push to remote storage frequently?
Commit only the .dvc metadata to Git, which is tiny. Push data to remote storage only when it is stable or when collaborators need it. DVC also supports selective pushes: dvc push <target> uploads only the given target, which can be a tracked file, a .dvc file, or a stage name.
Further Reading
- DVC Official Documentation — Complete guide to DVC features.
- DVC Pipelines Documentation — Deep dive into reproducible workflows.
- "Data Versioning for Machine Learning" (Schelter et al., 2017) — Research on data versioning systems.
- Combining MLflow and DVC — Integration strategies.