Data Versioning with DVC: Version Datasets Like Code

Data Version Control (DVC) is an open-source tool for versioning datasets and ML pipelines, much like Git for data. While Git handles code (text files), DVC handles large binary files (datasets, models, artifacts) and tracks the dependencies between them. Combined with MLflow and Git, DVC gives you a complete, reproducible ML system where you can ask "What data trained this model?" and answer with certainty.

Why Data Versioning Matters

Consider this scenario: you train a model on sales data from January 2026 and achieve 89% accuracy. In April 2026, someone reruns the same training code and gets 84% accuracy. What changed? If the code is identical, the data is the likely culprit, but without data versioning you cannot confirm that or see how the data differs. With DVC, every dataset is tracked: you know exactly which version of the data trained which model, and you can reproduce it.

Data versioning also guards against accidental data loss and supports compliance: regulated industries must be able to answer "which data was used for this model?" during an audit.

Installing and Initializing DVC

Install DVC via pip:

pip install dvc

Initialize DVC in your project (alongside Git):

git init
dvc init

This creates a .dvc/ directory (analogous to .git/) containing a config file and a .gitignore, plus a .dvcignore file in the project root. Commit these to Git:

git add .dvc/ .dvcignore
git commit -m "Initialize DVC"

Versioning Data Files

Track large data files with DVC:

dvc add data/training_data.csv

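DVC moves the CSV into its local cache (.dvc/cache), adds the path to data/.gitignore so Git ignores the file, and creates a data/training_data.csv.dvc file (small, text-based, similar to .git metadata). The .dvc file is committed to Git; the CSV itself stays in the cache and is referenced by its hash.

git add data/training_data.csv.dvc data/.gitignore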
git commit -m "Add training data version v1"

The .dvc file contains:

# data/training_data.csv.dvc
outs:
- path: training_data.csv
  md5: a1b2c3d4e5f6...
  size: 50000000

The md5 value is a hash of the file's contents and identifies this version of the data. If the data changes and you run dvc add again, the hash in the .dvc file changes, and Git records that change as an ordinary diff.
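The idea of a content hash can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not DVC's implementation; the function name file_md5 and the chunk size are assumptions made for the example.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes get the same digest, and changing a single byte produces a different one, which is what lets a small .dvc file stand in for a large dataset.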

To retrieve the data later:

dvc pull

This downloads the data from the remote storage (configured next).

Configuring Remote Storage

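The local cache in .dvc/cache is fine for a solo project, but for teams, use remote storage (S3, Google Cloud Storage, Azure Blob, or a simple HTTP server). S3 support is an optional extra, installed with pip install "dvc[s3]". Configure the remote, using --local for credentials so they are written to .dvc/config.local (ignored by Git) rather than the committed .dvc/config:

dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc remote modify --local myremote access_key_id <YOUR_KEY>
dvc remote modify --local myremote secret_access_key <YOUR_SECRET>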

Push data to the remote:

dvc push

Teammates can pull the data:

dvc pull

This is how teams share large datasets without committing them to Git.
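Conceptually, a push only needs to transfer objects the remote does not already have. The sketch below mimics that with two local directories standing in for the cache and the remote. The directory layout and the helper push_missing are assumptions for illustration; they do not reproduce DVC's actual transfer logic.

```python
import shutil
from pathlib import Path


def push_missing(cache_dir, remote_dir):
    """Copy objects present in cache_dir but absent from remote_dir.

    Objects are files stored under a hash-derived relative path, so a
    matching path implies matching content and the copy can be skipped.
    Returns the sorted relative paths that were copied.
    """
    cache, remote = Path(cache_dir), Path(remote_dir)
    copied = []
    for obj in sorted(p for p in cache.rglob("*") if p.is_file()):
        rel = obj.relative_to(cache)
        target = remote / rel
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
            copied.append(rel.as_posix())
    return copied
```

Calling push_missing twice in a row copies nothing the second time, which is why repeated pushes of an unchanged dataset are cheap.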

Versioning with Git Tags

When your data is stable, tag it in Git:

git tag -a data_v1 -m "Training data version 1: Jan 2026 sales data"
git push origin data_v1

Teammates can check out this exact version:

git checkout data_v1
dvc pull  # Retrieve the corresponding data
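The same tag can also be read from Python without switching the working tree. The sketch below assumes the dvc package is installed and relies on dvc.api.open with its rev argument. The helper name csv_shape and the injectable opener parameter are hypothetical, added so the function can be exercised without a DVC repository.

```python
import csv


def csv_shape(path, rev, opener=None):
    """Return (header, row_count) for a DVC-tracked CSV at a given Git revision."""
    if opener is None:
        # Imported lazily so this module loads even where DVC is not installed.
        import dvc.api

        opener = dvc.api.open
    with opener(path, rev=rev) as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, sum(1 for _ in reader)
```

Inside the repository, csv_shape("data/training_data.csv", rev="data_v1") would inspect the file as it existed at the data_v1 tag, provided that version is available in the local cache or the remote.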

Creating Reproducible Pipelines with DVC

DVC excels at defining and executing ML pipelines. A pipeline is a sequence of steps (fetch data, preprocess, train, evaluate) where each step depends on others. DVC ensures reproducibility: if data changes, it reruns all downstream steps.

Create a dvc.yaml file in your project root:

# dvc.yaml
stages:
  prepare:
    cmd: python scripts/prepare_data.py
    deps:
      - scripts/prepare_data.py
      - data/raw.csv
    outs:
      - data/processed.csv
    metrics:
      - metrics.json:
          cache: false

  train:
    cmd: python scripts/train_model.py
    deps:
      - scripts/train_model.py
      - data/processed.csv
    outs:
      - model.pkl
    metrics:
      - train_metrics.json:
          cache: false

  evaluate:
    cmd: python scripts/evaluate_model.py
    deps:
      - scripts/evaluate_model.py
      - model.pkl
      - data/processed.csv
    metrics:
      - eval_metrics.json:
          cache: false

Each stage has:

cmd: Command to run.
deps: Input files (data or code).
outs: Output files (data or models).
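metrics: Metrics files written by the stage. With cache: false they are kept out of the DVC cache, so they can be committed to Git and compared across commits.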

Run the pipeline:

dvc repro

DVC executes the stages in order, checking dependencies. If you change data/raw.csv, DVC reruns prepare, train, and evaluate. If you only change a hyperparameter in scripts/train_model.py, it reruns only train and evaluate (since prepare is unchanged).
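The rerun logic described above can be approximated with a small dependency-graph sketch. It mirrors the deps and outs of the three stages in dvc.yaml and uses the standard-library graphlib module (Python 3.9+). The function stages_to_rerun is hypothetical, and it makes a simplifying assumption: a stage that reruns always changes its outputs. DVC itself compares recorded hashes, so it may skip a downstream stage whose inputs turn out to be byte-identical.

```python
from graphlib import TopologicalSorter

# Stage definitions mirroring the dvc.yaml above (deps and outs only).
STAGES = {
    "prepare": {
        "deps": ["scripts/prepare_data.py", "data/raw.csv"],
        "outs": ["data/processed.csv"],
    },
    "train": {
        "deps": ["scripts/train_model.py", "data/processed.csv"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["scripts/evaluate_model.py", "model.pkl", "data/processed.csv"],
        "outs": [],
    },
}


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by changed_files, in execution order."""
    # Which stage produces each output file.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # Map each stage to the upstream stages that produce its dependencies.
    upstream = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    changed = set(changed_files)
    dirty = set()
    order = []
    # static_order() yields every stage after all of its upstream stages.
    for name in TopologicalSorter(upstream).static_order():
        if changed & set(stages[name]["deps"]) or upstream[name] & dirty:
            dirty.add(name)
            order.append(name)
    return order
```

Passing ["data/raw.csv"] selects all three stages, while passing ["scripts/train_model.py"] selects only train and evaluate.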

Python Scripts in a DVC Pipeline

Here's a concrete example. Create scripts/prepare_data.py:

# scripts/prepare_data.py
import json

import pandas as pd
from sklearn.model_selection import train_test_split

# Load raw data
raw = pd.read_csv("data/raw.csv")

# Preprocessing: drop incomplete rows, then standardize age
df = raw.dropna().copy()
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Keep features and target together so later stages can select them by name
processed = df[["age_scaled", "income", "target"]]

# Train-test split, recorded in a "split" column because the pipeline
# declares a single output file
train_df, test_df = train_test_split(processed, test_size=0.2, random_state=42)
processed = pd.concat(
    [train_df.assign(split="train"), test_df.assign(split="test")]
)

# Save processed data
processed.to_csv("data/processed.csv", index=False)

# Log metrics: row counts before and after dropping incomplete rows
metrics = {
    "rows_before": len(raw),
    "rows_after": len(processed),
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)

print("Data prepared and saved.")

Create scripts/train_model.py:

# scripts/train_model.py
import json
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["age_scaled", "income"]

# Load the training rows of the processed data
data = pd.read_csv("data/processed.csv")
train = data[data["split"] == "train"]
X_train = train[FEATURES]
y_train = train["target"]

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Log metrics (accuracy on the training rows, not a generalization estimate)
metrics = {"train_accuracy": float(model.score(X_train, y_train))}
with open("train_metrics.json", "w") as f:
    json.dump(metrics, f)

print(f"Model trained. Training accuracy: {metrics['train_accuracy']:.3f}")

Create scripts/evaluate_model.py:

# scripts/evaluate_model.py
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score

FEATURES = ["age_scaled", "income"]

# Load the held-out test rows and the trained model
data = pd.read_csv("data/processed.csv")
test = data[data["split"] == "test"]
X_test = test[FEATURES]
y_test = test["target"]

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Evaluate on rows the model did not see during training
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")

metrics = {
    "accuracy": float(accuracy),
    "precision": float(precision),
}
with open("eval_metrics.json", "w") as f:
    json.dump(metrics, f)

print(f"Evaluation complete. Accuracy: {accuracy:.3f}")

Run the pipeline:

dvc repro

DVC executes all three scripts in order, handling dependencies automatically. View the results:

dvc metrics show

This displays metrics from eval_metrics.json and other metrics files.
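Because the metrics files are plain JSON, they can also be read directly, for instance in a notebook or a CI check. The following is a rough sketch using only the standard library. The helper names load_metrics and format_metrics are hypothetical, and the file names are the ones declared in dvc.yaml.

```python
import json
from pathlib import Path

# Metrics files declared in the dvc.yaml stages.
METRICS_FILES = ["metrics.json", "train_metrics.json", "eval_metrics.json"]


def load_metrics(paths=METRICS_FILES, root="."):
    """Return {file name: metrics dict} for every metrics file that exists."""
    found = {}
    for name in paths:
        path = Path(root) / name
        if path.is_file():
            found[name] = json.loads(path.read_text())
    return found


def format_metrics(metrics):
    """Render one 'file  metric  value' line per metric."""
    return [
        f"{name}  {key}  {value}"
        for name, values in metrics.items()
        for key, value in values.items()
    ]
```

Printing "\n".join(format_metrics(load_metrics())) from the project root gives a quick plain-text overview of whichever metrics files the pipeline has produced so far.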

Comparing Pipeline Experiments

Because the metrics files are committed to Git, DVC can compare them across commits and branches. Compare two branches:

dvc metrics diff main feature/new_preprocessing

This shows how metrics changed between the main branch and your feature branch, helping you decide if the change is worth merging.
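The comparison itself is simple arithmetic over two metrics dictionaries, such as the contents of eval_metrics.json on each branch. Here is a minimal sketch; diff_metrics and is_improvement are hypothetical helpers, and the default threshold is an arbitrary choice for the example.

```python
def diff_metrics(old, new):
    """Return {metric: (old value, new value, change)} for metrics present in both."""
    return {
        key: (old[key], new[key], new[key] - old[key])
        for key in old
        if key in new
    }


def is_improvement(old, new, metric="accuracy", min_gain=0.0):
    """True if `metric` rose by more than `min_gain` from old to new."""
    diff = diff_metrics(old, new)
    return metric in diff and diff[metric][2] > min_gain
```

A check like is_improvement(main_metrics, branch_metrics, min_gain=0.01) could serve as a simple merge gate, with the threshold tuned to the noise level of the evaluation.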

Integrating DVC with MLflow

Combine MLflow (for experiment tracking) with DVC (for data and pipeline versioning):

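import dvc.api
import mlflow
import mlflow.sklearn

# Resolve where DVC stores this version of the file on the default remote.
# Assumes a default remote is configured; rev can be any Git commit, branch, or tag.
data_url = dvc.api.get_url(
    "data/processed.csv",
    repo=".",
    rev="HEAD",
)

# DVC's storage path ends in <first two hash characters>/<remaining characters>,
# so joining the last two segments recovers the full content hash.
data_hash = "".join(data_url.split("/")[-2:])

mlflow.log_param("data_version", data_hash)
# ... training code ...
# `model` is the fitted estimator produced by the training code above.
mlflow.sklearn.log_model(model, "model")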

Now every MLflow experiment logs the exact data version it used. When investigating an old model, you can retrieve that data version and retrain locally.

Key Takeaways

DVC versions datasets like Git versions code, using content hashes to uniquely identify each version.
Data is stored locally or on remote storage (S3, GCS, etc.); only small .dvc metadata files are committed to Git.
DVC pipelines define ML workflows as DAGs (directed acyclic graphs) of steps with dependencies.
Running dvc repro executes only the steps whose inputs changed, saving time and ensuring reproducibility.
Combined with Git tags and MLflow, DVC gives complete reproducibility: data version, code version, and model all tracked together.

Frequently Asked Questions

Can I use DVC with existing large files already in my repository?

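Yes. First stop tracking them in Git with git rm --cached <file>, which leaves the file on disk, then run dvc add <file> to begin tracking it with DVC. This keeps new commits small; copies already in the Git history remain there unless you rewrite that history. Use dvc mv to rename or move a file once DVC tracks it.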

Does DVC require a remote backend?

No, it works locally out of the box. Remote storage is optional and recommended for teams. For a solo project, the local cache in .dvc/cache is sufficient.

How does DVC handle branching and merging in Git?

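When you switch Git branches, run dvc checkout to sync the workspace with the .dvc files on that branch, or dvc pull if the data is not yet in your local cache. If two branches modify the same data file, Git reports a conflict in the .dvc file. Resolve it like any other text conflict by choosing which hash to keep, then run dvc checkout to restore the matching data. The hashes themselves cannot be merged, so combining both versions of a dataset means rebuilding the file and running dvc add again.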

Can I version models with DVC?

Yes. Add model files with dvc add model.pkl or define them as outputs in dvc.yaml pipelines. This is less common than versioning data, since MLflow handles model versioning, but it is possible.

What if my dataset is too large to push to remote storage frequently?

Commit only the .dvc metadata to Git, which is tiny. Push data to remote storage only when it is stable or when collaborators need it. DVC supports selective pushes: dvc push <target> pushes only the outputs of the given target, which can be a .dvc file, a tracked path, or a stage name.
