MLOps on Cloud: AWS SageMaker and GCP Vertex AI
Cloud platforms offer managed MLOps services that handle infrastructure, scaling, and integration with their ecosystems. Instead of managing MLflow servers, databases, and storage yourself, AWS SageMaker and GCP Vertex AI provide end-to-end ML platforms. In this article, you will learn to build ML pipelines on these platforms, integrating with your Python code.
AWS SageMaker: Managed ML on AWS
AWS SageMaker is a fully managed ML platform providing experiment tracking, model registry, training at scale, and deployment. It integrates with other AWS services (S3, IAM, CloudWatch, Lambda).
Training on SageMaker: Basic Example
Use SageMaker's Python SDK to train a model on managed infrastructure:
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Set up the session and IAM role.
# get_execution_role() works inside SageMaker (notebooks, Studio);
# elsewhere, pass the ARN of an IAM role as a string instead.
session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Define training job parameters
estimator = Estimator(
    # Placeholder: replace with the URI of your own training image in ECR
    image_uri="382416733822.dkr.ecr.us-east-1.amazonaws.com/image_uri",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/output",
)

# Configure hyperparameters
estimator.set_hyperparameters(
    epochs=10,
    learning_rate=0.001,
    batch_size=32,
)

# Train on managed infrastructure; fit() blocks until the job finishes
training_data = f"s3://{bucket}/data/training/"
estimator.fit({"training": TrainingInput(s3_data=training_data)})
print(f"Training job completed. Model: {estimator.model_data}")
SageMaker handles the infrastructure: spins up an EC2 instance, runs your training code, and saves the model to S3. You pay only for the compute used.
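Because the estimator above points at a custom image, the code inside that container has to follow SageMaker's container contract: hyperparameters arrive as a JSON file of string values under /opt/ml/input/config, each input channel is placed under /opt/ml/input/data/<channel>, and whatever is written to /opt/ml/model is uploaded to output_path. A minimal sketch of the reading side follows. The names `load_hyperparameters`, `channel_dir`, and the `base_dir` argument are illustrative and not part of the SageMaker SDK.

```python
import json
from pathlib import Path


def load_hyperparameters(base_dir="/opt/ml"):
    """Read hyperparameters.json and cast numeric strings to numbers.

    SageMaker serializes every hyperparameter as a string, so "10" and
    "0.001" need converting before use. base_dir is overridable so the
    function can be exercised locally (an assumption of this sketch).
    """
    path = Path(base_dir) / "input" / "config" / "hyperparameters.json"
    raw = json.loads(path.read_text())
    parsed = {}
    for key, value in raw.items():
        try:
            parsed[key] = int(value)
        except ValueError:
            try:
                parsed[key] = float(value)
            except ValueError:
                parsed[key] = value  # leave non-numeric values as strings
    return parsed


def channel_dir(channel, base_dir="/opt/ml"):
    """Return the directory where a channel's input data is placed."""
    return Path(base_dir) / "input" / "data" / channel
```

With the hyperparameters set above, `load_hyperparameters()` inside the container would yield integers for epochs and batch_size and a float for learning_rate, and `channel_dir("training")` would point at the data passed to fit().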
SageMaker Pipelines for Workflows
Build end-to-end ML workflows using SageMaker Pipelines:
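import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import CreateModelInput, TrainingInput
from sagemaker.model import Model
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CreateModelStep, ProcessingStep, TrainingStep

# Reuses `session` and `role` from the training example.

# Step 1: Process data
script_processor = ScriptProcessor(
    role=role,
    # Look up the scikit-learn image for the session's region; the registry
    # account ID in the image URI differs from region to region.
    image_uri=sagemaker.image_uris.retrieve(
        "sklearn",
        region=session.boto_region_name,
        version="0.23-1",
        instance_type="ml.m5.xlarge",
    ),
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
processing_step = ProcessingStep(
    name="ProcessingStep",
    processor=script_processor,
    code="scripts/preprocess.py",
    job_arguments=["--input", "s3://input/", "--output", "s3://processed/"],
)

# Step 2: Train model
estimator = Estimator(
    image_uri="...",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://models/",
)
training_step = TrainingStep(
    name="TrainingStep",
    estimator=estimator,
    inputs={"training": TrainingInput(s3_data="s3://processed/")},
    # No step property links these two steps, so state the order explicitly;
    # otherwise the pipeline may run them in parallel.
    depends_on=["ProcessingStep"],
)

# Step 3: Create model from the artifacts the training step produces.
# estimator.create_model() cannot be used here: no model exists until
# the pipeline has actually run.
model = Model(
    image_uri=estimator.image_uri,  # assumes the training image also serves
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=session,
)
create_model_step = CreateModelStep(
    name="CreateModelStep",
    model=model,
    inputs=CreateModelInput(instance_type="ml.m5.xlarge"),
)

# Combine into pipeline
pipeline = Pipeline(
    name="ML-Pipeline",
    parameters=[],
    steps=[processing_step, training_step, create_model_step],
)

# Create or update the pipeline definition, then execute it
pipeline.upsert(role_arn=role)
execution = pipeline.start()

# Block until the execution finishes
execution.wait()
print("Pipeline completed!")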
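Conceptually, a pipeline is a dependency graph: a step runs once everything it depends on has finished, and steps with no dependency between them may run in parallel. The stdlib sketch below illustrates only that ordering idea. It is not how SageMaker is implemented, and the step names simply mirror the example above.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
dependencies = {
    "ProcessingStep": set(),
    "TrainingStep": {"ProcessingStep"},
    "CreateModelStep": {"TrainingStep"},
}

# static_order() yields steps so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['ProcessingStep', 'TrainingStep', 'CreateModelStep']
```

If TrainingStep had no declared dependency on ProcessingStep, nothing in the graph would force preprocessing to finish first, which is why the ordering between steps has to be expressed either through data references or explicitly.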
Tracking Experiments with SageMaker Experiments
SageMaker integrates experiment tracking:
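from sagemaker.analytics import ExperimentAnalytics
from sagemaker.experiments.run import Run
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data so the snippet is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Log an experiment (names use hyphens; underscores are rejected)
with Run(
    experiment_name="iris-classification",
    run_name="rf-v1",
) as run:
    run.log_parameter("n_estimators", 100)
    run.log_parameter("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)

# Later, query experiments
analytics = ExperimentAnalytics(experiment_name="iris-classification")
df = analytics.dataframe()
# Metric columns carry a statistic suffix (for example "accuracy - Last");
# exact column names can vary by SDK version, so select them by prefix.
metric_cols = [c for c in df.columns if c.startswith("accuracy")]
print(df[["TrialComponentName"] + metric_cols])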
GCP Vertex AI: Google's MLOps Platform
Google Cloud's Vertex AI is similar to SageMaker but deeply integrated with Google's ecosystem (BigQuery, GCS, Pub/Sub). It is the successor to Google's earlier AI Platform service.
Training on Vertex AI
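from google.cloud import aiplatform

# Initialize; the staging bucket is where the SDK uploads train.py
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://bucket",
)

# Define a custom training job.
# Image URIs follow Google's prebuilt container naming; check the current
# container list for available versions before relying on them.
job = aiplatform.CustomTrainingJob(
    display_name="iris-classification-training",
    script_path="train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",
    requirements=["scikit-learn==1.5.0", "pandas==2.2.0"],
    # Without a serving image, job.run() returns None instead of a Model
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

# Run training. scikit-learn does not use GPUs, so no accelerator is
# requested; for a GPU image, add accelerator_type (for example
# "NVIDIA_TESLA_T4") and accelerator_count. The K80 has been retired.
model = job.run(
    replica_count=1,
    machine_type="n1-standard-4",
)
print(f"Model trained: {model.resource_name}")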
Vertex AI Pipelines
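Define ML workflows with the Kubeflow Pipelines (KFP) SDK. Components exchange data through typed artifacts, which is what tells the pipeline that one step depends on another: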
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output, component


@component(base_image="python:3.11", packages_to_install=["pandas", "gcsfs"])
def preprocess_data(
    input_path: str,
    output_data: Output[Dataset],
):
    """Drop rows with missing values and write the cleaned CSV."""
    import pandas as pd

    df = pd.read_csv(input_path)  # gcsfs lets pandas read gs:// paths
    df = df.dropna()
    df.to_csv(output_data.path, index=False)
    print(f"Data preprocessed: {output_data.path}")


@component(base_image="python:3.11", packages_to_install=["pandas", "scikit-learn"])
def train_model(
    training_data: Input[Dataset],
    model_output: Output[Model],
):
    """Train a model on the preprocessed data."""
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(training_data.path)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    with open(model_output.path, "wb") as f:
        pickle.dump(model, f)


@dsl.pipeline(
    name="iris-pipeline",
    description="Iris classification pipeline",
)
def iris_pipeline(input_path: str = "gs://bucket/input.csv"):
    preprocess_task = preprocess_data(input_path=input_path)
    # Consuming the Dataset output makes training wait for preprocessing
    train_task = train_model(
        training_data=preprocess_task.outputs["output_data"],
    )


# Compile and run
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=iris_pipeline,
    package_path="iris_pipeline.yaml",
)

# Submit to Vertex AI
from google.cloud.aiplatform import pipeline_jobs

job = pipeline_jobs.PipelineJob(
    display_name="iris-pipeline-run",
    template_path="iris_pipeline.yaml",
    pipeline_root="gs://bucket/pipeline-root/",
    project="my-project",
    location="us-central1",
)
job.run()
Vertex AI Model Registry
Register models to Vertex AI's registry:
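from google.cloud import aiplatform

# Upload model to registry.
# artifact_uri is a directory; the prebuilt scikit-learn container looks
# for a file named model.pkl or model.joblib inside it. The container's
# scikit-learn version should match the one used for training.
model = aiplatform.Model.upload(
    display_name="iris-classifier",
    artifact_uri="gs://bucket/model/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

# Deploy to an endpoint (replica counts are given as min/max)
endpoint = model.deploy(
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=1,
)

# Get predictions: one list of feature values per instance
predictions = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
print(predictions)

# Undeploy to stop paying for the serving nodes
endpoint.undeploy_all()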
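The prebuilt scikit-learn serving containers expect each instance as a list of feature values in the same order used at training time, rather than a dictionary of named features. A small helper can make that ordering explicit. `to_instances` and the iris feature names below are hypothetical, chosen only for illustration.

```python
FEATURE_ORDER = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def to_instances(rows, feature_order=FEATURE_ORDER):
    """Convert dict-style records into the list-of-lists request layout."""
    instances = []
    for row in rows:
        missing = [name for name in feature_order if name not in row]
        if missing:
            raise ValueError(f"missing features: {missing}")
        instances.append([float(row[name]) for name in feature_order])
    return instances


rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
]
instances = to_instances(rows)
print(instances)  # [[5.1, 3.5, 1.4, 0.2]]
```

The resulting list can be passed as the `instances` argument of `endpoint.predict`. Failing early on a missing feature is a deliberate choice here, since a silently shifted column would still produce a prediction, just a wrong one.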
Comparing SageMaker vs. Vertex AI
| Feature | SageMaker | Vertex AI |
|---|---|---|
| Experiment tracking | SageMaker Experiments | Vertex AI Experiments |
| Model registry | SageMaker Model Registry | Vertex AI Model Registry |
| Pipelines | SageMaker Pipelines (proprietary) | Vertex AI Pipelines (Kubeflow-based) |
| Integration | Deep AWS integration (S3, Lambda, CloudWatch) | Deep GCP integration (BigQuery, GCS, Pub/Sub) |
| Community | Large AWS ML community | Growing GCP ML community |
| Cost | Pay per compute hour + storage | Pay per compute hour + storage |
| Ease of use | SageMaker console is intuitive | Vertex AI console is feature-rich |
For AWS-heavy organizations, SageMaker is natural. For GCP users, Vertex AI is seamless.
Running Python Scripts on Managed Platforms
Both platforms let you run custom Python training scripts. Package your code:
# Directory structure
project/
    train.py
    requirements.txt
    preprocessing.py
    evaluation.py
Define requirements.txt:
scikit-learn==1.5.0
pandas==2.2.0
mlflow==2.12.0
In train.py, accept hyperparameters as arguments:
# train.py
import argparse
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters and paths come in as command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--max_depth", type=int, default=10)
parser.add_argument("--train_data", type=str, required=True)
parser.add_argument("--model_output", type=str, required=True)
args = parser.parse_args()

# Load and train (the last CSV column is treated as the label)
df = pd.read_csv(args.train_data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
model = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth,
)
model.fit(X, y)

# Save model
with open(args.model_output, "wb") as f:
    pickle.dump(model, f)
print(f"Model saved to {args.model_output}")
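SageMaker and Vertex AI run this script on managed infrastructure. The values for --train_data and --model_output come from your job or pipeline definition: on SageMaker they are set as hyperparameters, which script-mode containers forward to the script as command-line arguments, and on Vertex AI they are passed through the args list of job.run().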
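Each platform also sets environment variables that a script can fall back on when an argument is omitted. SageMaker's framework containers set SM_MODEL_DIR, and Vertex AI sets AIP_MODEL_DIR to a gs:// URI when an output directory is configured; custom training jobs can reach that location through the Cloud Storage FUSE mount under /gcs/. A sketch of a resolver built on those variables follows. `resolve_model_dir` and the "./model" default are assumptions of this example, not part of either SDK.

```python
import os


def resolve_model_dir(cli_value=None, env=None):
    """Pick the model output directory: CLI flag first, then platform env vars."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    if "SM_MODEL_DIR" in env:  # set by SageMaker framework containers
        return env["SM_MODEL_DIR"]
    if "AIP_MODEL_DIR" in env:  # set by Vertex AI custom training
        uri = env["AIP_MODEL_DIR"]
        # Buckets are mounted under /gcs/, so gs://b/x maps to /gcs/b/x
        if uri.startswith("gs://"):
            return "/gcs/" + uri[len("gs://"):]
        return uri
    return "./model"  # local development default (assumption)
```

In train.py this would let --model_output become optional, for example `resolve_model_dir(args.model_output)`, so the same script runs unchanged on either platform or on a laptop.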
Key Takeaways
- AWS SageMaker and GCP Vertex AI are fully managed MLOps platforms with experiment tracking, model registries, and pipeline orchestration.
- Both handle infrastructure (scaling, resource provisioning) so you focus on model development.
- SageMaker integrates deeply with AWS (S3, Lambda, CloudWatch). Vertex AI integrates with GCP (BigQuery, Pub/Sub).
- Pipelines define multi-step workflows (preprocess -> train -> evaluate -> deploy) as code.
- Both platforms support custom Python training code packaged with dependencies.
Frequently Asked Questions
Should I use SageMaker/Vertex AI or self-hosted MLflow?
Use managed platforms if: you are on AWS/GCP already, need strong scaling and integration, and can afford cloud costs. Use self-hosted MLflow if: you need maximum flexibility, want to avoid cloud lock-in, or operate on-premise.
Can I integrate SageMaker with my MLflow setup?
Partially. You can use MLflow for experiment tracking locally and push models to SageMaker for deployment. Full integration is not seamless; choose one platform end-to-end for simplicity.
How do I handle large datasets on these platforms?
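Both platforms read training data directly from object storage: S3 on AWS, GCS on GCP. By default SageMaker copies the whole dataset to the training instance before the job starts (File mode); its streaming input modes (Pipe and FastFile) instead read objects on demand, which avoids the up-front download for large datasets. Vertex AI custom training jobs can read GCS buckets through a file-system mount under /gcs/. For very large datasets, use a separate preprocessing stage (SageMaker Processing, or a preprocessing component in Vertex AI Pipelines) so that training reads data that is already filtered and compact.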
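However the data reaches the training instance, the script itself can keep memory bounded by iterating over a file in fixed-size batches instead of loading it whole. A stdlib-only sketch, where `iter_batches` is an illustrative name:

```python
import csv
from itertools import islice


def iter_batches(path, batch_size):
    """Yield lists of rows (as dicts) without loading the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # islice pulls at most batch_size rows from the open reader
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch
```

This pairs naturally with models that support incremental fitting; a model that needs the full dataset at once, such as the random forest used in this article, would still have to accumulate the batches.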
Can I run these services locally for development?
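SageMaker offers a local mode that runs training and inference containers on your own machine through Docker. Vertex AI has no equivalent switch in its Python SDK, although the gcloud CLI can run a custom training container locally (gcloud ai custom-jobs local-run). For everyday development, local MLflow or Docker-based setups work with either platform; scale to the cloud when ready.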
What are typical costs?
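Rates vary by region and change over time, so check the current pricing pages. As a rough guide, an ml.m5.xlarge SageMaker training instance costs up to about $0.50/hour, and comparable Vertex AI machine types are priced similarly. Training 5 models a day for 1 hour each comes to 150 instance-hours a month, or about $75 at $0.50/hour; larger or GPU instances push this higher. Storage (S3, GCS) is billed separately. Spot capacity, such as SageMaker Managed Spot Training, can cut compute costs by 70% or more in exchange for possible interruptions.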
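The estimate above is simple arithmetic, and it is worth parameterizing because hourly rates differ by region and instance type. The $0.50 rate and the 70% spot discount below are this article's illustrative figures, not quoted prices, and `monthly_compute_cost` is a hypothetical helper.

```python
def monthly_compute_cost(jobs_per_day, hours_per_job, hourly_rate,
                         days=30, spot_discount=0.0):
    """Estimate monthly training compute cost in dollars."""
    instance_hours = jobs_per_day * hours_per_job * days
    return instance_hours * hourly_rate * (1.0 - spot_discount)


on_demand = monthly_compute_cost(5, 1, 0.50)
with_spot = monthly_compute_cost(5, 1, 0.50, spot_discount=0.70)
print(f"on-demand: ${on_demand:.2f}, spot: ${with_spot:.2f}")
# on-demand: $75.00, spot: $22.50
```

Storage and endpoint hosting are not included; an endpoint left deployed bills continuously, which often dominates the training cost.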
Further Reading
- AWS SageMaker Documentation — Comprehensive guide.
- GCP Vertex AI Documentation — Official Vertex AI reference.
- SageMaker Pipelines Best Practices — Production patterns.
- Vertex AI Pipelines with Kubeflow — Pipeline construction guide.