MLOps on Cloud: AWS SageMaker and GCP Vertex AI

Cloud platforms offer managed MLOps services that handle infrastructure, scaling, and integration with their ecosystems. Instead of running MLflow servers, databases, and storage yourself, you can use AWS SageMaker or GCP Vertex AI as an end-to-end ML platform. In this article, you will learn to build ML pipelines on both platforms and integrate them with your own Python code.

AWS SageMaker: Managed ML on AWS

AWS SageMaker is a fully managed ML platform providing experiment tracking, model registry, training at scale, and deployment. It integrates with other AWS services (S3, IAM, CloudWatch, Lambda).

Training on SageMaker: Basic Example

Use SageMaker's Python SDK to train a model on managed infrastructure:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Setup session and IAM role
session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Define training job parameters
estimator = Estimator(
    # Placeholder: replace with the ECR URI of your own training image
    image_uri="382416733822.dkr.ecr.us-east-1.amazonaws.com/image_uri",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/output",
)

# Configure hyperparameters
estimator.set_hyperparameters(
    epochs=10,
    learning_rate=0.001,
    batch_size=32,
)

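# Train on managed infrastructure; the channel name "training" is what
# the container sees as its input channel
training_data = f"s3://{bucket}/data/training/"
estimator.fit({"training": TrainingInput(training_data, content_type="text/csv")})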

print(f"Training job completed. Model: {estimator.model_data}")

SageMaker handles the infrastructure: it provisions the training instance, runs your training container, saves the model artifact to S3, and shuts the instance down when the job ends. You pay only for the compute used.

SageMaker Pipelines for Workflows

Build end-to-end ML workflows using SageMaker Pipelines:

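import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import CreateModelInput
from sagemaker.model import Model
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CreateModelStep, ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Step 1: Process data
# Look up the scikit-learn image for the session's region instead of
# hard-coding a registry account ID, which differs between regions
sklearn_image = image_uris.retrieve(
    framework="sklearn",
    region=session.boto_region_name,
    version="0.23-1",
)

script_processor = ScriptProcessor(
    role=role,
    image_uri=sklearn_image,
    command=["python3"],  # interpreter used to run the processing script
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processing_step = ProcessingStep(
    name="ProcessingStep",
    processor=script_processor,
    code="scripts/preprocess.py",
    job_arguments=["--input", "s3://input/", "--output", "s3://processed/"],
)

# Step 2: Train model
estimator = Estimator(
    image_uri="...",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://models/",
)

training_step = TrainingStep(
    name="TrainingStep",
    estimator=estimator,
    inputs={"training": "s3://processed/"},
    # No output property of the processing step is referenced here,
    # so the ordering must be declared explicitly
    depends_on=["ProcessingStep"],
)

# Step 3: Create model from the artifact the training step produces.
# The artifact location is only known at run time, so reference the step
# property rather than calling estimator.create_model() before training.
# (Recent SDK versions deprecate CreateModelStep in favor of ModelStep.)
model = Model(
    image_uri=estimator.image_uri,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=session,
)

create_model_step = CreateModelStep(
    name="CreateModelStep",
    model=model,
    inputs=CreateModelInput(instance_type="ml.m5.xlarge"),
)

# Combine into pipeline
pipeline = Pipeline(
    name="ML-Pipeline",
    parameters=[],
    steps=[processing_step, training_step, create_model_step],
)

# Create or update the pipeline definition, then execute it
pipeline.upsert(role_arn=role)
execution = pipeline.start()

# Block until the execution finishes
execution.wait()
print("Pipeline completed!")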
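SageMaker Pipelines derives the execution order from the dependencies between steps, not from the order of the steps list. The following is a minimal sketch in plain Python of how a dependency map resolves to an execution order. It is an illustration only, not SageMaker's implementation; the dependencies dictionary and the execution_order helper are hypothetical names introduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step lists the steps it must wait for.
dependencies = {
    "CreateModelStep": {"TrainingStep"},
    "TrainingStep": {"ProcessingStep"},
    "ProcessingStep": set(),
}


def execution_order(deps):
    """Return one valid execution order for a step dependency graph."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Even though the dictionary lists the steps in reverse, the processing step is scheduled first because the other two depend on it, directly or indirectly.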

Tracking Experiments with SageMaker Experiments

SageMaker integrates experiment tracking:

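from sagemaker.analytics import ExperimentAnalytics
from sagemaker.experiments.run import Run
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset and hold out 20% for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Log an experiment
with Run(
    experiment_name="iris_classification",
    run_name="rf_v1",
) as run:
    run.log_parameter("n_estimators", 100)
    run.log_parameter("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    run.log_metric("accuracy", accuracy)

# Later, query experiments
analytics = ExperimentAnalytics(experiment_name="iris_classification")
df = analytics.dataframe()

# Metric columns carry a statistic suffix (for example "accuracy - Last"),
# so select them by prefix rather than by exact name
metric_cols = [col for col in df.columns if col.startswith("accuracy")]
print(df[["DisplayName", *metric_cols]].sort_values(metric_cols, ascending=False))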

GCP Vertex AI: Google's MLOps Platform

Google Cloud's Vertex AI is similar to SageMaker but deeply integrated with Google's ecosystem (BigQuery, Cloud Storage, Pub/Sub). It is the successor to the older AI Platform service.

Training on Vertex AI

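from google.cloud import aiplatform

# Initialize; the staging bucket holds the packaged training script
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://bucket",
)

# Define a custom training job.
# The container URIs are illustrative; confirm them against the current
# list of Vertex AI prebuilt containers before use.
job = aiplatform.CustomTrainingJob(
    display_name="iris-classification-training",
    script_path="train.py",
    container_uri="gcr.io/cloud-aiml/training/tf-gpu.2-13:latest",
    requirements=["scikit-learn==1.5.0", "pandas==2.2.0"],
    # Without a serving container, job.run() returns None instead of a Model
    model_serving_container_image_uri="gcr.io/cloud-aiml/prediction/sklearn-cpu.1-1:latest",
)

# Run training
model = job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    # K80 GPUs have been retired on Google Cloud; T4 is a common replacement
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

print(f"Model trained: {model.resource_name}")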

Vertex AI Pipelines

Define ML workflows using Kubeflow Pipeline syntax:

from google.cloud.aiplatform import pipeline_jobs
from kfp import compiler, dsl
from kfp.dsl import Dataset, Input, Model, Output, component


@component(
    base_image="python:3.11",
    # gcsfs lets pandas read gs:// paths directly
    packages_to_install=["pandas==2.2.0", "gcsfs"],
)
def preprocess_data(
    input_path: str,
    output_data: Output[Dataset],
):
    """Drop rows with missing values and write the cleaned CSV."""
    import pandas as pd

    df = pd.read_csv(input_path)
    df = df.dropna()
    # Write to the artifact path so downstream steps can consume the output
    df.to_csv(output_data.path, index=False)
    print(f"Data preprocessed: {output_data.path}")


@component(
    base_image="python:3.11",
    packages_to_install=["pandas==2.2.0", "scikit-learn==1.5.0"],
)
def train_model(
    training_data: Input[Dataset],
    model_output: Output[Model],
):
    """Train a model."""
    import pickle

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(training_data.path)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    with open(model_output.path, "wb") as f:
        pickle.dump(model, f)


@dsl.pipeline(
    name="iris-pipeline",
    description="Iris classification pipeline",
)
def iris_pipeline(input_path: str = "gs://bucket/input.csv"):
    preprocess_task = preprocess_data(input_path=input_path)

    # Passing the output artifact makes train_model depend on preprocess_data
    train_task = train_model(
        training_data=preprocess_task.outputs["output_data"],
    )


# Compile and run
compiler.Compiler().compile(
    pipeline_func=iris_pipeline,
    package_path="iris_pipeline.yaml",
)

# Submit to Vertex AI
job = pipeline_jobs.PipelineJob(
    display_name="iris-pipeline-run",
    template_path="iris_pipeline.yaml",
    pipeline_root="gs://bucket/pipeline-root/",
    project="my-project",
    location="us-central1",
)

job.run()

Vertex AI Model Registry

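from google.cloud import aiplatform

# Upload model to registry.
# artifact_uri is the directory that contains model.pkl, not the file itself.
# The serving container URI is illustrative; confirm it against the current
# list of Vertex AI prebuilt containers.
model = aiplatform.Model.upload(
    display_name="iris-classifier",
    artifact_uri="gs://bucket/model/",
    serving_container_image_uri="gcr.io/cloud-aiml/prediction/sklearn-cpu.1-1:latest",
)

# Deploy to an endpoint with a fixed single replica
endpoint = model.deploy(
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=1,
)

# Get predictions; each instance is one row of feature values
predictions = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
print(predictions)

# Undeploy
endpoint.undeploy_all()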
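When you call an endpoint over REST rather than through the SDK, the request body wraps the rows in an "instances" list. The following is a small sketch of building that body, assuming a scikit-learn model where each instance is a flat list of numeric features. The build_predict_payload helper and the sample rows are hypothetical, introduced only for illustration.

```python
import json


def build_predict_payload(rows):
    """Wrap feature rows in the {"instances": [...]} request body."""
    if not rows:
        raise ValueError("at least one instance is required")
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            raise ValueError("all instances must have the same number of features")
    # Convert every value to float so the JSON body is uniformly numeric
    return json.dumps({"instances": [[float(v) for v in row] for row in rows]})


payload = build_predict_payload([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(payload)
```

Validating the row width on the client catches malformed requests before they reach the endpoint, where the error message is usually less specific.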

Comparing SageMaker vs. Vertex AI

Feature	SageMaker	Vertex AI
Experiment tracking	SageMaker Experiments	Vertex AI Experiments
Model registry	SageMaker Model Registry	Vertex AI Model Registry
Pipelines	SageMaker Pipelines (proprietary)	Vertex AI Pipelines (Kubeflow-based)
Integration	Deep AWS integration (S3, Lambda, CloudWatch)	Deep GCP integration (BigQuery, GCS, Pub/Sub)
Community	Large AWS ML community	Growing GCP ML community
Cost	Pay per compute hour + storage	Pay per compute hour + storage
Ease of use	SageMaker console is intuitive	Vertex AI console is feature-rich

For AWS-heavy organizations, SageMaker is natural. For GCP users, Vertex AI is seamless.

Running Python Scripts on Managed Platforms

Both platforms let you run custom Python training scripts. Package your code:

# Directory structure
project/
  train.py
  requirements.txt
  preprocessing.py
  evaluation.py

Define requirements.txt:

scikit-learn==1.5.0
pandas==2.2.0
mlflow==2.12.0

In train.py, accept hyperparameters as arguments:

# train.py
import argparse
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--max_depth", type=int, default=10)
parser.add_argument("--train_data", type=str, required=True)
parser.add_argument("--model_output", type=str, required=True)

args = parser.parse_args()

# Load and train
df = pd.read_csv(args.train_data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

model = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth,
)
model.fit(X, y)

# Save model
with open(args.model_output, "wb") as f:
    pickle.dump(model, f)

print(f"Model saved to {args.model_output}")

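SageMaker and Vertex AI execute this script on managed infrastructure and pass it the command-line arguments you configure in the job or pipeline definition. SageMaker turns each hyperparameter into a --name value flag, and Vertex AI forwards the args list given to job.run(). That is how --train_data and --model_output reach the script.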
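Both platforms also expose well-known locations through environment variables: SageMaker sets SM_CHANNEL_TRAINING (for a channel named "training") and SM_MODEL_DIR, and Vertex AI sets AIP_MODEL_DIR. A minimal sketch, assuming you want one script that runs on either platform, is to use these variables as argument defaults. The parse_args helper is a hypothetical name, and note that these variables point to directories (AIP_MODEL_DIR is a gs:// URI), so a real script would still append file names.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse training arguments, falling back to platform environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    # SageMaker: directory holding the "training" channel data
    parser.add_argument("--train_data", default=env.get("SM_CHANNEL_TRAINING"))
    # SageMaker model directory, else the Vertex AI model directory
    parser.add_argument(
        "--model_output",
        default=env.get("SM_MODEL_DIR") or env.get("AIP_MODEL_DIR"),
    )
    args = parser.parse_args(argv)
    if not args.train_data or not args.model_output:
        parser.error("--train_data and --model_output are required outside a managed job")
    return args


# Simulate a SageMaker container environment
args = parse_args(
    ["--n_estimators", "200"],
    env={
        "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
        "SM_MODEL_DIR": "/opt/ml/model",
    },
)
print(args.n_estimators, args.train_data, args.model_output)
```

Explicit flags still win over the environment, so the same script keeps working when you run it locally with paths of your choice.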

Key Takeaways

AWS SageMaker and GCP Vertex AI are fully managed MLOps platforms with experiment tracking, model registries, and pipeline orchestration.
Both handle infrastructure (scaling, resource provisioning) so you focus on model development.
SageMaker integrates deeply with AWS (S3, Lambda, CloudWatch). Vertex AI integrates with GCP (BigQuery, Pub/Sub).
Pipelines define multi-step workflows (preprocess -> train -> evaluate -> deploy) as code.
Both platforms support custom Python training code packaged with dependencies.

Frequently Asked Questions

Should I use SageMaker/Vertex AI or self-hosted MLflow?

Use managed platforms if: you are on AWS/GCP already, need strong scaling and integration, and can afford cloud costs. Use self-hosted MLflow if: you need maximum flexibility, want to avoid cloud lock-in, or operate on-premise.

Can I integrate SageMaker with my MLflow setup?

Partially. You can use MLflow for experiment tracking locally and push models to SageMaker for deployment. Full integration is not seamless; choose one platform end-to-end for simplicity.

How do I handle large datasets on these platforms?

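Both platforms read training data directly from object storage: S3 on AWS, Cloud Storage on GCP. How much is copied depends on the input mode. SageMaker's default File mode downloads the whole channel to the training instance before the job starts, while Pipe and FastFile modes stream objects on demand. Vertex AI custom training mounts Cloud Storage buckets under /gcs/ through Cloud Storage FUSE, so the script reads files as it needs them. For very large datasets, use data preprocessing pipelines (SageMaker Processing, Vertex AI Pipelines) to prepare data in-place.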
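Whichever platform streams the data, the training script itself should avoid loading the whole file into memory. The following is a minimal sketch of reading a CSV in fixed-size batches with the standard library. The iter_batches helper is a hypothetical name, and the in-memory sample stands in for a file opened from a mounted or streamed location.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of at most batch_size CSV rows, skipping the header."""
    reader = csv.reader(lines)
    next(reader, None)  # discard the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Toy data standing in for a large file on object storage
sample = io.StringIO(
    "f1,f2,label\n5.1,3.5,0\n4.9,3.0,0\n6.2,2.9,1\n5.9,3.0,1\n6.7,3.1,2\n"
)
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
print(batch_sizes)
```

Only one batch is held in memory at a time, which suits models that support incremental training.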

Can I run these services locally for development?

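SageMaker offers a local mode in its Python SDK that runs training containers on your own machine. Vertex AI has no equivalent SDK mode, although the gcloud ai custom-jobs local-run command can build and run a training container locally. Use local MLflow or Docker-based setups for development, then scale to the cloud when ready.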

What are typical costs?

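Assume roughly $0.50/hour for an m5.xlarge-class training instance on SageMaker and a similar rate on Vertex AI; check current pricing for your region. Training 5 models a day for 1 hour each is about 150 compute hours a month, or roughly $75 at that rate, rising toward $150 on larger instances. Storage (S3, GCS) adds to this. Spot capacity, such as SageMaker managed spot training, can cut compute costs by up to about 70%, provided your jobs tolerate interruption.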
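The estimate above is simple arithmetic, sketched here so you can plug in your own numbers. The rates and the 70% spot discount are the assumptions from the answer, not quoted prices, and monthly_compute_cost is a hypothetical helper.

```python
def monthly_compute_cost(jobs_per_day, hours_per_job, hourly_rate, days=30, spot_discount=0.0):
    """Estimate monthly training compute cost in dollars."""
    hours = jobs_per_day * hours_per_job * days
    return hours * hourly_rate * (1.0 - spot_discount)


low = monthly_compute_cost(5, 1, 0.50)    # assumed $0.50/hour
high = monthly_compute_cost(5, 1, 1.00)   # assumed larger instance at $1.00/hour
spot = monthly_compute_cost(5, 1, 0.50, spot_discount=0.70)

print(f"On-demand: ${low:.2f} to ${high:.2f} per month")
print(f"With spot: ${spot:.2f} per month")
```

Storage and data transfer are excluded, so treat the result as a lower bound on the bill.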
