
Production-Ready Python IaC: Scaling Infrastructure

Deploying a single EC2 instance with Pulumi or Boto3 is straightforward. Managing 100+ resources across multiple regions while handling disasters, optimizing costs, and onboarding team members requires architecture, governance, and discipline. This article covers production patterns: modular design, monitoring, disaster recovery, cost controls, and secrets at scale.

Modular Infrastructure Architecture

As your infrastructure grows, monolithic stacks become unmaintainable. Decompose your infrastructure into logical modules (data, compute, networking, security) managed by separate stacks or Pulumi projects.

Recommended structure:

infrastructure/
├── 01-networking/ (VPCs, subnets, NAT, routing)
├── 02-security/ (IAM roles, security groups, KMS keys)
├── 03-data/ (RDS, ElastiCache, S3)
├── 04-compute/ (ECS, Lambda, Auto Scaling Groups)
└── 05-observability/ (CloudWatch, DataDog, monitoring)

Each folder is a Pulumi project with its own stacks (dev, staging, prod). Stack references link them:

# In 04-compute/__main__.py
import pulumi

# "organization" stands in for your Pulumi organization name
data_stack = pulumi.StackReference(f"organization/03-data/{pulumi.get_stack()}")
db_endpoint = data_stack.get_output('rds_endpoint')
cache_endpoint = data_stack.get_output('redis_endpoint')

# Create application resources using database endpoints

Benefits:

  • Teams own specific stacks (networking team owns 01, data team owns 03).
  • Deploying one module doesn't risk others.
  • Easier to version and test changes.
  • Clear separation of concerns.
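
Splitting stacks this way also means they have to be deployed in dependency order. The following is a minimal sketch of deriving that order with the standard library. The dependency map is an assumption made for illustration (the folder layout only implies it through the numeric prefixes), and both helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each project lists the projects whose
# stack outputs it reads through a StackReference.
DEPENDENCIES = {
    "01-networking": set(),
    "02-security": {"01-networking"},
    "03-data": {"01-networking", "02-security"},
    "04-compute": {"01-networking", "02-security", "03-data"},
    "05-observability": {"04-compute"},
}


def deployment_order(dependencies):
    """Return project names ordered so that every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


def stack_reference_name(organization, project, stack):
    """Build the fully qualified name passed to pulumi.StackReference."""
    return f"{organization}/{project}/{stack}"


order = deployment_order(DEPENDENCIES)
```

A CI pipeline could loop over `order` and run `pulumi up` in each project directory. A cycle in the map raises `graphlib.CycleError`, which is a useful early signal that two stacks reference each other.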

Monitoring and Observability

Infrastructure without monitoring is like flying blind. Wire up CloudWatch, DataDog, or equivalent before production.

CloudWatch logs and metrics:

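import pulumi
import pulumi_aws as aws

# Create CloudWatch log group for application
app_log_group = aws.cloudwatch.LogGroup('app-logs',
    name='/aws/app/prod',
    retention_in_days=30
)

# Create CloudWatch alarm for high CPU
# (instance and sns_topic_arn are assumed to be defined elsewhere in the program)
cpu_alarm = aws.cloudwatch.MetricAlarm('high-cpu',
    name='app-prod-high-cpu',
    comparison_operator='GreaterThanThreshold',
    evaluation_periods=2,  # two consecutive 5-minute periods
    metric_name='CPUUtilization',
    namespace='AWS/EC2',
    period=300,
    statistic='Average',
    threshold=80,
    dimensions={'InstanceId': instance.id},
    alarm_actions=[sns_topic_arn]
)

# Create custom metric for business logic: count ERROR lines in the app log group
# (the filter pattern, metric name, and namespace are illustrative)
error_metric = aws.cloudwatch.LogMetricFilter('app-error-count',
    log_group_name=app_log_group.name,
    pattern='ERROR',
    metric_transformation=aws.cloudwatch.LogMetricFilterMetricTransformationArgs(
        name='AppErrorCount',
        namespace='App/Prod',
        value='1'
    )
)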
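
To make the alarm settings concrete, here is a simplified model of how a threshold of 80 with two evaluation periods behaves. It is a sketch rather than CloudWatch's actual evaluation logic: it assumes one datapoint per period and no missing data, and the function name is hypothetical.

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=2):
    """Return the state implied by the most recent datapoints.

    Simplified model: one datapoint per period, no missing data, and the
    alarm fires only when every one of the last `evaluation_periods`
    datapoints is strictly greater than the threshold.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Average CPU per 5-minute period, oldest first
steady_high = alarm_state([50.0, 85.0, 90.0])
single_spike = alarm_state([85.0, 90.0, 70.0])
```

Requiring two periods means a single short spike does not page anyone, at the cost of roughly ten minutes of delay before a sustained problem raises the alarm.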

DataDog integration:

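import pulumi
import pulumi_datadog as datadog

# Create monitor for application errors
# Metric alert queries take the form <aggregation>(<window>):<metric query> <comparator> <threshold>.
# The APM metric names below are assumptions; substitute the metrics your service emits.
error_monitor = datadog.Monitor('app-errors',
    type='metric alert',
    message='Alert when error rate exceeds 5%: @pagerduty-team',
    name='High error rate in production',
    query=(
        'sum(last_5m):sum:trace.http.request.errors{service:app-prod}.as_count() / '
        'sum:trace.http.request.hits{service:app-prod}.as_count() > 0.05'
    )
)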

# Tag all resources for DataDog organization
environment = pulumi.get_stack()
pulumi.export('datadog_tags', [f'environment:{environment}', 'service:app', 'team:platform'])

Set up log aggregation (CloudWatch Logs Insights, DataDog, or Splunk) to centralize logs from all services. During an outage, one query across every service is far faster than logging in to individual hosts to read files.

Disaster Recovery and High Availability

Production infrastructure must survive failures. Design for redundancy.

Across availability zones:

import pulumi
import pulumi_aws as aws

config = pulumi.Config()

# Create RDS in multi-AZ (automatic failover)
db = aws.rds.Instance('prod-db',
    allocated_storage=100,
    engine='postgres',
    instance_class='db.m5.large',  # assumed size; pick one that fits your workload
    username='app_admin',  # assumed name
    password=config.require_secret('db_password'),
    multi_az=True,  # Synchronous standby in another AZ
    backup_retention_period=30,
    backup_window='03:00-04:00',
    skip_final_snapshot=False,
    # Keep this stable: a timestamp contains invalid characters and changes on every run
    final_snapshot_identifier=f'prod-db-final-{pulumi.get_stack()}',
    enabled_cloudwatch_logs_exports=['postgresql']
)

# Auto Scaling Group across 3 AZs (lc is a launch configuration defined elsewhere)
asg = aws.autoscaling.Group('app-asg',
    availability_zones=['us-east-1a', 'us-east-1b', 'us-east-1c'],
    desired_capacity=3,
    min_size=3,
    max_size=10,
    launch_configuration=lc.name,
    health_check_type='ELB',
    health_check_grace_period=300
)
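
A quick way to reason about the capacity numbers is to ask how many instances survive the loss of one zone. The sketch below assumes instances are spread as evenly as possible across zones, which is a simplification of how an Auto Scaling Group balances in practice. The function names are hypothetical.

```python
def spread_across_zones(desired_capacity, zones):
    """Distribute instances as evenly as possible, returning a count per zone."""
    base, extra = divmod(desired_capacity, len(zones))
    return {
        zone: base + (1 if index < extra else 0)
        for index, zone in enumerate(zones)
    }


def capacity_after_zone_loss(desired_capacity, zones):
    """Worst-case number of instances still running if a single zone fails."""
    placement = spread_across_zones(desired_capacity, zones)
    return desired_capacity - max(placement.values())


ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
baseline_placement = spread_across_zones(3, ZONES)
baseline_survivors = capacity_after_zone_loss(3, ZONES)
```

With a desired capacity of 3 over three zones, two instances remain after a zone failure, so the baseline should be sized so that two instances can carry the load until replacements launch.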

Across regions:

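# Deploy to primary and secondary regions
primary_region = 'us-east-1'
secondary_region = 'us-west-2'

providers = {}
for region in [primary_region, secondary_region]:
    # Use an explicit provider per region; without it both clusters land in the default region
    providers[region] = aws.Provider(f'aws-{region}', region=region)
    cluster = aws.ecs.Cluster(f'app-cluster-{region}',
        name=f'app-prod-{region}',
        settings=[aws.ecs.ClusterSettingArgs(
            name='containerInsights',
            value='enabled'
        )],
        opts=pulumi.ResourceOptions(provider=providers[region])
    )

# Database replication from primary to secondary
# (db is the primary instance; a cross-region replica references its source by ARN)
db_replica = aws.rds.Instance('prod-db-replica',
    replicate_source_db=db.arn,
    instance_class='db.m5.large',  # assumed size
    skip_final_snapshot=True,
    auto_minor_version_upgrade=False,
    opts=pulumi.ResourceOptions(provider=providers[secondary_region])
)

# Use Route53 to failover across regions (zone and primary_alb are defined elsewhere)
failover_record = aws.route53.Record('app-dns',
    zone_id=zone.id,
    name='app.example.com',
    type='A',
    set_identifier='primary',
    failover_routing_policies=[aws.route53.RecordFailoverRoutingPolicyArgs(type='PRIMARY')],
    aliases=[aws.route53.RecordAliasArgs(
        name=primary_alb.dns_name,
        zone_id=primary_alb.zone_id,
        evaluate_target_health=True
    )]
)
# A matching record with set_identifier='secondary' and type 'SECONDARY' points at the secondary region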

Test failover at least quarterly. Run chaos engineering exercises (Netflix Chaos Monkey) to inject failures and verify recovery.

Cost Optimization

Cloud costs grow invisibly unless you actively manage them. Implement cost controls.

Reserved instances for predictable workloads:

import pulumi
import pulumi_aws as aws

# On-demand instances for bursty workloads
on_demand = aws.ec2.Instance('app-burst',
    instance_type='t3.medium',  # Burstable; cheap but limited by CPU credits under sustained load
    ami=ami_id
)

# Reserved instances for baseline capacity (roughly 33% savings assumed here)
# Purchase via the console or AWS Savings Plans; Pulumi does not manage or track these purchases
baseline_instances = 10
hours_per_year = 8760
on_demand_hourly = 0.0416  # t3.medium on-demand rate, USD per hour
discount = 0.33  # fraction saved compared with on-demand
savings_potential = baseline_instances * hours_per_year * on_demand_hourly * discount
pulumi.export('annual_savings_with_reservations', f'${savings_potential:.0f}')

Auto-scaling based on demand:

# Scale down at night, weekends (recurrence is evaluated in UTC unless time_zone is set)
scaling_schedule_down = aws.autoscaling.Schedule('scale-down-night',
    scheduled_action_name='scale-down-night',
    autoscaling_group_name=asg.name,
    min_size=1,
    max_size=5,
    desired_capacity=1,
    recurrence='0 22 * * *'  # 10 PM daily
)

scaling_schedule_up = aws.autoscaling.Schedule('scale-up-morning',
    scheduled_action_name='scale-up-morning',
    autoscaling_group_name=asg.name,
    min_size=3,
    max_size=10,
    desired_capacity=3,
    recurrence='0 8 * * 1-5'  # 8 AM, weekdays
)
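
The interaction between the two schedules is easy to misread, so here is a small simulation of which capacity is in effect at a given moment. It is a sketch that hard-codes the two recurrence rules above instead of parsing cron expressions, and it assumes the timestamps are in the same time zone the schedules use. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def desired_capacity_at(moment):
    """Return the desired capacity set by the most recent scheduled action.

    Hard-codes the two rules: scale down to 1 at 22:00 every day,
    scale up to 3 at 08:00 Monday through Friday.
    """
    cursor = moment.replace(minute=0, second=0, microsecond=0)
    # Walk backwards hour by hour until the most recent action is found.
    for _ in range(7 * 24):
        if cursor.hour == 22:
            return 1
        if cursor.hour == 8 and cursor.weekday() < 5:  # Monday is 0
            return 3
        cursor -= timedelta(hours=1)
    raise RuntimeError("no scheduled action found in the previous week")


monday_noon = desired_capacity_at(datetime(2024, 1, 1, 12, 0))    # a Monday
saturday_noon = desired_capacity_at(datetime(2024, 1, 6, 12, 0))  # a Saturday
```

Because the scale-up rule only fires on weekdays, the group stays at the night-time size from Friday 22:00 until Monday 08:00.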

Identify and remove unused resources:

# Boto3 script to find idle resources
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')

def find_idle_instances():
    """Find running EC2 instances averaging under 5% CPU over the last 7 days."""
    idle_instances = []
    now = datetime.now(timezone.utc)

    # Paginate so accounts with many instances are scanned completely
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']

                # Check CloudWatch CPU metrics for 7 days (hourly averages)
                metrics = cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                    StartTime=now - timedelta(days=7),
                    EndTime=now,
                    Period=3600,
                    Statistics=['Average']
                )

                if metrics['Datapoints']:
                    avg_cpu = sum(dp['Average'] for dp in metrics['Datapoints']) / len(metrics['Datapoints'])
                    if avg_cpu < 5:  # Less than 5% CPU on average
                        idle_instances.append({
                            'id': instance_id,
                            'name': next(
                                (t['Value'] for t in instance.get('Tags', []) if t['Key'] == 'Name'),
                                'N/A'
                            ),
                            'avg_cpu': avg_cpu
                        })

    return idle_instances

idle = find_idle_instances()
if idle:
    print("Idle instances (candidates for termination):")
    for inst in idle:
        print(f" {inst['id']} ({inst['name']}): {inst['avg_cpu']:.1f}% CPU")

Run this monthly and terminate idle instances or downsize them.

Tagging and Chargeback

Tags are your cost control lever. Tag every resource with cost center, environment, and owner.

tags = {
    'Environment': pulumi.get_stack(),
    'CostCenter': 'engineering',
    'Owner': 'platform-team',
    'ManagedBy': 'pulumi',
    'Project': 'app-platform'
    # Avoid a 'CreatedAt': datetime.now() tag: it changes on every run and shows up as a diff on each update
}

instance = aws.ec2.Instance('app-server',
    ami=ami_id,
    instance_type='t3.medium',
    tags=tags
)

# Use Cost Explorer to attribute costs to tags
# AWS Cost Explorer > Filter by tag > CostCenter
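
Tagging only helps with chargeback if it is enforced. One possible approach, sketched here under the assumption that the three tags named above (cost center, environment, owner) are the mandatory ones, is a small check that runs before resources are created. The helper name and the required-tag tuple are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "CostCenter", "Owner")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, in declaration order."""
    return [key for key in required if not str(tags.get(key, "")).strip()]


complete = {
    "Environment": "prod",
    "CostCenter": "engineering",
    "Owner": "platform-team",
    "ManagedBy": "pulumi",
}
incomplete = {"Environment": "prod", "Owner": "  "}

complete_result = missing_tags(complete)
incomplete_result = missing_tags(incomplete)
```

Calling `missing_tags(tags)` at the top of a Pulumi program and raising an error on a non-empty result stops untagged resources before they reach the bill.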

Set up monthly billing alerts by tag:

# Store the alert settings as Pulumi stack config; the program reads them when it creates a budget or Cost Anomaly Detection resource
pulumi config set budget_alert_email <alert-email-address>
pulumi config set cost_threshold 5000 # Alert if monthly spend exceeds $5k

Infrastructure Testing and Validation

Test infrastructure before deploying to production.

Unit tests for Pulumi stacks:

# test_infrastructure.py
import json
import subprocess
import unittest


class TestInfrastructure(unittest.TestCase):
    def test_vpc_created(self):
        """Verify the preview plans a VPC with the correct CIDR."""
        # Run Pulumi preview in the project directory and parse its JSON output
        result = subprocess.run(
            ['pulumi', 'preview', '--json', '--non-interactive'],
            capture_output=True,
            text=True,
            check=True
        )

        preview = json.loads(result.stdout)

        # Each planned operation is a step; newState holds the resource type and inputs
        vpc_steps = [s for s in preview.get('steps', [])
                     if s.get('newState', {}).get('type') == 'aws:ec2/vpc:Vpc']

        self.assertGreater(len(vpc_steps), 0, "VPC should be created")

        self.assertEqual(vpc_steps[0]['newState']['inputs']['cidrBlock'], '10.0.0.0/16')


if __name__ == '__main__':
    unittest.main()

Integration tests with moto (AWS mocking):

import boto3
from moto import mock_aws  # moto 5+; earlier versions expose mock_ec2 instead


@mock_aws
def test_instance_creation():
    """Test EC2 instance creation with moto mock."""
    client = boto3.client('ec2', region_name='us-east-1')

    # Simulate infrastructure code
    response = client.run_instances(ImageId='ami-12345678', MinCount=1, MaxCount=1)

    instance_id = response['Instances'][0]['InstanceId']

    # Verify instance is created
    instances = client.describe_instances(InstanceIds=[instance_id])
    assert len(instances['Reservations'][0]['Instances']) == 1

Onboarding New Team Members

As teams grow, provide runbooks and automation for new members.

Setup script for developers:

#!/bin/bash
# setup-dev-env.sh

set -e

echo "Setting up development environment..."

# Install Pulumi CLI (the installer places it in ~/.pulumi/bin)
curl -fsSL https://get.pulumi.com | sh
export PATH="$HOME/.pulumi/bin:$PATH"

# Create a virtual environment and install dependencies
python3 -m venv env
source env/bin/activate
pip install pulumi pulumi-aws boto3

# Authenticate with Pulumi (the CLI reads the token from PULUMI_ACCESS_TOKEN)
read -r -s -p "Enter your Pulumi access token: " PULUMI_ACCESS_TOKEN
echo
export PULUMI_ACCESS_TOKEN
pulumi login

# Set up AWS credentials
aws configure

# Clone infrastructure repo and bootstrap dev stack
git clone https://github.com/myorg/infrastructure.git
cd infrastructure/01-networking   # each numbered folder is its own Pulumi project
pulumi stack select dev --create  # creates the stack if it does not exist yet
pulumi up

echo "Setup complete! Your dev environment is ready."
echo "Run 'pulumi stack select dev && pulumi up' to deploy changes."

Provide documentation: runbooks for common tasks (scaling, failover, secret rotation), architecture diagrams, and on-call guides.

Key Takeaways

  • Decompose infrastructure into modular stacks owned by teams; use stack references to link them.
  • Monitor everything: logs, metrics, alarms. Instrument infrastructure at deploy time.
  • Design for disaster: multi-AZ databases, cross-region failover, and tested recovery procedures.
  • Optimize costs: use reserved instances, auto-scaling, and tagging. Find and terminate idle resources.
  • Test infrastructure: unit tests for Pulumi code, integration tests with moto, and chaos engineering.
  • Onboard team members with automated setup scripts and comprehensive runbooks.

Frequently Asked Questions

How many resources should be in a single Pulumi stack?

No hard limit, but 50-100 resources is a reasonable upper bound. Beyond that, consider splitting into multiple stacks to avoid slow previews and deployments.
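
To see where a stack stands against that guideline, one option is to count the resources in its exported state. This sketch assumes the JSON comes from `pulumi stack export` and lists resources under `deployment.resources`. The function names, the sample data, and the choice to skip the root stack and provider entries are illustrative assumptions.

```python
import json


def count_resources(exported_state):
    """Count resources in exported state, skipping the root stack and provider entries."""
    resources = exported_state.get("deployment", {}).get("resources", [])
    return sum(
        1
        for resource in resources
        if resource.get("type") != "pulumi:pulumi:Stack"
        and not resource.get("type", "").startswith("pulumi:providers:")
    )


def needs_split(exported_state, limit=100):
    """Return True when the stack holds more resources than the chosen limit."""
    return count_resources(exported_state) > limit


# Hand-written sample standing in for the output of `pulumi stack export`
sample = json.loads("""
{
  "version": 3,
  "deployment": {
    "resources": [
      {"type": "pulumi:pulumi:Stack"},
      {"type": "pulumi:providers:aws"},
      {"type": "aws:ec2/vpc:Vpc"},
      {"type": "aws:ec2/subnet:Subnet"}
    ]
  }
}
""")
sample_count = count_resources(sample)
```

In practice the dictionary would come from something like `json.load(open("state.json"))` after running `pulumi stack export --file state.json`.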

How do we handle infrastructure changes with zero downtime?

Use blue-green deployments: deploy new infrastructure alongside old, test it, then switch traffic. Pulumi supports this via stack outputs and custom logic.

Who approves infrastructure changes in production?

Require at least one other engineer to review and approve. Use GitHub pull requests for code review, and CI/CD approval gates for deployments.

What's the best approach for disaster recovery testing?

Run monthly drills: failover to secondary region, restore from backup, or tear down and redeploy from scratch. Time how long recovery takes; target sub-5-minute RTO (Recovery Time Objective).
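
Timing a drill can be automated with a small polling loop. The sketch below measures the time from the start of a drill until a health check passes. The `is_healthy` callable is an assumption you supply (for example, an HTTP probe against the failover endpoint), and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time


def measure_recovery_seconds(is_healthy, timeout=600.0, interval=5.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True.

    Returns the elapsed seconds, or None if the timeout is reached first.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


def meets_rto(recovery_seconds, target_seconds=300.0):
    """Return True when recovery finished within the target (5 minutes by default)."""
    return recovery_seconds is not None and recovery_seconds <= target_seconds
```

Recording the result of every drill gives a trend line, which is more informative than a single pass or fail against the target.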

How do we manage drift (cloud resources changed outside IaC)?

Run pulumi refresh weekly to reconcile state with actual cloud state. Drift usually indicates manual troubleshooting; investigate and reapply via code.
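
A scheduled drift check can be sketched as a thin wrapper around the CLI. This example assumes that `pulumi preview --refresh --expect-no-changes` exits with a non-zero status when the refreshed state differs from the program. The wrapper names are hypothetical, and a non-zero exit can also mean an unrelated failure, so the captured output should be inspected before acting on it.

```python
import subprocess


def drift_check_command(stack):
    """Build the CLI invocation for a drift check that applies no changes."""
    return [
        "pulumi", "preview",
        "--refresh",             # compare against the real cloud state
        "--expect-no-changes",   # fail if any change would be made
        "--non-interactive",
        "--stack", stack,
    ]


def has_drift(stack, run=subprocess.run):
    """Return True when the preview proposes changes (or fails for another reason)."""
    result = run(drift_check_command(stack), capture_output=True, text=True)
    return result.returncode != 0
```

Running `has_drift("prod")` from a weekly CI job and alerting on `True` turns drift detection into a routine signal instead of a surprise during the next deployment.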

Further Reading