
Error Handling and Resilience in Web Scrapers

Scraping the internet is inherently unreliable. Networks fail, servers go down, pages change, and requests time out. A production scraper must handle these failures gracefully, retry intelligently, log problems, and recover without human intervention. This article covers retry logic with exponential backoff, circuit breakers to detect broken sites, comprehensive logging for debugging, and patterns for resuming interrupted scrapes. You will build a scraper that stays operational through transient failures and alerts you to permanent problems. By the end, your scraper will be robust enough for unsupervised operation.

I learned error handling the hard way. Early in my scraping career, a scraper crashed at 99.5% completion due to a single unhandled exception. Since then, I wrap every external call in try/except, implement circuit breakers, and log extensively. These investments have saved me countless hours of debugging.

Comprehensive Try/Except and Exception Handling

Always anticipate and handle exceptions explicitly. Here is a robust pattern:

import logging

import requests
from bs4 import BeautifulSoup

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("scraper.log"),
        logging.StreamHandler(),
    ],
)
logger = logging.getLogger(__name__)

def fetch_and_parse(url):
    """Fetch and parse a URL with comprehensive exception handling."""

    try:
        logger.info(f"Fetching {url}")

        # Timeout after 15 seconds
        response = requests.get(url, timeout=15)

        # Check status code
        if response.status_code != 200:
            logger.warning(f"HTTP {response.status_code} for {url}")
        # Raises HTTPError for 4xx and 5xx responses
        response.raise_for_status()

        # Parse HTML
        soup = BeautifulSoup(response.text, "html.parser")
        logger.info(f"Successfully parsed {url}")
        return soup

    except requests.exceptions.Timeout:
        logger.error(f"Request timed out: {url}")
        return None

    except requests.exceptions.ConnectionError:
        logger.error(f"Connection error: {url}")
        return None

    except requests.exceptions.HTTPError as e:
        # Read the status from the exception's own response object
        logger.error(f"HTTP error {e.response.status_code}: {url}")
        return None

    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {url} - {e}")
        return None

    except Exception as e:
        # logger.exception() also records the traceback
        logger.exception(f"Unexpected error for {url}: {e}")
        return None

# Usage
soup = fetch_and_parse("https://example.com")
if soup is not None:
    # Process soup
    pass
else:
    logger.warning("Could not fetch page")

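Exception handlers, ordered from most to least specific. Python runs the first matching except clause, so the three specific requests exceptions must come before their base class, RequestException:

  1. requests.exceptions.Timeout: the request exceeded the timeout
  2. requests.exceptions.ConnectionError: DNS failure, refused connection, or unreachable network
  3. requests.exceptions.HTTPError: bad status code (4xx, 5xx), raised by raise_for_status()
  4. requests.exceptions.RequestException: base class for all requests errors, including the three above
  5. BeautifulSoup exceptions (less common)
  6. Generic Exception: catch-all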

Retry Logic with Exponential Backoff

Transient errors (timeouts, 503 Service Unavailable) often resolve on a retry. Use exponential backoff, which doubles the wait after each failed attempt:

import logging
import random
import time

import requests

logger = logging.getLogger(__name__)

def fetch_with_retry(url, max_retries=5, base_delay=1):
    """
    Fetch a URL with exponential backoff retries.

    base_delay: Initial delay in seconds (grows as 2^attempt).
    max_retries: Maximum number of attempts.
    """

    for attempt in range(max_retries):
        try:
            logger.info(f"Attempt {attempt + 1}/{max_retries}: {url}")
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            logger.info(f"Success: {url}")
            return response

        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}")

        except requests.exceptions.ConnectionError:
            logger.warning(f"Connection error on attempt {attempt + 1}")

        except requests.exceptions.HTTPError as e:
            status = e.response.status_code
            if status in (503, 504, 429):
                # Temporary server errors; retry
                logger.warning(f"HTTP {status}; will retry")
            else:
                # Permanent errors (404, 403); give up
                logger.error(f"HTTP {status}; giving up")
                return None

        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None

        # Calculate backoff (skipped after the final attempt)
        if attempt < max_retries - 1:
            # Exponential backoff with jitter
            wait_time = (base_delay * (2 ** attempt)) + random.uniform(0, 1)
            logger.info(f"Retrying in {wait_time:.1f} seconds...")
            time.sleep(wait_time)

    logger.error(f"Failed after {max_retries} attempts: {url}")
    return None

# Usage
response = fetch_with_retry("https://example.com", max_retries=3)

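Exponential backoff schedule (base_delay=1):

Failed attempt    Wait before the next attempt (seconds)
1                 1 + jitter
2                 2 + jitter
3                 4 + jitter
4                 8 + jitter
5                 16 + jitter (only when max_retries is 6 or more)

No wait follows the final attempt, so the default max_retries=5 sleeps four times: 1 + 2 + 4 + 8 = 15 seconds, plus jitter.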

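This prevents hammering a slow or recovering server and reduces the risk of getting blocked. The random jitter keeps several scraper processes from retrying at exactly the same moment.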
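
To check a schedule before running a scrape, the waits can be computed without sleeping. The sketch below is a hypothetical helper, not part of fetch_with_retry; it assumes the same formula, base_delay * 2^attempt, and leaves out the jitter so the numbers are deterministic.

```python
def backoff_schedule(max_retries=5, base_delay=1.0):
    """Return the jitter-free waits that follow each failed attempt.

    No wait follows the final attempt, so the list has max_retries - 1 entries.
    """
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


waits = backoff_schedule(max_retries=5, base_delay=1.0)
print(waits)       # [1.0, 2.0, 4.0, 8.0]
print(sum(waits))  # 15.0
```

The sum is the minimum time spent sleeping if every attempt fails; the real total also includes the jitter and the time each request takes before it fails.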

Circuit Breaker Pattern

A circuit breaker stops retrying when a site is persistently broken, preventing wasted requests:

import logging
import time
from enum import Enum

# Show INFO messages on the console when running this example
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Too many failures; reject requests
    HALF_OPEN = "half_open"  # Testing if the service recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        """
        failure_threshold: Number of failures before opening circuit.
        reset_timeout: Seconds to wait before trying again.
        """
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def record_success(self):
        """Record a successful request."""
        if self.state == CircuitState.HALF_OPEN:
            logger.info("Circuit breaker CLOSED (service recovered)")
            self.state = CircuitState.CLOSED

        self.failure_count = 0

    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        # A failure in HALF_OPEN also lands here, because the count
        # is only reset by a success
        if self.failure_count >= self.failure_threshold:
            logger.warning(f"Circuit breaker OPEN (failures: {self.failure_count})")
            self.state = CircuitState.OPEN

    def can_request(self):
        """Check if a request should be attempted."""

        if self.state == CircuitState.CLOSED:
            return True

        elif self.state == CircuitState.OPEN:
            # Check if enough time has passed to retry
            elapsed = time.time() - self.last_failure_time
            if elapsed > self.reset_timeout:
                logger.info("Circuit breaker HALF_OPEN (testing recovery)")
                self.state = CircuitState.HALF_OPEN
                return True
            else:
                return False

        elif self.state == CircuitState.HALF_OPEN:
            return True

    def __str__(self):
        return f"CircuitBreaker(state={self.state.value}, failures={self.failure_count})"

# Usage
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30)

for i in range(10):
    print(f"\nRequest {i + 1}: {breaker}")

    if not breaker.can_request():
        logger.info("Request blocked by circuit breaker")
        time.sleep(5)
        continue

    # Simulate a failing service (first 3 requests fail)
    if i < 3:
        logger.error("Request failed")
        breaker.record_failure()
    else:
        logger.info("Request succeeded")
        breaker.record_success()

Circuit breaker states:

  • CLOSED: Normal operation; all requests go through.
  • OPEN: Too many failures; requests are rejected immediately, without retrying, until reset_timeout has elapsed.
  • HALF_OPEN: Testing whether the service has recovered. A trial request is allowed through: if it succeeds, the circuit closes; if it fails, the circuit reopens.
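
The three states and the moves between them can also be written as a small transition table. This is a minimal sketch for reasoning about the pattern; the event names ("threshold_reached", "timeout_elapsed", "success", "failure") are hypothetical labels chosen for illustration.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Hypothetical event names:
#   "threshold_reached": the failure count hit the failure threshold
#   "timeout_elapsed":   the reset timeout passed while the circuit was open
#   "success"/"failure": outcome of the trial request in HALF_OPEN
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state, event):
    """Return the next state, or the current one if the event changes nothing."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
events = ["threshold_reached", "timeout_elapsed", "failure", "timeout_elapsed", "success"]
for event in events:
    state = next_state(state, event)
    print(f"{event:>17} -> {state.value}")
```

The walk-through opens the circuit, fails one recovery test, and then closes again after a successful trial request.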

Comprehensive Logging

Log at appropriate levels to understand what happened:

import logging
import sys

def setup_logging(log_file="scraper.log", level=logging.INFO):
    """Configure logging to file and console."""

    # Create logger. It must pass DEBUG records on; a logger set to INFO
    # would drop them before the file handler ever saw them.
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.DEBUG)

    # Avoid attaching duplicate handlers if this is called more than once
    if logger.handlers:
        return logger

    # File handler (DEBUG level for full detail)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"
    )
    file_handler.setFormatter(file_format)

    # Console handler (INFO by default; less verbose)
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(level)
    console_format = logging.Formatter("%(levelname)s: %(message)s")
    console_handler.setFormatter(console_format)

    # Add handlers
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)

    return logger

# Usage
logger = setup_logging("scraper.log")

# Log different levels
logger.debug("Debug: detailed information for diagnosis")
logger.info("Info: confirm everything working as expected")
logger.warning("Warning: something unexpected but handled")
logger.error("Error: serious problem, but scraper continues")

# logger.exception() belongs inside an except block, where it adds the traceback
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("Exception: an error was caught and logged with traceback")

Log levels:

Level       Use case
DEBUG       Detailed diagnostic information (write to file only)
INFO        Confirmation that things are working (console + file)
WARNING     Something unexpected but handled (console + file)
ERROR       Serious problem but scraper continues (console + file)
CRITICAL    Unrecoverable error; scraper stops
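
A record has to pass the logger's own level before any handler sees it, and each handler then applies its own level on top. The sketch below shows that filtering with two in-memory streams standing in for the log file and the console; the logger name "level_demo" is a hypothetical choice for this example.

```python
import io
import logging

file_like = io.StringIO()     # stands in for the log file
console_like = io.StringIO()  # stands in for the console

demo_logger = logging.getLogger("level_demo")  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)  # the logger must let DEBUG through
demo_logger.propagate = False        # keep the demo isolated from the root logger

file_handler = logging.StreamHandler(file_like)
file_handler.setLevel(logging.DEBUG)
console_handler = logging.StreamHandler(console_like)
console_handler.setLevel(logging.INFO)

for handler in (file_handler, console_handler):
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    demo_logger.addHandler(handler)

demo_logger.debug("parsed 20 items")
demo_logger.warning("retrying page 3")

print(file_like.getvalue())     # both the DEBUG and the WARNING line
print(console_like.getvalue())  # only the WARNING line
```

If the logger itself were set to INFO, the DEBUG line would reach neither stream, whatever level the file handler used.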

Resuming Interrupted Scrapes

Save progress to resume from the last successful page:

import json
import logging
import os
from datetime import datetime

import requests
from bs4 import BeautifulSoup

class ResumableScraper:
    def __init__(self, start_url, checkpoint_file="checkpoint.json"):
        self.start_url = start_url
        self.checkpoint_file = checkpoint_file
        # Create the logger first; load_checkpoint() uses it
        self.logger = logging.getLogger(__name__)
        self.checkpoint = self.load_checkpoint()

    def load_checkpoint(self):
        """Load the last checkpoint (resume state)."""

        if os.path.exists(self.checkpoint_file):
            try:
                with open(self.checkpoint_file, "r") as f:
                    checkpoint = json.load(f)
                self.logger.info(f"Loaded checkpoint: page {checkpoint.get('last_page')}")
                return checkpoint
            except (OSError, ValueError) as e:
                # ValueError covers malformed JSON (json.JSONDecodeError)
                self.logger.error(f"Could not load checkpoint: {e}")

        return {"last_page": 0, "last_url": self.start_url, "records_scraped": 0}

    def save_checkpoint(self, page, url, records_count):
        """Save progress checkpoint."""

        self.checkpoint = {
            "last_page": page,
            "last_url": url,
            "records_scraped": records_count,
            "timestamp": datetime.now().isoformat()
        }

        with open(self.checkpoint_file, "w") as f:
            json.dump(self.checkpoint, f, indent=2)

        self.logger.info(f"Checkpoint saved: page {page}, {records_count} records")

    def page_url(self, page):
        """Build the URL for a page number (example URL scheme)."""
        if page == 1:
            return self.start_url
        return f"{self.start_url}?page={page}"

    def scrape(self):
        """Scrape with checkpoint resumption."""

        # last_page is the last page scraped successfully, so resume at the next one
        current_page = self.checkpoint.get("last_page", 0) + 1
        total_records = self.checkpoint.get("records_scraped", 0)

        self.logger.info(f"Resuming from page {current_page}")

        while current_page <= 100:  # Example: scrape 100 pages max
            # Derive the URL from the page number so a resumed run
            # does not fetch the last completed page again
            current_url = self.page_url(current_page)
            try:
                # Fetch page
                response = requests.get(current_url, timeout=10)
                response.raise_for_status()

                # Extract and count records
                soup = BeautifulSoup(response.text, "html.parser")
                items = soup.select("div.item")

                total_records += len(items)

                # Save checkpoint every page
                self.save_checkpoint(current_page, current_url, total_records)

                self.logger.info(f"Page {current_page}: {len(items)} items ({total_records} total)")

                current_page += 1

            except requests.exceptions.RequestException as e:
                self.logger.error(f"Error on page {current_page}: {e}")
                self.logger.info("Will resume from last checkpoint on next run")
                break
        else:
            # Runs only when the loop finished without hitting break
            self.logger.info(f"Scraping complete. Total records: {total_records}")

# Usage
scraper = ResumableScraper("https://example.com/search")
scraper.scrape()
# If interrupted, restart and it resumes from last checkpoint
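
A crash in the middle of writing the checkpoint can leave a truncated JSON file behind, which would defeat the resume logic. One common way to guard against that, sketched here as a hypothetical helper named save_checkpoint_atomically, is to write to a temporary file in the same directory and then rename it over the old checkpoint with os.replace, which is atomic on POSIX systems.

```python
import json
import os
import tempfile


def save_checkpoint_atomically(checkpoint, path):
    """Write the checkpoint to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage, in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "checkpoint.json")
    save_checkpoint_atomically({"last_page": 7, "records_scraped": 140}, checkpoint_path)
    with open(checkpoint_path) as f:
        restored = json.load(f)
    print(restored["last_page"])  # 7
```

A reader of the checkpoint file then sees either the previous complete checkpoint or the new complete one, never a half-written file.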

Key Takeaways

  • Catch exceptions by type (Timeout, ConnectionError, HTTPError) and handle each appropriately.
  • Retry transient errors (timeouts, 503) with exponential backoff; give up on permanent errors (404).
  • Use circuit breakers to stop retrying when a site is persistently broken.
  • Log at appropriate levels (DEBUG, INFO, WARNING, ERROR) to file and console.
  • Save checkpoints to resume interrupted scrapes without re-scraping completed pages.

Frequently Asked Questions

How many retries is appropriate?

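3-5 attempts is typical. Each retry doubles the wait time, so with the retry function above, 5 attempts with a 1-second base delay sleep for 1 + 2 + 4 + 8 = 15 seconds in total, plus up to 4 seconds of jitter and the time spent on the requests themselves. Adjust based on your tolerance for wait time and site reliability.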

Should I retry all HTTP errors?

No. Retry 5xx errors (server problems) and 429 (rate limit); the retry example above takes a narrower view and retries only 503, 504, and 429. Don't retry other 4xx errors (client problems): 404 (not found), 403 (forbidden), 401 (unauthorized). These won't change on retry.
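
As a minimal sketch of that rule, the check can live in one small predicate. The name is_retryable is a hypothetical helper; it treats 429 and every 5xx status as worth retrying.

```python
RETRYABLE_STATUS = {429}  # rate limited


def is_retryable(status_code):
    """Return True for 429 and any 5xx status, False otherwise."""
    return status_code in RETRYABLE_STATUS or 500 <= status_code <= 599


for code in (200, 401, 403, 404, 429, 500, 503):
    print(code, is_retryable(code))
```

Keeping the rule in one place makes it easy to tighten later, for example to retry only 503 and 504.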

When should I use a circuit breaker?

Use a circuit breaker when scraping multiple pages. If a site is down, a circuit breaker prevents wasting requests and bandwidth. For single-page scrapes, retry logic alone is sufficient.
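
When one run covers several sites, failures are usually best tracked per host so that one broken site does not block the others. The sketch below is a simplified, hypothetical stand-in for a full breaker: HostFailureTracker only counts consecutive failures per host and has no recovery timeout.

```python
from collections import defaultdict
from urllib.parse import urlparse


class HostFailureTracker:
    """Count consecutive failures per host (hypothetical helper)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = defaultdict(int)

    @staticmethod
    def host_of(url):
        """Return the host part of a URL, e.g. 'example.com'."""
        return urlparse(url).netloc

    def record_failure(self, url):
        self.failures[self.host_of(url)] += 1

    def record_success(self, url):
        self.failures[self.host_of(url)] = 0

    def is_blocked(self, url):
        return self.failures[self.host_of(url)] >= self.failure_threshold


tracker = HostFailureTracker(failure_threshold=3)
for _ in range(3):
    tracker.record_failure("https://down.example.com/page")

print(tracker.is_blocked("https://down.example.com/other"))  # True
print(tracker.is_blocked("https://up.example.com/page"))     # False
```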

How do I handle scrapes that take days to complete?

Checkpoint at a regular interval, for example after every page or every 10-100 records. Save to a file. If the scraper crashes, restart and it resumes from the checkpoint. For very long scrapes, consider running as a scheduled cron job that exits and resumes daily.

What should I log?

Log URLs fetched, item counts per page, errors, retries, and circuit breaker state changes. This gives you visibility into what happened if something goes wrong. Avoid logging sensitive data (passwords, personal info).

Further Reading