Error Handling and Resilience in Web Scrapers
Scraping the internet is inherently unreliable. Networks fail, servers go down, pages change, and requests time out. A production scraper must handle these failures gracefully, retry intelligently, log problems, and recover without human intervention. This article covers retry logic with exponential backoff, circuit breakers to detect broken sites, comprehensive logging for debugging, and patterns for resuming interrupted scrapes. You will build a scraper that stays operational through transient failures and records permanent problems in its logs so you can act on them. By the end, your scraper will be robust enough for unsupervised operation.
I learned error handling the hard way. Early in my scraping career, a scraper crashed at 99.5% completion due to a single unhandled exception. Since then, I wrap every external call in try/except, implement circuit breakers, and log extensively. These investments have saved me countless hours of debugging.
Comprehensive Try-Catch and Exception Handling
Always anticipate and handle exceptions explicitly. Here is a robust pattern:
import logging

import requests
from bs4 import BeautifulSoup

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("scraper.log"),
        logging.StreamHandler(),
    ],
)
logger = logging.getLogger(__name__)


def fetch_and_parse(url):
    """Fetch and parse a URL with comprehensive exception handling."""
    try:
        logger.info(f"Fetching {url}")
        # Give up if the server does not respond within 15 seconds
        response = requests.get(url, timeout=15)
        # Log any non-200 status, then raise HTTPError for 4xx/5xx responses
        if response.status_code != 200:
            logger.warning(f"HTTP {response.status_code} for {url}")
        response.raise_for_status()
        # Parse HTML
        soup = BeautifulSoup(response.text, "html.parser")
        logger.info(f"Successfully parsed {url}")
        return soup
    except requests.exceptions.Timeout:
        logger.error(f"Request timed out: {url}")
        return None
    except requests.exceptions.ConnectionError:
        logger.error(f"Connection error: {url}")
        return None
    except requests.exceptions.HTTPError as e:
        # The failed response is attached to the exception
        logger.error(f"HTTP error {e.response.status_code}: {url}")
        return None
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {url} - {e}")
        return None
    except Exception as e:
        # logger.exception() also records the traceback
        logger.exception(f"Unexpected error for {url}: {e}")
        return None


# Usage
soup = fetch_and_parse("https://example.com")
if soup is not None:
    # Process soup
    pass
else:
    logger.warning("Could not fetch page")
Exception hierarchy (most to least specific):
- requests.exceptions.Timeout: request exceeded timeout
- requests.exceptions.ConnectionError: network unreachable
- requests.exceptions.HTTPError: bad status code (4xx, 5xx)
- requests.exceptions.RequestException: all requests errors
- BeautifulSoup exceptions (less common)
- Generic Exception: catch-all
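Python runs the first `except` clause that matches, so a subclass has to be listed before its parent or its handler never runs. The sketch below checks that rule offline with the standard library's `urllib.error`, where `HTTPError` is a subclass of `URLError`, used here as a stand-in for the `requests` hierarchy. The two `classify` helpers are hypothetical names for illustration only.

```python
from urllib.error import HTTPError, URLError


def classify(exc):
    """Label an exception, testing the most specific class first."""
    try:
        raise exc
    except HTTPError as e:   # subclass: listed before its parent URLError
        return f"http-{e.code}"
    except URLError:
        return "network"
    except Exception:
        return "unexpected"


def classify_wrong_order(exc):
    """Same handlers, but the parent class is listed first."""
    try:
        raise exc
    except URLError:
        return "network"
    except HTTPError as e:   # never reached: URLError already matched
        return f"http-{e.code}"
    except Exception:
        return "unexpected"


not_found = HTTPError("https://example.com/missing", 404, "Not Found", None, None)
print(classify(not_found))
print(classify_wrong_order(not_found))
```

With the correct order the 404 is reported as an HTTP error; with the parent first it is misreported as a generic network failure.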
Retry Logic with Exponential Backoff
Transient errors (timeouts, 503 Service Unavailable) often resolve on retry. Use exponential backoff, which doubles the wait after each failed attempt:
import logging
import random
import time

import requests

logger = logging.getLogger(__name__)


def fetch_with_retry(url, max_retries=5, base_delay=1):
    """
    Fetch a URL, retrying transient failures with exponential backoff.

    base_delay: Initial delay in seconds (grows as base_delay * 2^attempt).
    max_retries: Maximum number of attempts, including the first one.
    """
    for attempt in range(max_retries):
        try:
            logger.info(f"Attempt {attempt + 1}/{max_retries}: {url}")
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            logger.info(f"Success: {url}")
            return response
        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}")
        except requests.exceptions.ConnectionError:
            logger.warning(f"Connection error on attempt {attempt + 1}")
        except requests.exceptions.HTTPError as e:
            status = e.response.status_code
            if status in (503, 504, 429):
                # Temporary server errors and rate limiting; retry
                logger.warning(f"HTTP {status}; will retry")
            else:
                # Permanent errors (404, 403); give up
                logger.error(f"HTTP {status}; giving up")
                return None
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None

        # Wait before the next attempt (no wait after the final one)
        if attempt < max_retries - 1:
            # Exponential backoff plus up to 1 second of random jitter
            wait_time = (base_delay * (2 ** attempt)) + random.uniform(0, 1)
            logger.info(f"Retrying in {wait_time:.1f} seconds...")
            time.sleep(wait_time)

    logger.error(f"Failed after {max_retries} attempts: {url}")
    return None


# Usage
response = fetch_with_retry("https://example.com", max_retries=3)
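Exponential backoff schedule (base_delay=1; jitter is a random 0 to 1 second):
| Failed attempt | Wait before next attempt (seconds) |
|---|---|
| 1 | 1 + jitter |
| 2 | 2 + jitter |
| 3 | 4 + jitter |
| 4 | 8 + jitter |
| 5 | 16 + jitter (only when max_retries is 6 or more; with the default of 5 there is no wait after the final attempt) |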
This prevents hammering a slow/recovering server and avoids getting blocked.
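The same backoff loop can be separated from the HTTP call so that it is easy to test without a network or real delays. The following is a minimal sketch under stated assumptions: `TransientError`, `retry_with_backoff`, and the `max_delay` cap are hypothetical names and additions that do not appear in the code above, and the `sleep` and `jitter` parameters exist only so tests can replace them.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Run call() until it succeeds or max_attempts is exhausted.

    Only TransientError triggers a retry; any other exception propagates
    immediately. After the k-th failure (k = 0, 1, ...) the wait is
    min(max_delay, base_delay * 2**k) plus jitter() seconds.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            delay = min(max_delay, base_delay * (2 ** attempt)) + jitter()
            sleep(delay)
```

In a scraper, `call` would wrap the request and raise `TransientError` for timeouts and retryable status codes. In tests, passing `sleep=recorded.append` and `jitter=lambda: 0.0` makes the schedule deterministic.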
Circuit Breaker Pattern
A circuit breaker stops retrying when a site is persistently broken, preventing wasted requests:
import logging
import time
from enum import Enum

logger = logging.getLogger(__name__)


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Too many failures; reject requests
    HALF_OPEN = "half_open"  # Testing if the service recovered


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        """
        failure_threshold: Consecutive failures before opening the circuit.
        reset_timeout: Seconds to wait before allowing a test request.
        """
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def record_success(self):
        """Record a successful request and reset the failure count."""
        if self.state == CircuitState.HALF_OPEN:
            logger.info("Circuit breaker CLOSED (service recovered)")
        self.state = CircuitState.CLOSED
        self.failure_count = 0

    def record_failure(self):
        """Record a failed request; open the circuit at the threshold."""
        self.failure_count += 1
        # monotonic() is unaffected by system clock adjustments
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            logger.warning(f"Circuit breaker OPEN (failures: {self.failure_count})")
            self.state = CircuitState.OPEN

    def can_request(self):
        """Check if a request should be attempted."""
        if self.state == CircuitState.OPEN:
            # Allow a test request once the reset timeout has elapsed
            elapsed = time.monotonic() - self.last_failure_time
            if elapsed > self.reset_timeout:
                logger.info("Circuit breaker HALF_OPEN (testing recovery)")
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # CLOSED and HALF_OPEN both let the request through
        return True

    def __str__(self):
        return f"CircuitBreaker(state={self.state.value}, failures={self.failure_count})"


# Usage (the demo sleeps while the circuit is open, so it takes about 25 seconds)
logging.basicConfig(level=logging.INFO)  # without this, INFO messages are not shown
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=20)
for i in range(10):
    print(f"\nRequest {i + 1}: {breaker}")
    if not breaker.can_request():
        logger.info("Request blocked by circuit breaker")
        time.sleep(5)
        continue
    # Simulate a failing service (first 3 requests fail)
    if i < 3:
        logger.error("Request failed")
        breaker.record_failure()
    else:
        logger.info("Request succeeded")
        breaker.record_success()
Circuit breaker states:
- CLOSED: Normal operation; all requests go through.
- OPEN: Too many failures; requests rejected immediately (no retry).
- HALF_OPEN: Testing whether the service has recovered; the next request is allowed through as a test. If it succeeds, the circuit closes; if it fails, the circuit reopens.
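The three states can be exercised without waiting for real timeouts by injecting a clock. This is a compact sketch and not the class above: `TinyBreaker`, `FakeClock`, and `walk_states` are hypothetical names, and states are plain strings to keep the example short.

```python
import time


class TinyBreaker:
    """Compact circuit breaker with an injectable clock (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def can_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # let a test request through
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        # A failed test request reopens the circuit immediately
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


class FakeClock:
    """Manually advanced clock so the walkthrough does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def walk_states():
    clock = FakeClock()
    breaker = TinyBreaker(failure_threshold=3, reset_timeout=30.0, clock=clock)
    seen = []
    for _ in range(3):
        breaker.record_failure()
    seen.append(breaker.state)          # threshold reached
    seen.append(breaker.can_request())  # rejected while open
    clock.now += 30.0
    seen.append(breaker.can_request())  # timeout elapsed: test request allowed
    seen.append(breaker.state)
    breaker.record_failure()            # the test request fails
    seen.append(breaker.state)
    clock.now += 30.0
    breaker.can_request()
    breaker.record_success()            # the next test request succeeds
    seen.append(breaker.state)
    return seen


print(walk_states())
```

Injecting the clock is a design choice for testability; production code can keep the default `time.monotonic`.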
Comprehensive Logging
Log at appropriate levels to understand what happened:
import logging
import sys


def setup_logging(log_file="scraper.log", level=logging.INFO):
    """Configure logging to file (full detail) and console (at `level`)."""
    # Create logger
    logger = logging.getLogger("scraper")
    # Let every record through the logger; each handler applies its own level.
    # If the logger itself were set to INFO, DEBUG records would never reach the file.
    logger.setLevel(logging.DEBUG)
    # Avoid attaching duplicate handlers if setup_logging() is called twice
    if logger.handlers:
        return logger

    # File handler (DEBUG level for full detail)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s"
    )
    file_handler.setFormatter(file_format)

    # Console handler (INFO level by default; less verbose)
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(level)
    console_format = logging.Formatter("%(levelname)s: %(message)s")
    console_handler.setFormatter(console_format)

    # Add handlers
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger


# Usage
logger = setup_logging("scraper.log")

# Log different levels
logger.debug("Debug: detailed information for diagnosis")
logger.info("Info: confirm everything working as expected")
logger.warning("Warning: something unexpected but handled")
logger.error("Error: serious problem, but scraper continues")
try:
    1 / 0
except ZeroDivisionError:
    # logger.exception() attaches the traceback, so call it inside an except block
    logger.exception("Exception: an error was caught and logged with traceback")
Log levels:
| Level | Use Case |
|---|---|
| DEBUG | Detailed diagnostic information (write to file only) |
| INFO | Confirmation that things are working (console + file) |
| WARNING | Something unexpected but handled (console + file) |
| ERROR | Serious problem but scraper continues (console + file) |
| CRITICAL | Unrecoverable error; scraper stops |
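The routing in the table depends on two separate filters: the logger's own level and each handler's level. The sketch below makes this visible by writing to two in-memory streams instead of a real file and terminal; `build_logger` and the logger name `level_demo` are hypothetical.

```python
import io
import logging


def build_logger(name, file_stream, console_stream):
    """Attach a DEBUG-level and an INFO-level handler to a fresh logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # the logger itself must not filter DEBUG
    logger.propagate = False        # keep demo output away from the root logger
    logger.handlers.clear()
    for stream, level in ((file_stream, logging.DEBUG), (console_stream, logging.INFO)):
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


file_like = io.StringIO()     # stands in for scraper.log
console_like = io.StringIO()  # stands in for the terminal
log = build_logger("level_demo", file_like, console_like)
log.debug("parsed 20 items")
log.info("page 3 done")
log.error("page 4 failed")

print("file:", file_like.getvalue().splitlines())
print("console:", console_like.getvalue().splitlines())
```

The DEBUG record reaches only the file-like stream, while INFO and ERROR reach both.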
Resuming Interrupted Scrapes
Save progress to resume from the last successful page:
import json
import logging
import os
from datetime import datetime

import requests
from bs4 import BeautifulSoup


class ResumableScraper:
    def __init__(self, start_url, checkpoint_file="checkpoint.json"):
        self.start_url = start_url
        self.checkpoint_file = checkpoint_file
        # Create the logger first: load_checkpoint() uses it
        self.logger = logging.getLogger(__name__)
        self.checkpoint = self.load_checkpoint()

    def load_checkpoint(self):
        """Load the last checkpoint (resume state)."""
        if os.path.exists(self.checkpoint_file):
            try:
                with open(self.checkpoint_file, "r") as f:
                    checkpoint = json.load(f)
                self.logger.info(f"Loaded checkpoint: page {checkpoint.get('last_page')}")
                return checkpoint
            except (OSError, ValueError) as e:
                # Unreadable or corrupt file: fall through and start from scratch
                self.logger.error(f"Could not load checkpoint: {e}")
        return {"last_page": 0, "last_url": self.start_url, "records_scraped": 0}

    def save_checkpoint(self, page, url, records_count):
        """Save progress checkpoint."""
        self.checkpoint = {
            "last_page": page,
            "last_url": url,
            "records_scraped": records_count,
            "timestamp": datetime.now().isoformat(),
        }
        with open(self.checkpoint_file, "w") as f:
            json.dump(self.checkpoint, f, indent=2)
        self.logger.info(f"Checkpoint saved: page {page}, {records_count} records")

    def page_url(self, page):
        """Build the URL for a page number (example scheme: ?page=N)."""
        if page == 1:
            return self.start_url
        return f"{self.start_url}?page={page}"

    def scrape(self):
        """Scrape with checkpoint resumption."""
        # Resume at the page after the last one that completed
        current_page = self.checkpoint.get("last_page", 0) + 1
        total_records = self.checkpoint.get("records_scraped", 0)
        self.logger.info(f"Resuming from page {current_page}")
        while current_page <= 100:  # Example: scrape 100 pages max
            # Derive the URL from the page number so a resumed run does not
            # re-fetch (and double-count) the last completed page
            current_url = self.page_url(current_page)
            try:
                # Fetch page
                response = requests.get(current_url, timeout=10)
                response.raise_for_status()
                # Extract and count records
                soup = BeautifulSoup(response.text, "html.parser")
                items = soup.select("div.item")
                total_records += len(items)
                # Save checkpoint every page
                self.save_checkpoint(current_page, current_url, total_records)
                self.logger.info(f"Page {current_page}: {len(items)} items ({total_records} total)")
                current_page += 1
            except requests.exceptions.RequestException as e:
                self.logger.error(f"Error on page {current_page}: {e}")
                self.logger.info("Will resume from last checkpoint on next run")
                break
        else:
            # Runs only when the loop finishes without hitting `break`
            self.logger.info(f"Scraping complete. Total records: {total_records}")


# Usage
scraper = ResumableScraper("https://example.com/search")
scraper.scrape()
# If interrupted, restart and it resumes from last checkpoint
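A crash in the middle of writing the checkpoint can leave a truncated JSON file, which would defeat the purpose of checkpointing. One common remedy, sketched here as an assumption and not as part of the class above, is to write to a temporary file and rename it into place with `os.replace`. The helper names `save_checkpoint_atomic` and `load_checkpoint` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path, data):
    """Write data as JSON to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, hence the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        # Clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved checkpoint, or default if it is missing or unreadable."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except (OSError, ValueError):
        return default
```

If the process dies before `os.replace`, the previous checkpoint is still intact and the next run resumes from it.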
Key Takeaways
- Catch exceptions by type (Timeout, ConnectionError, HTTPError) and handle each appropriately.
- Retry transient errors (timeouts, 503) with exponential backoff; give up on permanent errors (404).
- Use circuit breakers to stop retrying when a site is persistently broken.
- Log at appropriate levels (DEBUG, INFO, WARNING, ERROR) to file and console.
- Save checkpoints to resume interrupted scrapes without re-scraping completed pages.
Frequently Asked Questions
How many retries is appropriate?
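3-5 attempts is typical. Each retry doubles the wait time, so with max_retries=5 and a 1-second base delay, fetch_with_retry waits 1 + 2 + 4 + 8 = 15 seconds in total across its four retries, plus up to 1 second of jitter per wait and the time spent on the requests themselves. Adjust based on your tolerance for wait time and site reliability.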
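The total wait for any setting can be computed directly. This is a small sketch assuming the same schedule as `fetch_with_retry` (the delay doubles after each failed attempt and there is no wait after the final one), with jitter ignored; `backoff_waits` is a hypothetical helper name.

```python
def backoff_waits(max_attempts, base_delay=1.0):
    """Return the base wait (without jitter) after each failed attempt.

    There is no wait after the final attempt, so the list has
    max_attempts - 1 entries.
    """
    return [base_delay * (2 ** attempt) for attempt in range(max_attempts - 1)]


waits = backoff_waits(5)
print(waits, "total:", sum(waits))
```

Adding one more attempt roughly doubles the worst-case total, which is why the attempt count is worth choosing deliberately.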
Should I retry all HTTP errors?
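No. Retry 5xx errors (server problems) and 429 (rate limit). Don't retry other 4xx errors (client problems) such as 404 (not found), 403 (forbidden), and 401 (unauthorized); these won't change on retry. Note that the fetch_with_retry example above retries only 503, 504, and 429, so extend its list if you also want to retry other 5xx codes such as 500 or 502.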
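That rule is small enough to keep in one function, which makes the retry policy easy to change in a single place. This is a minimal sketch; `should_retry` is a hypothetical helper, and a stricter policy could list individual codes instead of the whole 5xx range.

```python
def should_retry(status_code):
    """Return True if an HTTP status suggests the request may succeed later."""
    if status_code == 429:  # rate limited: back off and try again
        return True
    return 500 <= status_code <= 599  # server-side errors


for code in (200, 401, 403, 404, 429, 500, 503):
    print(code, should_retry(code))
```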
When should I use a circuit breaker?
Use a circuit breaker when scraping multiple pages. If a site is down, a circuit breaker prevents wasting requests and bandwidth. For single-page scrapes, retry logic alone is sufficient.
How do I handle scrapes that take days to complete?
Checkpoint at a regular interval, for example every N pages or every 10-100 records, and save the checkpoint to a file. If the scraper crashes, restart it and it resumes from the checkpoint. For very long scrapes, consider running the scraper as a scheduled cron job that exits and resumes daily.
What should I log?
Log URLs fetched, item counts per page, errors, retries, and circuit breaker state changes. This gives you visibility into what happened if something goes wrong. Avoid logging sensitive data (passwords, personal info).
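URLs themselves can carry secrets in query strings or embedded credentials, so it can help to scrub them before logging. The sketch below uses only `urllib.parse`; the parameter names in `SENSITIVE_PARAMS` and the helper `redact_url` are hypothetical and should be adapted to the sites you scrape.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_PARAMS = {"token", "api_key", "password"}  # hypothetical names


def redact_url(url, sensitive=SENSITIVE_PARAMS):
    """Mask sensitive query values and drop any user:password@ prefix."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in sensitive else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # Rebuild the host without credentials (IPv6 literals are not handled here)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), parts.fragment))


print(redact_url("https://example.com/search?q=shoes&token=abc123"))
```

Calling `logger.info(f"Fetching {redact_url(url)}")` keeps the log useful for debugging without recording the secret values.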