Rate Limiting and Respectful Scraping: Avoid Blocks

Web scraping at scale attracts detection. Servers monitor for patterns: too many requests per second, identical User-Agent headers, requests without referers, or requests from known datacenter IPs. Getting blocked (HTTP 429, IP ban, or CAPTCHA challenge) kills your scraper. This article teaches you rate limiting, header rotation, connection pooling, and proxy strategies that keep you operational without harming the target site. You will build a scraper that behaves like a patient human rather than a relentless bot. The goal is sustainable extraction that respects server resources while achieving your objectives.

I once scraped 10,000 pages in 2 hours and got my home IP banned for a week. Since then, I implement rate limiting religiously: 1-3 second delays, rotating headers, and proxy services when needed. Those practices have kept my scrapers alive for years.

Rate Limiting: The Foundation of Respectful Scraping

Rate limiting means spacing out requests so the server can handle them without overload. Start with simple delays, then optimize:

import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime, timedelta
import random

class RateLimitedScraper:
    def __init__(self, delay_seconds=2, jitter=0.5):
        """
        delay_seconds: Base delay between requests.
        jitter: Random variation (0-1) to appear more human-like.
        """
        self.delay_seconds = delay_seconds
        self.jitter = jitter
        self.last_request_time = None
        self.session = requests.Session()

    def get(self, url):
        """Fetch a URL with rate limiting."""

        # Enforce delay since last request
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            actual_delay = self.delay_seconds + random.uniform(0, self.jitter)
            sleep_time = actual_delay - elapsed

            if sleep_time > 0:
                print(f"Sleeping {sleep_time:.2f}s to respect rate limit")
                time.sleep(sleep_time)

        self.last_request_time = time.time()

        print(f"[{datetime.now().isoformat()}] GET {url}")
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return response

# Usage
scraper = RateLimitedScraper(delay_seconds=2, jitter=0.5)

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]

for url in urls:
    response = scraper.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract data...
    print(f" Status: {response.status_code}")

Key points:

  • Delay: 1-3 seconds between requests is respectful. 0.5 seconds is aggressive.
  • Jitter: Random variation (0-1 seconds) makes your pattern less detectable.
  • Track last request time: Sleep only for whatever part of the delay has not already elapsed, so time spent parsing the previous page counts toward the wait.
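When a crawl touches more than one site, the same idea can be applied per host so that a slow site does not throttle requests to a different one. The following is a minimal sketch under that assumption, not part of the scraper above; the class name PerHostRateLimiter and the injectable clock and sleep parameters are hypothetical and exist only to make the logic easy to test.

```python
import random
import time
from urllib.parse import urlsplit


class PerHostRateLimiter:
    """Enforce a minimum, jittered gap between requests to the same host."""

    def __init__(self, delay_seconds=2.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self.jitter = jitter
        self._clock = clock        # injectable so tests do not have to wait
        self._sleep = sleep
        self._last_request = {}    # host -> time the last request was released

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds slept."""
        host = urlsplit(url).netloc
        slept = 0.0
        last = self._last_request.get(host)
        if last is not None:
            target_gap = self.delay_seconds + random.uniform(0, self.jitter)
            remaining = target_gap - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request[host] = self._clock()
        return slept
```

Call limiter.wait(url) immediately before each request; the first request to a host goes out at once, and later requests to that host wait for the remaining part of the delay.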

Rotating User-Agent Headers

Many sites block requests with missing or suspicious User-Agent headers. Rotating headers makes your scraper less detectable:

import requests
from bs4 import BeautifulSoup
import random

# Pool of realistic User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
]

class RotatingHeaderScraper:
    def __init__(self):
        self.session = requests.Session()

    def get_headers(self):
        """Return a header dict with a random User-Agent."""
        return {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1"
        }

    def get(self, url):
        """Fetch with rotating headers."""
        headers = self.get_headers()
        print(f"User-Agent: {headers['User-Agent'][:50]}...")

        response = self.session.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response

# Usage
scraper = RotatingHeaderScraper()

for i in range(5):
    response = scraper.get("https://example.com")
    print(f"Request {i+1}: Status {response.status_code}\n")

Best practices for headers:

  • Include User-Agent, Accept, Accept-Language, and Referer (the example above omits Referer; add it whenever a browser would have arrived from another page).
  • Rotate User-Agent between real browser versions.
  • Never send identical headers for every request (telltale sign of a bot).
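The Referer recommendation can be sketched as follows. This is an illustrative helper, not part of the scraper above: the function names build_headers and headers_for_crawl are hypothetical, and the sketch assumes pages are visited in order so that each page plausibly links to the next.

```python
import random

# Two of the User-Agent strings from the pool above, repeated so this runs alone
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def build_headers(previous_url=None):
    """Return browser-like headers, adding Referer when a previous page is known."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }
    if previous_url:
        headers["Referer"] = previous_url
    return headers


def headers_for_crawl(urls):
    """Yield (url, headers) pairs, using each URL as the Referer for the next."""
    previous_url = None
    for url in urls:
        yield url, build_headers(previous_url)
        previous_url = url
```

Each yielded headers dict can be passed as the headers argument of session.get. The first request carries no Referer, as it would when a user types the address directly.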

Handling Rate Limit Responses (429 and Backoff)

When you hit a rate limit, the server responds with HTTP 429. Back off exponentially:

import requests
from bs4 import BeautifulSoup
import time
import random

def fetch_with_backoff(url, max_retries=5):
    """Fetch a URL with exponential backoff on 429 errors."""

    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            if response.status_code == 200:
                return response

            elif response.status_code == 429:
                # Rate limited; back off exponentially
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited (429). Waiting {wait_time:.1f}s before retry...")
                time.sleep(wait_time)
                continue

            elif response.status_code == 503:
                # Service unavailable; retry
                wait_time = 5 * (attempt + 1)
                print(f"Service unavailable (503). Waiting {wait_time}s...")
                time.sleep(wait_time)
                continue

            else:
                # Raises HTTPError for any other 4xx/5xx response
                response.raise_for_status()
                # Other non-error statuses (e.g. 204) are returned rather than re-requested
                return response

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt)
                print(f"Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

    raise Exception(f"Failed after {max_retries} attempts")

# Usage
response = fetch_with_backoff("https://example.com/api/data")
print(f"Success: {response.status_code}")

Exponential backoff (2^attempt) is standard: 1s, 2s, 4s, 8s, 16s. Add jitter to avoid thundering herd problems.
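Servers that return 429 or 503 may also send a Retry-After header, either as a number of seconds or as an HTTP date. fetch_with_backoff does not read it, so the sketch below shows one way it could be taken into account. The helper names retry_after_seconds and backoff_delay and the 60-second cap are assumptions made for illustration.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value (seconds or HTTP date) into seconds, or None."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: fall back to our own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, cap=60.0):
    """Use the server's hint if present; otherwise 2**attempt plus jitter, capped."""
    if retry_after is not None:
        return retry_after  # the server asked for this wait, so honor it in full
    return min(cap, (2 ** attempt) + random.uniform(0, 1))
```

Inside the 429 branch, the wait could then be computed as backoff_delay(attempt, retry_after_seconds(response.headers.get("Retry-After"))). The cap applies only to the scraper's own exponential schedule, so a long crawl never sleeps for hours because of a high attempt count.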

Using Rotating Proxies

For large-scale scraping, rotate your IP address using proxies:

import requests
from bs4 import BeautifulSoup
import random

# Placeholder proxy list (WARNING: free proxies are often unreliable and slow)
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# Better: use a paid service like Bright Data, Oxylabs, or ScraperAPI
# Example with ScraperAPI (handles rotation for you):
SCRAPER_API_KEY = "your_api_key_here"

class ProxyRotatingScraper:
    def __init__(self, proxies=None):
        self.proxies = proxies or PROXIES

    def get_proxy_dict(self):
        """Return a proxy dict with a random proxy."""
        proxy = random.choice(self.proxies)
        return {
            "http": proxy,
            "https": proxy
        }

    def get(self, url, max_attempts=3):
        """Fetch with a rotating proxy, trying at most max_attempts proxies."""
        for attempt in range(max_attempts):
            proxy_dict = self.get_proxy_dict()
            print(f"Using proxy: {proxy_dict['http']}")

            try:
                response = requests.get(
                    url,
                    proxies=proxy_dict,
                    timeout=10
                )
                response.raise_for_status()
                return response
            except requests.exceptions.ProxyError:
                # Loop again with another random proxy instead of recursing forever
                print("Proxy failed. Trying another...")

        raise RuntimeError(f"All {max_attempts} proxy attempts failed for {url}")

# Example with ScraperAPI (recommended for beginners)
class ScraperAPIScraper:
    def __init__(self, api_key):
        self.api_key = api_key

    def get(self, url):
        """Fetch via ScraperAPI (handles proxies, headers, JS rendering)."""
        payload = {
            "api_key": self.api_key,
            "url": url
        }
        response = requests.get("http://api.scraperapi.com", params=payload)
        response.raise_for_status()
        return response

# Usage with free proxies (not recommended)
# scraper = ProxyRotatingScraper()
# response = scraper.get("https://example.com")

# Usage with ScraperAPI (recommended)
# scraper = ScraperAPIScraper("your_api_key")
# response = scraper.get("https://example.com")

Proxy considerations:

  • Free proxies: Slow, unreliable, often malicious. Avoid for production.
  • Paid services (Bright Data, Oxylabs, ScraperAPI): Fast, reliable, handle rotation for you. $50-500/month depending on volume.
  • Residential proxies: Real home IPs; slower but very hard to detect.
  • Datacenter proxies: Fast but easy to detect and block.
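Because unreliable proxies tend to fail repeatedly, picking one at random on every request, as ProxyRotatingScraper does, keeps returning to dead entries. One possible refinement is to count failures and retire proxies that keep failing. The ProxyPool class below is a hypothetical sketch of that idea, and the threshold of three failures is an arbitrary assumption.

```python
import random


class ProxyPool:
    """Track proxy failures and stop handing out proxies that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self.max_failures = max_failures
        self._failures = {proxy: 0 for proxy in proxies}

    def available(self):
        """Return the proxies that are still below the failure threshold."""
        return [p for p, n in self._failures.items() if n < self.max_failures]

    def pick(self):
        """Return a random healthy proxy, or raise if none are left."""
        candidates = self.available()
        if not candidates:
            raise RuntimeError("No healthy proxies left")
        return random.choice(candidates)

    def report_failure(self, proxy):
        self._failures[proxy] += 1

    def report_success(self, proxy):
        self._failures[proxy] = 0  # a working proxy gets a clean slate

    @staticmethod
    def as_requests_proxies(proxy):
        """Build the dict shape that the requests proxies argument expects."""
        return {"http": proxy, "https": proxy}
```

In a fetch loop, call pick() before the request, report_failure() in the ProxyError handler, and report_success() after a good response.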

Combining All Strategies: A Robust Scraper

Here is a complete scraper that combines rate limiting, header rotation, backoff, and optional proxy support:

import requests
from bs4 import BeautifulSoup
import time
import random
from datetime import datetime

class RobustScraper:
    def __init__(self, delay_seconds=2, use_proxy=False, proxy_url=None):
        self.delay_seconds = delay_seconds
        self.use_proxy = use_proxy
        self.proxy_url = proxy_url
        self.session = requests.Session()
        self.last_request_time = None

        # Full User-Agent strings; truncated ones are easy to flag as fake
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        ]

    def get_headers(self):
        """Return headers with a random User-Agent."""
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Connection": "keep-alive"
        }

    def get_proxies(self):
        """Return proxy dict if configured."""
        if self.use_proxy and self.proxy_url:
            return {
                "http": self.proxy_url,
                "https": self.proxy_url
            }
        return None

    def fetch(self, url, max_retries=3):
        """Fetch with rate limiting, header rotation, and backoff."""

        # Rate limiting
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            delay = self.delay_seconds + random.uniform(0, 0.5)
            sleep_time = delay - elapsed
            if sleep_time > 0:
                time.sleep(sleep_time)

        self.last_request_time = time.time()

        # Retry loop with exponential backoff
        for attempt in range(max_retries):
            try:
                print(f"[{datetime.now().isoformat()}] Fetching {url} (attempt {attempt+1})")

                response = self.session.get(
                    url,
                    headers=self.get_headers(),
                    proxies=self.get_proxies(),
                    timeout=10
                )

                if response.status_code == 200:
                    return response

                elif response.status_code == 429:
                    wait = (2 ** attempt) + random.uniform(0, 1)
                    print(f" Rate limited. Waiting {wait:.1f}s...")
                    time.sleep(wait)
                    continue

                else:
                    response.raise_for_status()

            except requests.exceptions.RequestException as e:
                if attempt < max_retries - 1:
                    wait = 2 ** attempt
                    print(f" Error: {e}. Retrying in {wait}s...")
                    time.sleep(wait)
                else:
                    raise

        # Still rate limited after every retry
        return None

# Usage
scraper = RobustScraper(delay_seconds=2)

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3"
]

for url in urls:
    response = scraper.fetch(url)
    if response:
        soup = BeautifulSoup(response.text, "html.parser")
        print(f" Extracted data from {url}")

Key Takeaways

  • Rate limiting (1-3 second delays) is the foundation of ethical scraping and helps avoid IP bans.
  • Rotate User-Agent headers and include realistic headers to appear human-like.
  • Implement exponential backoff for 429 and 503 errors to handle rate limits gracefully.
  • Use rotating proxies for large-scale scraping; free proxies are unreliable; paid services are worth it.
  • Combine rate limiting, header rotation, backoff, and proxies into a single robust scraper class.

Frequently Asked Questions

How do I know what delay is appropriate?

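Check the site's robots.txt and Terms of Service first; some sites publish a Crawl-delay value in robots.txt. A safe default is 2-5 seconds. If you get 429 errors, increase the delay. If the site clearly handles your traffic without strain, you can reduce the delay toward 0.5-1 second, though that is the aggressive end of the range.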
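The robots.txt check can be automated with Python's standard urllib.robotparser module. The sketch below parses robots.txt text that has already been downloaded; the function name crawl_policy, the user agent string "my-scraper", and the 2-second fallback are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def crawl_policy(robots_txt, user_agent, url, default_delay=2.0):
    """Return (allowed, delay_seconds) for url according to robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay is declared
    return allowed, (float(delay) if delay is not None else default_delay)


# Example robots.txt content, inlined so the sketch runs without network access
robots = """User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

allowed, delay = crawl_policy(robots, "my-scraper", "https://example.com/page/1")
print(allowed, delay)
```

The returned delay can be passed straight to RateLimitedScraper(delay_seconds=delay), and URLs for which allowed is False should be skipped.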

Will adding jitter really help avoid detection?

Yes. Bots often make requests at exact intervals (every 2.0 seconds); humans vary (1.8, 2.3, 2.1 seconds). Adding small random jitter makes your pattern less detectable.

Should I use free proxies or paid?

Paid proxies are reliable and worth the cost for production scraping. Free proxies fail often (expect roughly half of requests to succeed) and are extremely slow. For testing, free proxies are acceptable; for production, pay.

What happens if I get a CAPTCHA?

A browser automation tool such as Playwright can interact with the page, but it cannot solve a CAPTCHA by itself. Solving one requires a human solving service (e.g., 2Captcha, Death By Captcha) or a bypass, which is complex and usually against the site's terms of service. Best practice: slow down, space out requests, and avoid triggering CAPTCHAs through respectful behavior.

Can I use the same session for multiple URLs to reduce overhead?

Yes. requests.Session() reuses connections and cookies, improving speed. Share a session across requests, but still respect rate limiting.
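The size of the connection pool behind a session can be set explicitly by mounting a requests HTTPAdapter. This is a minimal sketch; the function name make_pooled_session and the pool sizes are illustrative assumptions, and for a single-threaded scraper the defaults are usually sufficient.

```python
import requests
from requests.adapters import HTTPAdapter


def make_pooled_session(pool_connections=10, pool_maxsize=10):
    """Return a Session whose HTTP and HTTPS requests share a sized connection pool."""
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=pool_connections,  # number of per-host pools to cache
        pool_maxsize=pool_maxsize,          # connections kept open per host
    )
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A session built this way can be assigned to self.session in any of the scraper classes above; the rate limiting logic is unaffected.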

Further Reading