Web Scraping Pagination: Extract Multi-Page Data
Real-world scraping projects rarely end at a single page. Most websites split data across multiple pages using pagination: "next" links, numbered page parameters, or infinite-scroll loading. Building a scraper that reliably traverses pagination is a critical skill. This article covers the pagination patterns you are most likely to encounter, teaches you how to avoid infinite loops and duplicate data, and shows you how to track scraper state (which pages you have visited) so that interrupted jobs can be resumed. You will build a pagination engine that handles edge cases like broken links, empty pages, and dynamic page counts.
I once spent 6 hours hunting a bug where my scraper was extracting 400,000 duplicate records. The culprit: I was not tracking visited pages, so when the site's pagination loop linked back to page 1, my scraper happily re-scraped everything. Implementing a visited-pages set reduced duplicates to zero. Pagination logic is simple, but the devil lives in the details.
Understanding Common Pagination Patterns
Websites use distinct pagination approaches. Here are the patterns you will encounter:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs

def scrape_pattern_1_next_link():
    """Pattern 1: Pagination via 'Next' link (common on blogs)."""
    visited_pages = set()
    current_url = "https://example.com/blog"
    all_data = []
    while current_url and current_url not in visited_pages:
        print(f"Scraping {current_url}")
        visited_pages.add(current_url)
        response = requests.get(current_url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract articles from current page
        for article in soup.select("article.post"):
            title = article.select_one("h2").get_text(strip=True)
            url = article.select_one("a").get("href")
            all_data.append({"title": title, "url": url})
        # Find the next page link
        next_link = soup.select_one("a.next-page")
        if next_link:
            # Make the URL absolute (handle relative links)
            current_url = urljoin(current_url, next_link.get("href"))
        else:
            current_url = None
    return all_data

def scrape_pattern_2_numbered_pages():
    """Pattern 2: Pagination via numbered pages (page=1, page=2, etc.)."""
    base_url = "https://example.com/search"
    all_data = []
    # Attempt pages 1-50; stop if a page returns no results
    for page_num in range(1, 51):
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}")
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract results
        items = soup.select("div.result-item")
        if not items:
            print(f"Page {page_num} is empty. Stopping.")
            break
        for item in items:
            title = item.select_one("h3").get_text(strip=True)
            all_data.append({"title": title})
        print(f"  Found {len(items)} items on page {page_num}")
    return all_data

def scrape_pattern_3_offset_limit():
    """Pattern 3: Offset/limit pagination (offset=0&limit=20)."""
    base_url = "https://api.example.com/products"
    all_data = []
    offset = 0
    limit = 20
    while True:
        url = f"{base_url}?offset={offset}&limit={limit}"
        print(f"Fetching offset {offset}")
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select("div.product")
        if not items:
            break
        for item in items:
            title = item.select_one("h2").get_text(strip=True)
            all_data.append({"title": title})
        # Increment offset for next iteration
        offset += limit
    return all_data

# Test the patterns (modify URLs to real sites)
print("Pattern 1: Next Link")
print("Pattern 2: Numbered Pages")
print("Pattern 3: Offset/Limit")
Each pattern requires different logic to detect when pagination ends:
| Pattern | Detection | Example URL |
|---|---|---|
| Next Link | Link absent or points back to first page | /blog?page=2 |
| Numbered Pages | Empty page or 404 status | /products?page=51 |
| Offset/Limit | Empty results list | /api/items?offset=100&limit=20 |
| Infinite Scroll | Background JSON response returns no more items | JSON API endpoint called by the page's JavaScript |
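For the infinite-scroll row, the end-of-data signal usually lives in the JSON that the page requests in the background. The sketch below shows one way to follow such an endpoint. The response shape (`items`, `next_cursor`) and the `fetch_json` callable are assumptions made for illustration; a real site will use its own field names, which you can find in the browser's network tab.

```python
from typing import Callable, Optional


def collect_cursor_pages(
    fetch_json: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> list:
    """Follow a cursor-based JSON API until it stops returning a cursor.

    fetch_json(cursor) is a hypothetical caller-supplied function returning
    a dict shaped like {"items": [...], "next_cursor": "abc" or None}.
    """
    items = []
    seen_cursors = set()
    cursor = None  # None requests the first batch

    for _ in range(max_pages):
        payload = fetch_json(cursor)
        batch = payload.get("items", [])
        if not batch:
            break  # an empty batch means there is nothing left
        items.extend(batch)

        cursor = payload.get("next_cursor")
        if cursor is None or cursor in seen_cursors:
            break  # no cursor, or a repeated one that would loop forever
        seen_cursors.add(cursor)

    return items


# Offline demonstration with a fake three-batch API.
FAKE_API = {
    None: {"items": [1, 2], "next_cursor": "a"},
    "a": {"items": [3, 4], "next_cursor": "b"},
    "b": {"items": [5], "next_cursor": None},
}
print(collect_cursor_pages(lambda cursor: FAKE_API[cursor]))  # [1, 2, 3, 4, 5]
```

Passing the fetch step in as a function keeps the loop testable offline; in a real scraper it could wrap a `requests` call that returns `response.json()`.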
Building a Robust Pagination Scraper
Here is a fuller pagination scraper with deduplication, error handling, and logging:
import requests
from bs4 import BeautifulSoup
import time
import json
from urllib.parse import urljoin
from datetime import datetime

class PaginationScraper:
    def __init__(self, start_url, max_pages=None, delay_seconds=1):
        self.start_url = start_url
        self.max_pages = max_pages
        self.delay_seconds = delay_seconds
        self.visited_urls = set()
        self.seen_item_urls = set()  # absolute item URLs already collected
        self.data = []
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def fetch_page(self, url):
        """Fetch a page with error handling."""
        try:
            print(f"[{datetime.now().isoformat()}] Fetching {url}")
            response = self.session.get(url, timeout=15)
            response.raise_for_status()
            return response
        except requests.exceptions.Timeout:
            print(f"  [TIMEOUT] {url}")
            return None
        except requests.exceptions.RequestException as e:
            print(f"  [ERROR] {url}: {e}")
            return None

    def extract_data(self, soup):
        """Extract articles from the page."""
        items = soup.select("article.post")
        page_data = []
        for item in items:
            try:
                title = item.select_one("h2").get_text(strip=True)
                link = item.select_one("a").get("href")
                # Resolve first, then deduplicate on the absolute URL.
                # Comparing the raw (possibly relative) href against the
                # stored absolute URLs would never match.
                url = urljoin(self.start_url, link)
                if url not in self.seen_item_urls:
                    self.seen_item_urls.add(url)
                    page_data.append({
                        "title": title,
                        "url": url,
                        "scraped_at": datetime.now().isoformat()
                    })
            except AttributeError:
                # Missing elements; skip this item
                continue
        return page_data

    def find_next_page(self, soup, current_url):
        """Find the next page URL."""
        next_link = soup.select_one("a.pagination-next")
        if next_link:
            next_url = urljoin(current_url, next_link.get("href"))
            # Avoid loops: don't revisit the same page
            if next_url not in self.visited_urls:
                return next_url
        return None

    def scrape(self):
        """Main pagination loop."""
        current_url = self.start_url
        pages_scraped = 0
        while current_url and (not self.max_pages or pages_scraped < self.max_pages):
            # Skip if already visited
            if current_url in self.visited_urls:
                print(f"[DUPLICATE] Already visited {current_url}. Stopping.")
                break
            self.visited_urls.add(current_url)
            # Fetch and parse
            response = self.fetch_page(current_url)
            if not response:
                print(f"[SKIP] Could not fetch {current_url}")
                break
            soup = BeautifulSoup(response.text, "html.parser")
            # Extract data
            page_data = self.extract_data(soup)
            self.data.extend(page_data)
            pages_scraped += 1
            print(f"  Extracted {len(page_data)} items from page {pages_scraped}")
            # Find next page
            current_url = self.find_next_page(soup, current_url)
            # Delay to avoid hammering the server
            if current_url:
                time.sleep(self.delay_seconds)
        return self.data

# Usage
scraper = PaginationScraper(
    start_url="https://example.com/blog",
    max_pages=10,
    delay_seconds=2
)
results = scraper.scrape()
print(f"\nTotal items scraped: {len(results)}")
print(f"Total pages visited: {len(scraper.visited_urls)}")

# Save to JSON
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
Key patterns in this scraper:
- Visited tracking: prevents infinite loops and duplicate extraction.
- Deduplication: checks URLs in extracted data before adding.
- Error handling: stops cleanly when a request fails and skips items with missing elements.
- Delay: respectful spacing between requests.
- Logging: timestamps and status messages for debugging.
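The visited tracking above lives only in memory, so an interrupted job starts over from the first page. One possible way to make it resumable is to write the visited set and the next URL to disk after each page. The sketch below is a minimal illustration; the function names and the JSON layout are assumptions, not part of the scraper above.

```python
import json
import os
import tempfile


def save_checkpoint(path, visited_urls, next_url):
    """Write scraper state atomically so a crash cannot leave a half-written file."""
    state = {"visited_urls": sorted(visited_urls), "next_url": next_url}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path, start_url):
    """Return (visited_urls, next_url); fall back to a fresh start if no file exists."""
    if not os.path.exists(path):
        return set(), start_url
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return set(state["visited_urls"]), state["next_url"]


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    save_checkpoint(path, {"https://example.com/blog"}, "https://example.com/blog?page=2")
    visited, next_url = load_checkpoint(path, "https://example.com/blog")
    print(next_url)  # https://example.com/blog?page=2
```

In a class like PaginationScraper, the natural place to call save_checkpoint would be at the end of each loop iteration, once the next URL is known, with load_checkpoint called before the loop starts.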
Handling Numbered Page Parameters
Many sites use simple numeric page parameters. Here is a scraper optimized for that pattern:
import requests
from bs4 import BeautifulSoup
import time

class NumberedPageScraper:
    def __init__(self, base_url, param_name="page"):
        self.base_url = base_url
        self.param_name = param_name
        self.data = []

    def scrape_range(self, start=1, end=10):
        """Scrape a range of numbered pages."""
        for page_num in range(start, end + 1):
            # Build URL with page parameter
            url = f"{self.base_url}?{self.param_name}={page_num}"
            print(f"Scraping page {page_num}: {url}")
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"  Error: {e}")
                continue
            soup = BeautifulSoup(response.text, "html.parser")
            items = soup.select("div.item")
            if not items:
                print(f"  Page {page_num} is empty. Stopping.")
                break
            # Extract data from items
            for item in items:
                title = item.select_one("h3")
                if title:
                    self.data.append({
                        "title": title.get_text(strip=True),
                        "page": page_num
                    })
            print(f"  Found {len(items)} items")
            time.sleep(1)  # Be respectful
        return self.data

# Usage
scraper = NumberedPageScraper("https://example.com/search")
results = scraper.scrape_range(start=1, end=50)
print(f"Total: {len(results)} items")
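Building the URL with an f-string, as the scraper above does, assumes the base URL has no query string of its own. If it already carries parameters such as a search term, the result contains two "?" characters. A small helper built on urllib.parse avoids that; the name `with_page_param` is made up for this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_param(base_url, page_num, param_name="page"):
    """Return base_url with param_name set to page_num, keeping other query parameters."""
    parts = urlsplit(base_url)
    # Drop any existing page parameter, keep everything else in its original order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param_name
    ]
    query.append((param_name, str(page_num)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page_param("https://example.com/search", 3))
# https://example.com/search?page=3
print(with_page_param("https://example.com/search?q=red+shoes&page=1", 2))
# https://example.com/search?q=red+shoes&page=2
```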
Detecting Pagination Limits Safely
Always have a safety limit to prevent runaway scrapers:
import requests
from bs4 import BeautifulSoup
import time

def scrape_until_empty(start_url, max_pages=100, timeout_seconds=300):
    """Scrape pages until one returns no results, with safeguards."""
    data = []
    page = 1
    start_time = time.time()
    while page <= max_pages:
        # Timeout safety: stop if scraping takes too long
        if time.time() - start_time > timeout_seconds:
            print(f"Timeout: Scraping took more than {timeout_seconds}s")
            break
        url = f"{start_url}?page={page}"
        print(f"Page {page}: {url}")
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            # Network failure or per-request timeout: stop instead of crashing
            print(f"  Request failed: {e}. Stopping.")
            break
        if response.status_code != 200:
            print(f"  HTTP {response.status_code}. Stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select("div.item")
        if not items:
            print("  Empty page. Stopping.")
            break
        # Record one summary entry per page
        data.append({"page": page, "count": len(items)})
        print(f"  {len(items)} items")
        page += 1
        time.sleep(1)
    return data

results = scrape_until_empty("https://example.com/items", max_pages=50)
print(f"Scraped {len(results)} pages")
Always include a max_pages limit and optionally a timeout to prevent your scraper from running indefinitely on sites with broken pagination logic.
Key Takeaways
- Pagination has multiple patterns: next links, numbered pages, offset/limit, and infinite scroll.
- Always track visited pages to prevent duplicate extraction and infinite loops.
- Implement deduplication by checking URLs before adding data.
- Use delays between requests to be respectful to the server.
- Set maximum page limits and timeouts as safety valves against runaway scrapers.
Frequently Asked Questions
How do I know when pagination ends?
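Empty results, a missing next link, or a 404 for an out-of-range page usually indicate the end. A 403 more often means the site is blocking you, so treat it as an error rather than as the last page. Set a max_pages limit as a safety net. Some sites never signal the end of pagination clearly, so add an overall time limit as well rather than relying on a single stop condition.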
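One case where the end is not signalled clearly is a site that serves its last page again for any out-of-range page number, so the "empty page" check never fires. A possible guard is to fingerprint each page's extracted items and stop when a fingerprint repeats. This is a sketch under that assumption; `fetch_items` is a hypothetical callable that returns the item strings for a page number.

```python
import hashlib


def page_fingerprint(items):
    """Hash the extracted item strings so pages with identical content compare equal."""
    digest = hashlib.sha256()
    for item in items:
        digest.update(item.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def scrape_until_repeat(fetch_items, max_pages=100):
    """Collect items page by page; stop on an empty page or on repeated content."""
    seen = set()
    collected = []
    for page_num in range(1, max_pages + 1):
        items = fetch_items(page_num)
        if not items:
            break
        fingerprint = page_fingerprint(items)
        if fingerprint in seen:
            break  # same content as an earlier page: pagination has wrapped or stalled
        seen.add(fingerprint)
        collected.extend(items)
    return collected


# Fake site with 3 real pages that serves page 3 again for any higher number.
PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
print(scrape_until_repeat(lambda n: PAGES[min(n, 3)]))  # ['a', 'b', 'c', 'd', 'e']
```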
What if the "next" link points back to page 1?
Check if next_url is in visited_urls before following it. This prevents loops. If the site has broken pagination, set a max_pages limit.
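The visited_urls check only catches a loop if the same page always produces the same string. Links such as /blog/ and /blog, or the same query parameters in a different order, would slip past it. One hedged option is to normalize URLs before storing or comparing them. The rules below (trailing slash ignored, parameters sorted, fragment dropped) are assumptions that hold for many sites but not all, so verify them against the target site first.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so equivalent spellings map to one visited-set key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host; drop the fragment, which is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


visited_urls = set()
for link in [
    "https://example.com/blog?page=2&sort=new",
    "https://Example.com/blog/?sort=new&page=2#comments",
]:
    visited_urls.add(normalize_url(link))

print(len(visited_urls))  # 1
```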
How do I scrape pages faster without getting blocked?
Use a requests.Session to reuse connections, which removes the connection setup cost from every request after the first. Space requests with time.sleep() (1-3 seconds between pages). Vary your User-Agent. If blocks persist, the site may require authentication or IP rotation.
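When a site does start refusing requests, a fixed sleep is often not enough, and retrying immediately tends to make things worse. A common complement is exponential backoff with jitter. The sketch below is a generic illustration using only the standard library; `action` and `is_retryable` are hypothetical callables you would supply, for example a function that performs the request and a check for status 429.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5, rng=random.random):
    """Yield growing delays with full jitter: each is uniform between 0 and a rising cap."""
    for attempt in range(attempts):
        cap = min(max_delay, base * (factor ** attempt))
        yield rng() * cap


def call_with_backoff(action, is_retryable, sleep=time.sleep, **delay_options):
    """Run action(); while is_retryable(result) is true, wait and try again."""
    result = action()
    for delay in backoff_delays(**delay_options):
        if not is_retryable(result):
            break
        sleep(delay)
        result = action()
    return result


# Offline demonstration: two simulated rate-limit responses, then success.
statuses = iter([429, 429, 200])
waits = []
final = call_with_backoff(
    action=lambda: next(statuses),
    is_retryable=lambda status: status == 429,
    sleep=waits.append,  # record the delays instead of actually sleeping
)
print(final, len(waits))  # 200 2
```

The sleep function and random source are parameters so the logic can be exercised without waiting; in a real scraper the defaults apply.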
Can I use threading to parallelize pagination scraping?
Technically yes, but it increases risk of being blocked and causes race conditions with shared state (visited URLs, data list). Stick to sequential scraping unless you implement careful thread locking and request queuing (see Article 9).
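To make the race-condition point concrete, here is a minimal sketch of the locking involved. It only suits numbered pagination, where all page URLs are known in advance; a "next" link chain is inherently sequential. The class and function names are invented for this example, and `fetch_items` is a hypothetical callable that downloads and parses one page.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedState:
    """Visited set and result list guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._visited = set()
        self.results = []

    def claim(self, url):
        """Return True exactly once per URL; the check and the add happen under the lock."""
        with self._lock:
            if url in self._visited:
                return False
            self._visited.add(url)
            return True

    def add_results(self, items):
        with self._lock:
            self.results.extend(items)


def scrape_concurrently(urls, fetch_items, max_workers=4):
    """Fetch known page URLs in a small thread pool."""
    state = SharedState()

    def worker(url):
        if state.claim(url):
            state.add_results(fetch_items(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, urls))  # consume the iterator so worker exceptions surface
    return state.results


# Offline demonstration: every URL appears twice but is processed once.
pages = [f"https://example.com/items?page={n}" for n in range(1, 6)] * 2
print(len(scrape_concurrently(pages, lambda url: [url])))  # 5
```

Results arrive in completion order, not page order, so sort afterwards if order matters. Keeping max_workers small also limits the load on the server.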
What if the site uses JavaScript to load more content dynamically?
Static pagination (links/numbered pages) works with BeautifulSoup. If content loads via JavaScript (infinite scroll), you need Playwright (Article 5) to render the page in a browser before parsing.