
Web Scraping Pagination: Extract Multi-Page Data

Real-world scraping projects rarely end at a single page. Most websites split data across multiple pages using pagination: "next" links, numbered page parameters, or infinite-scroll loading. Building a scraper that reliably traverses pagination is a critical skill. This article covers the pagination patterns you are most likely to encounter, teaches you how to avoid infinite loops and duplicate data, and shows you how to manage scraper state (which pages you have visited) to resume interrupted jobs. You will build a pagination engine that gracefully handles edge cases like broken links, empty pages, and dynamic page counts.

I once spent 6 hours hunting a bug where my scraper was extracting 400,000 duplicate records. The culprit: I was not tracking visited pages, so when the site's pagination loop linked back to page 1, my scraper happily re-scraped everything. Implementing a visited-pages set reduced duplicates to zero. Pagination logic is simple, but the devil lives in the details.

Understanding Common Pagination Patterns

Websites use distinct pagination approaches. Here are the patterns you will encounter:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs

def scrape_pattern_1_next_link():
    """Pattern 1: Pagination via 'Next' link (common on blogs)."""

    visited_pages = set()
    current_url = "https://example.com/blog"
    all_data = []

    while current_url and current_url not in visited_pages:
        print(f"Scraping {current_url}")
        visited_pages.add(current_url)

        response = requests.get(current_url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract articles from current page
        for article in soup.select("article.post"):
            title = article.select_one("h2").get_text(strip=True)
            url = article.select_one("a").get("href")
            all_data.append({"title": title, "url": url})

        # Find the next page link
        next_link = soup.select_one("a.next-page")
        if next_link:
            # Make the URL absolute (handle relative links)
            current_url = urljoin(current_url, next_link.get("href"))
        else:
            current_url = None

    return all_data

def scrape_pattern_2_numbered_pages():
    """Pattern 2: Pagination via numbered pages (page=1, page=2, etc.)."""

    base_url = "https://example.com/search"
    all_data = []

    # Attempt pages 1-50; stop if a page returns no results
    for page_num in range(1, 51):
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}")

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract results
        items = soup.select("div.result-item")
        if not items:
            print(f"Page {page_num} is empty. Stopping.")
            break

        for item in items:
            title = item.select_one("h3").get_text(strip=True)
            all_data.append({"title": title})

        print(f" Found {len(items)} items on page {page_num}")

    return all_data

def scrape_pattern_3_offset_limit():
    """Pattern 3: Offset/limit pagination (offset=0&limit=20)."""

    base_url = "https://api.example.com/products"
    all_data = []
    offset = 0
    limit = 20

    while True:
        url = f"{base_url}?offset={offset}&limit={limit}"
        print(f"Fetching offset {offset}")

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        items = soup.select("div.product")
        if not items:
            break

        for item in items:
            title = item.select_one("h2").get_text(strip=True)
            all_data.append({"title": title})

        # Increment offset for next iteration
        offset += limit

    return all_data

# Test the patterns (modify URLs to real sites)
print("Pattern 1: Next Link")
print("Pattern 2: Numbered Pages")
print("Pattern 3: Offset/Limit")

Each pattern requires different logic to detect when pagination ends:

| Pattern | Detection | Example URL |
| --- | --- | --- |
| Next Link | Link absent or points back to first page | /blog?page=2 |
| Numbered Pages | Empty page or 404 status | /products?page=51 |
| Offset/Limit | Empty results list | /api/items?offset=100&limit=20 |
| Infinite Scroll | JavaScript loads more on scroll | Uses JSON API calls |
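
The infinite-scroll case has no code above, so here is a minimal sketch of one common way to handle it: call the JSON endpoint that the page itself requests while scrolling. The response shape (an `items` list plus a `next_cursor` field) and the function name `collect_cursor_pages` are assumptions for illustration, not a real API. The fetch step is passed in as a callable so the loop can run without a network; in practice it might wrap a `requests.get(...).json()` call.

```python
from typing import Callable, Optional


def collect_cursor_pages(
    fetch_json: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> list:
    """Follow a cursor-paginated JSON feed until it runs out.

    Assumed (hypothetical) response shape:
        {"items": [...], "next_cursor": "abc" or None}
    """
    items = []
    cursor = None          # None requests the first batch
    seen_cursors = set()   # guards against a feed that repeats a cursor

    for _ in range(max_pages):
        payload = fetch_json(cursor)
        batch = payload.get("items", [])
        if not batch:
            break          # an empty batch marks the end of the feed
        items.extend(batch)

        cursor = payload.get("next_cursor")
        if cursor is None or cursor in seen_cursors:
            break          # no further batch, or the feed looped
        seen_cursors.add(cursor)

    return items


# Stand-in for a real endpoint: three batches keyed by cursor.
FAKE_FEED = {
    None: {"items": [1, 2], "next_cursor": "a"},
    "a": {"items": [3, 4], "next_cursor": "b"},
    "b": {"items": [5], "next_cursor": None},
}

print(collect_cursor_pages(lambda cursor: FAKE_FEED[cursor]))  # [1, 2, 3, 4, 5]
```

The loop keeps the same two safeguards used for HTML pagination: a hard page cap and a check for repeated positions.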

Building a Robust Pagination Scraper

Here is a complete, production-grade pagination scraper with deduplication, error recovery, and logging:

import requests
from bs4 import BeautifulSoup
import time
import json
from urllib.parse import urljoin
from datetime import datetime

class PaginationScraper:
    def __init__(self, start_url, max_pages=None, delay_seconds=1):
        self.start_url = start_url
        self.max_pages = max_pages
        self.delay_seconds = delay_seconds
        self.visited_urls = set()
        self.seen_item_urls = set()  # absolute item URLs already collected
        self.data = []
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def fetch_page(self, url):
        """Fetch a page with error handling."""
        try:
            print(f"[{datetime.now().isoformat()}] Fetching {url}")
            response = self.session.get(url, timeout=15)
            response.raise_for_status()
            return response
        except requests.exceptions.Timeout:
            print(f" [TIMEOUT] {url}")
            return None
        except requests.exceptions.RequestException as e:
            print(f" [ERROR] {url}: {e}")
            return None

    def extract_data(self, soup):
        """Extract articles from the page."""
        items = soup.select("article.post")
        page_data = []

        for item in items:
            try:
                title = item.select_one("h2").get_text(strip=True)
                link = item.select_one("a").get("href")
            except AttributeError:
                # Missing elements; skip this item
                continue

            # Deduplicate by absolute URL. Stored URLs are absolute, so the
            # raw (possibly relative) href must be resolved before comparing.
            item_url = urljoin(self.start_url, link)
            if item_url in self.seen_item_urls:
                continue
            self.seen_item_urls.add(item_url)

            page_data.append({
                "title": title,
                "url": item_url,
                "scraped_at": datetime.now().isoformat()
            })

        return page_data

    def find_next_page(self, soup, current_url):
        """Find the next page URL."""
        next_link = soup.select_one("a.pagination-next")
        if next_link:
            next_url = urljoin(current_url, next_link.get("href"))
            # Avoid loops: don't revisit the same page
            if next_url not in self.visited_urls:
                return next_url
        return None

    def scrape(self):
        """Main pagination loop."""
        current_url = self.start_url
        pages_scraped = 0

        while current_url and (not self.max_pages or pages_scraped < self.max_pages):
            # Stop if already visited
            if current_url in self.visited_urls:
                print(f"[DUPLICATE] Already visited {current_url}. Stopping.")
                break

            self.visited_urls.add(current_url)

            # Fetch and parse
            response = self.fetch_page(current_url)
            if not response:
                print(f"[STOP] Could not fetch {current_url}")
                break

            soup = BeautifulSoup(response.text, "html.parser")

            # Extract data
            page_data = self.extract_data(soup)
            self.data.extend(page_data)
            pages_scraped += 1
            print(f" Extracted {len(page_data)} items from page {pages_scraped}")

            # Find next page
            current_url = self.find_next_page(soup, current_url)

            # Delay to avoid hammering the server
            if current_url:
                time.sleep(self.delay_seconds)

        return self.data

# Usage
scraper = PaginationScraper(
    start_url="https://example.com/blog",
    max_pages=10,
    delay_seconds=2
)
results = scraper.scrape()

print(f"\nTotal items scraped: {len(results)}")
print(f"Total pages visited: {len(scraper.visited_urls)}")

# Save to JSON
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

Key patterns in this scraper:

  • Visited tracking: prevents infinite loops and duplicate extraction.
  • Deduplication: checks URLs in extracted data before adding.
  • Error handling: gracefully skips failed requests and missing elements.
  • Delay: respectful spacing between requests.
  • Logging: timestamps and status messages for debugging.
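
The visited set in the scraper above lives only in memory, so an interrupted run starts over from the first page. One possible way to support resuming, sketched under assumptions: write the visited URLs and the next URL to a JSON file after each page, and load that file on startup. The helper names `save_state` and `load_state` and the file layout are hypothetical, not part of the scraper above.

```python
import json
import os
import tempfile


def save_state(path, visited_urls, next_url):
    """Write crawl progress to disk (temp file first, then rename)."""
    state = {"visited": sorted(visited_urls), "next_url": next_url}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    # Replacing in one step avoids leaving a half-written state file
    os.replace(tmp_path, path)


def load_state(path, start_url):
    """Return (visited_urls, next_url); start fresh if no state file exists."""
    if not os.path.exists(path):
        return set(), start_url
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return set(state["visited"]), state["next_url"]


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    state_path = os.path.join(tmp, "scraper_state.json")
    print(load_state(state_path, "https://example.com/blog"))
    save_state(
        state_path,
        {"https://example.com/blog"},
        "https://example.com/blog?page=2",
    )
    print(load_state(state_path, "https://example.com/blog"))
```

In a scraper like the one above, `save_state` would be called once per loop iteration, after the next page URL is known, and `load_state` would seed `visited_urls` and the starting URL before the loop begins.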

Handling Numbered Page Parameters

Many sites use simple numeric page parameters. Here is a scraper optimized for that pattern:

import requests
from bs4 import BeautifulSoup
import time

class NumberedPageScraper:
    def __init__(self, base_url, param_name="page"):
        self.base_url = base_url
        self.param_name = param_name
        self.data = []

    def scrape_range(self, start=1, end=10):
        """Scrape a range of numbered pages."""
        for page_num in range(start, end + 1):
            # Build URL with page parameter
            url = f"{self.base_url}?{self.param_name}={page_num}"
            print(f"Scraping page {page_num}: {url}")

            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f" Error: {e}")
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            items = soup.select("div.item")

            if not items:
                print(f" Page {page_num} is empty. Stopping.")
                break

            # Extract data from items
            for item in items:
                title = item.select_one("h3")
                if title:
                    self.data.append({
                        "title": title.get_text(strip=True),
                        "page": page_num
                    })

            print(f" Found {len(items)} items")
            time.sleep(1)  # Be respectful

        return self.data

# Usage
scraper = NumberedPageScraper("https://example.com/search")
results = scraper.scrape_range(start=1, end=50)
print(f"Total: {len(results)} items")
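
The URL in `scrape_range` is built by appending `?page=N` to the base URL, which only works when the base URL has no query string of its own. A small sketch of a more tolerant builder using `urllib.parse` follows; the helper name `with_page_param` is an assumption for illustration.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def with_page_param(url, page_num, param_name="page"):
    """Return url with param_name set to page_num, keeping other query params."""
    parts = urlparse(url)
    # Drop any existing value for the page parameter, keep everything else
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param_name
    ]
    query.append((param_name, str(page_num)))
    return urlunparse(parts._replace(query=urlencode(query)))


print(with_page_param("https://example.com/search?q=laptops", 3))
# https://example.com/search?q=laptops&page=3
print(with_page_param("https://example.com/search?page=1&sort=price", 2))
# https://example.com/search?sort=price&page=2
```

Because any existing page parameter is replaced rather than duplicated, the same helper can be applied to a URL copied straight from the browser.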

Detecting Pagination Limits Safely

Always have a safety limit to prevent runaway scrapers:

import requests
from bs4 import BeautifulSoup
import time

def scrape_until_empty(start_url, max_pages=100, timeout_seconds=300):
    """Scrape pages until one returns no results, with safeguards."""

    data = []
    page = 1
    start_time = time.time()

    while page <= max_pages:
        # Timeout safety: stop if scraping takes too long
        if time.time() - start_time > timeout_seconds:
            print(f"Timeout: Scraping took more than {timeout_seconds}s")
            break

        url = f"{start_url}?page={page}"
        print(f"Page {page}: {url}")

        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f" HTTP {response.status_code}. Stopping.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select("div.item")

        if not items:
            print(" Empty page. Stopping.")
            break

        # Record one summary entry per page
        data.append({"page": page, "count": len(items)})
        print(f" {len(items)} items")

        page += 1
        time.sleep(1)

    return data

results = scrape_until_empty("https://example.com/items", max_pages=50)
print(f"Scraped {len(results)} pages")

Always include a max_pages limit and optionally a timeout to prevent your scraper from running indefinitely on sites with broken pagination logic.

Key Takeaways

  • Pagination has multiple patterns: next links, numbered pages, offset/limit, and infinite scroll.
  • Always track visited pages to prevent duplicate extraction and infinite loops.
  • Implement deduplication by checking URLs before adding data.
  • Use delays between requests to be respectful to the server.
  • Set maximum page limits and timeouts as safety valves against runaway scrapers.

Frequently Asked Questions

How do I know when pagination ends?

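Empty results, a missing next link, or an HTTP error such as 404 usually indicate the end (a 403 may instead mean the scraper has been blocked). Set a max_pages limit as a safety net. Some sites do not signal the end of pagination clearly, so a hard cap on pages or elapsed time is safer than relying on a single stop condition.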

How do I avoid infinite loops?

Check whether next_url is already in visited_urls before following it. This prevents loops. If the site has broken pagination, also set a max_pages limit.
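
A visited-URL check only works if the same page always produces the same string. The sketch below shows one way to normalize URLs before the membership test, so that differences in fragment, query-parameter order, or host casing do not hide a revisit. The helper name `normalize_url` is hypothetical, and stripping the trailing slash is an assumption that holds for many sites but not all.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def normalize_url(url):
    """Canonical form of a URL for visited-set membership checks."""
    parts = urlparse(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Assumption: /blog and /blog/ are the same page on the target site
    path = parts.path.rstrip("/") or "/"
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        parts.params,
        query,
        "",  # fragments are not sent to the server, so drop them
    ))


visited_urls = set()
for candidate in (
    "https://Example.com/blog/?page=2&sort=new#top",
    "https://example.com/blog?sort=new&page=2",
):
    key = normalize_url(candidate)
    print(key, "already visited" if key in visited_urls else "new")
    visited_urls.add(key)
```

Both candidates reduce to `https://example.com/blog?page=2&sort=new`, so the second one is reported as already visited.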

How do I scrape pages faster without getting blocked?

Use a requests.Session to reuse connections. Space requests with time.sleep() (1-3 seconds between pages). Vary your User-Agent. If blocks persist, the site may require authentication or IP rotation.

Can I use threading to parallelize pagination scraping?

Technically yes, but it increases risk of being blocked and causes race conditions with shared state (visited URLs, data list). Stick to sequential scraping unless you implement careful thread locking and request queuing (see Article 9).
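
To make the race condition concrete: two threads can both see that a URL is missing from the visited set and then both scrape it. A minimal sketch of the locking mentioned above is a set whose check-and-add happens under one lock. The class name `VisitedSet` and its `claim` method are assumptions for illustration; real fetching is left out so the example runs offline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class VisitedSet:
    """Set of URLs with an atomic check-and-add."""

    def __init__(self):
        self._urls = set()
        self._lock = threading.Lock()

    def claim(self, url):
        """Return True only for the first caller to claim this url."""
        with self._lock:
            if url in self._urls:
                return False
            self._urls.add(url)
            return True

    def __len__(self):
        with self._lock:
            return len(self._urls)


visited = VisitedSet()
# 8 workers race to claim the same 5 page URLs, 20 attempts per URL
urls = [f"https://example.com/items?page={n}" for n in range(1, 6)] * 20
with ThreadPoolExecutor(max_workers=8) as pool:
    wins = sum(pool.map(visited.claim, urls))

print(wins, len(visited))  # 5 5: each URL is claimed exactly once
```

A worker would call `claim(url)` before fetching and skip the URL when it returns False. Results appended to a shared list need the same kind of protection, or each worker can return its own list to be merged afterwards.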

What if the site uses JavaScript to load more content dynamically?

Static pagination (links/numbered pages) works with BeautifulSoup. If content loads via JavaScript (infinite scroll), you need Playwright (Article 5) to render the page in a browser before parsing.

Further Reading