Handle Dynamic Pages with Playwright: JavaScript Rendering
Not all websites serve complete HTML on the initial page load. Modern single-page applications (SPAs), infinite-scroll sites, and JavaScript-heavy pages render content dynamically in the browser using frameworks like React, Vue, and Angular. BeautifulSoup can only parse static HTML; it cannot execute JavaScript. This is where Playwright enters the picture. Playwright automates a headless browser, renders JavaScript, waits for dynamic content to appear, and lets you interact with the page as a user would. This article teaches you to deploy Playwright for scraping modern web applications, handle asynchronous loading, and gracefully fall back to simpler approaches when full browser automation is overkill.
I spent weeks trying to scrape a React-based job board using BeautifulSoup before realizing the entire page was rendered client-side. Switching to Playwright and adding a wait for dynamic elements solved it in hours. For websites that render their content with JavaScript, a real browser driven by a tool like Playwright is often the only practical option.
Installing and Starting Playwright
Playwright is a cross-platform browser automation framework. Install it and the necessary browser binaries:
# Install via pip
# pip install playwright
# Then download browser binaries (one-time setup)
# python -m playwright install
# Or install a specific browser
# python -m playwright install chromium

from playwright.sync_api import sync_playwright

# The context manager shuts Playwright down when the block exits
with sync_playwright() as p:
    # Launch a headless Chromium browser
    browser = p.chromium.launch(headless=True)
    # Create a new page/tab
    page = browser.new_page()
    # Navigate to a URL
    page.goto("https://example.com")
    # Get the page title
    print(page.title())
    # Get the rendered HTML (after JavaScript execution)
    html = page.content()
    print(f"Page size: {len(html)} characters")
    # Close the page and browser
    page.close()
    browser.close()
Key points:
- sync_playwright() is the synchronous API (easier for beginners); async_playwright() is the asyncio-based equivalent.
- headless=True runs the browser without a visible UI (faster).
- page.goto() loads a URL and, by default, waits for the page's load event.
- page.content() returns the HTML of the current DOM, including any changes JavaScript has made so far.
Waiting for Dynamic Content
JavaScript often loads content asynchronously. Playwright provides multiple waiting strategies:
from bs4 import BeautifulSoup
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright
import time

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headless=False to see what happens
    page = browser.new_page()
    page.goto("https://example.com/dynamic-list")

    # Strategy 1: Wait for a specific selector to appear
    try:
        # Wait up to 10 seconds for elements with class "item" to appear
        page.wait_for_selector("div.item", timeout=10000)
        print("Items loaded!")
    except PlaywrightTimeoutError:
        print("Items did not load within timeout")

    # Strategy 2: Wait for a function to return a truthy value
    page.wait_for_function(
        "() => document.querySelectorAll('div.item').length >= 5",
        timeout=10000
    )
    print("At least 5 items are now present in the DOM")

    # Strategy 3: Wait for navigation (after clicking a link)
    # The Python API has no page.wait_for_navigation(); wrap the action that
    # triggers the navigation in expect_navigation() instead. When the target
    # URL is known, page.wait_for_url(pattern) is an alternative.
    with page.expect_navigation():
        page.click("a.next-page")
    print("Navigation completed")

    # Strategy 4: Wait a fixed time (use sparingly; prefer above strategies)
    time.sleep(2)

    # Now parse the rendered HTML with BeautifulSoup
    html = page.content()
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select("div.item")
    for item in items:
        title = item.select_one("h3")
        if title:
            print(title.get_text(strip=True))
    browser.close()
Waiting strategies:
| Strategy | Use Case |
|---|---|
| wait_for_selector(selector) | Wait for an element to appear |
| wait_for_function(js_function) | Wait for a custom JavaScript condition |
| expect_navigation() | Wait for a navigation triggered inside its with block |
| wait_for_load_state("networkidle") | Wait until there has been no network activity for at least 500 ms |
| time.sleep(seconds) | Fixed delay (use as fallback) |
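The selector wait and the load-state wait from the table can be combined into one small helper. The sketch below is one possible arrangement rather than anything Playwright ships: `render_html` and its parameter names are hypothetical, and `page` is assumed to be a Playwright sync `Page`.

```python
def render_html(page, url, ready_selector, timeout_ms=10_000, settle=False):
    """Load `url`, wait until `ready_selector` is visible, return the HTML.

    `page` is assumed to be a Playwright sync Page. Playwright's
    TimeoutError propagates if the selector never becomes visible.
    """
    # Return from goto() as soon as the DOM is parsed; the selector wait
    # below is what signals that the dynamic content has arrived.
    page.goto(url, wait_until="domcontentloaded")
    page.wait_for_selector(ready_selector, state="visible", timeout=timeout_ms)
    if settle:
        # Optional and slower: also wait for the network to go quiet.
        page.wait_for_load_state("networkidle")
    return page.content()
```

With a real page this would be called as `html = render_html(page, "https://example.com/dynamic-list", "div.item")`, and the returned string can be handed to BeautifulSoup as in the example above.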
Scraping Infinite Scroll Pages
Pages that load more content as you scroll require scrolling and waiting between loads:
from bs4 import BeautifulSoup
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright
import time

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/infinite-scroll")
    all_items = []
    previous_count = 0

    # Scroll multiple times to load more content
    for scroll_iteration in range(5):
        # Scroll to the bottom
        page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
        # Wait for new items to load (up to 5 seconds)
        try:
            page.wait_for_function(
                f"() => document.querySelectorAll('div.item').length > {previous_count}",
                timeout=5000
            )
        except PlaywrightTimeoutError:
            # Nothing new appeared, so assume the end of the list was reached
            print("No new items loaded; stopping")
            break
        # Parse the current page content
        soup = BeautifulSoup(page.content(), "html.parser")
        items = soup.select("div.item")
        current_count = len(items)
        print(f"Iteration {scroll_iteration + 1}: Found {current_count} items total")
        previous_count = current_count
        # Small delay before next scroll
        time.sleep(1)

    # Extract final data
    soup = BeautifulSoup(page.content(), "html.parser")
    for item in soup.select("div.item"):
        title = item.select_one("h3")
        if title:
            all_items.append(title.get_text(strip=True))
    print(f"Total items: {len(all_items)}")
    browser.close()
This pattern scrolls, waits for new items, and repeats until no more content loads or the scroll limit (five iterations here) is reached.
Interacting with Pages: Clicks, Forms, and Input
Playwright can fill forms, click buttons, and trigger interactions:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/search")

    # Fill a text input
    page.fill("input#search-box", "python web scraping")
    # Click a button
    page.click("button#search-button")
    # Wait for results to load
    page.wait_for_selector("div.result", timeout=5000)

    # Read the page content
    soup = BeautifulSoup(page.content(), "html.parser")
    results = soup.select("div.result")
    print(f"Found {len(results)} search results")

    # Select a dropdown option
    page.select_option("select#category", "articles")
    # Wait for new results
    page.wait_for_load_state("networkidle")

    # Take a screenshot for debugging
    page.screenshot(path="screenshot.png")
    browser.close()
Common interactions:
- page.fill(selector, text): fill a text input.
- page.click(selector): click an element.
- page.select_option(selector, value): select a dropdown option.
- page.check(selector): check a checkbox.
- page.wait_for_load_state("networkidle"): wait for network activity to finish.
- page.screenshot(path=...): capture a screenshot.
A Complete Dynamic Scraper Example
Here is a realistic scraper that handles a React-based product listing:
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json
import time


class DynamicScraper:
    def __init__(self, start_url):
        self.start_url = start_url
        self.data = []

    def scrape(self):
        with sync_playwright() as p:
            # Launch browser
            browser = p.chromium.launch(headless=True)
            # Set the user agent when the page is created so that the HTTP
            # header and navigator.userAgent report the same value
            page = browser.new_page(
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
            )
            page.goto(self.start_url, wait_until="networkidle")

            # Wait for initial product list
            try:
                page.wait_for_selector("div.product-card", timeout=10000)
            except PlaywrightTimeoutError:
                print("Products did not load")
                browser.close()
                return []

            # Scroll and load all products
            for iteration in range(10):
                # Scroll to bottom
                page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
                # Wait a bit for content to load
                time.sleep(1)
                # Check if we can scroll further
                can_scroll = page.evaluate("""
                    () => {
                        return window.innerHeight + window.scrollY < document.body.offsetHeight;
                    }
                """)
                if not can_scroll:
                    print("Reached the end of the page")
                    break

            # Parse the final HTML
            soup = BeautifulSoup(page.content(), "html.parser")
            for card in soup.select("div.product-card"):
                try:
                    title = card.select_one("h2").get_text(strip=True)
                    price = card.select_one("span.price").get_text(strip=True)
                    link = card.select_one("a").get("href")
                    self.data.append({
                        "title": title,
                        "price": price,
                        "link": link
                    })
                except AttributeError:
                    # Missing elements; skip
                    continue
            browser.close()
            return self.data


# Usage
scraper = DynamicScraper("https://example.com/products")
products = scraper.scrape()
print(f"Scraped {len(products)} products")
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2)
Playwright Headless Mode and Performance
Headless mode is faster, but it gives you nothing to watch while debugging. You can switch to a visible browser window and take screenshots:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False shows the browser window (useful for debugging)
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # Set viewport size (affects responsive design)
    page.set_viewport_size({"width": 1920, "height": 1080})
    page.goto("https://example.com")
    # Take a screenshot for visual inspection
    page.screenshot(path="desktop.png")
    # Resize for mobile testing
    page.set_viewport_size({"width": 375, "height": 812})
    page.screenshot(path="mobile.png")
    browser.close()
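A further speed-up that this article does not otherwise cover is to skip resources a scraper rarely needs, such as images and fonts. The sketch below uses Playwright's request routing (`page.route`, `route.abort`, `route.continue_`); the helper names and the choice of blocked resource types are assumptions made for illustration, and some sites may misbehave if these requests are blocked.

```python
# Resource types a text-focused scraper can usually do without (an assumption;
# tune this set per site).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def block_heavy_resources(route):
    """Route handler: abort heavy requests, let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


def enable_resource_blocking(page):
    """Install the handler for every request made by `page`.

    `page` is assumed to be a Playwright sync Page; call this before goto().
    """
    page.route("**/*", block_heavy_resources)
```

Calling `enable_resource_blocking(page)` right after `browser.new_page()` applies the filter to all later navigations on that page.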
When NOT to Use Playwright
Playwright is powerful but slower than BeautifulSoup. Use it only when necessary:
- Sites with JavaScript-rendered content: use Playwright.
- Sites with static HTML and next-page links: use BeautifulSoup.
- APIs that return JSON: use requests and json.loads().
- Large-scale scraping (1M+ pages): optimize with BeautifulSoup or APIs first; add Playwright only if needed.
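One way to decide between the first two cases is to look at the raw HTML the server returns and check whether the elements you want are already there. The sketch below does this with the standard library only; `ClassCounter`, `needs_browser`, and the `div`/`item` example are hypothetical names chosen for illustration, and the check is a heuristic rather than a guarantee.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # A class attribute can hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def needs_browser(static_html, tag, css_class):
    """Return True when the target elements are absent from the static HTML."""
    counter = ClassCounter(tag, css_class)
    counter.feed(static_html)
    return counter.count == 0
```

If `needs_browser(raw_html, "div", "item")` returns False for HTML fetched with a plain HTTP client, BeautifulSoup alone is probably enough; if it returns True, the content is likely rendered client-side and Playwright is worth its cost.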
Key Takeaways
- Playwright automates headless browsers and executes JavaScript, enabling scraping of SPAs and dynamic content.
- Always wait for elements to load before parsing; use wait_for_selector() or wait_for_function().
- Infinite-scroll pages require scrolling loops with waits between iterations.
- Playwright can fill forms, click buttons, and interact with pages like a user.
- Use headless=False and screenshots for debugging dynamic content issues.
Frequently Asked Questions
What is the performance difference between Playwright and BeautifulSoup?
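Playwright is typically 10-50x slower because it launches a full browser, downloads the page's sub-resources, and executes its JavaScript. As a rough guide, rendering a single page takes 2-5 seconds, whereas fetching static HTML with an HTTP client and parsing it with BeautifulSoup takes 0.1-0.5 seconds; exact figures depend on the site and the network. Use Playwright only when JavaScript rendering is essential.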
Can I use Playwright with multiple browser instances in parallel?
Yes, but carefully. Each browser instance consumes significant memory and CPU, so limit yourself to 2-4 parallel instances and manage them with a small pool. Opening several pages or contexts inside one browser is lighter than launching several browsers. For massive parallelism, consider headless browser services like Browserless or ScraperAPI.
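A minimal sketch of capped parallelism with the async API follows. It swaps in a lighter variant of the advice above: several pages inside one browser, limited by an `asyncio.Semaphore`, instead of several browser processes. The function names and the `max_parallel` default are hypothetical, and `browser` is assumed to be a Playwright async `Browser`.

```python
import asyncio


async def fetch_title(browser, url, semaphore):
    """Open one page, return its title, and always close the page."""
    async with semaphore:  # at most `max_parallel` pages are open at once
        page = await browser.new_page()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()


async def fetch_titles(browser, urls, max_parallel=3):
    """Fetch titles for all URLs, preserving input order."""
    semaphore = asyncio.Semaphore(max_parallel)
    tasks = (fetch_title(browser, url, semaphore) for url in urls)
    return await asyncio.gather(*tasks)
```

In a real script the browser would come from `async with async_playwright() as p:` followed by `browser = await p.chromium.launch()`, and the whole coroutine would be started with `asyncio.run(...)`.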
How do I handle authentication (login) with Playwright?
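Fill and submit the login form using page.fill() and page.click(), then wait for the post-login page to load. Optionally, save cookies and local storage with context.storage_state(path="state.json") and pass that file to browser.new_context(storage_state="state.json") to reuse the session without logging in again.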
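The login-then-reuse flow might look like the sketch below. The form selectors, function names, and the idea of checking for the state file on disk are all assumptions for illustration; `browser` is assumed to be a Playwright sync `Browser`, and real login forms often need different selectors or an extra wait.

```python
from pathlib import Path


def login_and_save_state(browser, login_url, username, password, state_path):
    """Log in once and write cookies and local storage to `state_path`."""
    context = browser.new_context()
    page = context.new_page()
    page.goto(login_url)
    # Hypothetical selectors: adjust them to the real login form.
    page.fill("input[name='username']", username)
    page.fill("input[name='password']", password)
    page.click("button[type='submit']")
    # Wait for the post-login page to finish loading before saving state.
    page.wait_for_load_state("networkidle")
    context.storage_state(path=str(state_path))
    context.close()


def context_with_saved_state(browser, state_path):
    """Return a context that reuses the saved session when the file exists."""
    if Path(state_path).exists():
        return browser.new_context(storage_state=str(state_path))
    return browser.new_context()
```

Pages created from the context returned by `context_with_saved_state` start with the stored cookies, so the login form only has to be submitted when the state file is missing or the session has expired.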
What if a page has infinite scroll that never ends?
Set a maximum iteration count or monitor the number of new items loaded. If previous_count == current_count for two iterations, stop scrolling (no new content loaded).
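The stopping rule described above can be sketched as a small helper. The function name, the `patience` parameter, and the defaults are hypothetical; `page` is assumed to be a Playwright sync `Page`, and the item count is read with `page.locator(...).count()`.

```python
def scroll_until_stable(page, item_selector, max_scrolls=50, patience=2, pause_ms=1000):
    """Scroll until the item count stops growing or `max_scrolls` is reached.

    Returns the final number of elements matching `item_selector`.
    """
    previous_count = page.locator(item_selector).count()
    stalled = 0  # consecutive scrolls that loaded nothing new
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        current_count = page.locator(item_selector).count()
        if current_count == previous_count:
            stalled += 1
            if stalled >= patience:
                break  # no growth for `patience` scrolls in a row
        else:
            stalled = 0
        previous_count = current_count
    return previous_count
```

After `page.goto(...)`, a call such as `scroll_until_stable(page, "div.item")` leaves the page fully scrolled, and `page.content()` can then be parsed once. The hard cap protects against feeds that genuinely never end.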
Can I run Playwright on a headless server without a display?
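Yes. Playwright works on servers and CI/CD systems without X11. Use headless=True (the default). Ensure browser binaries are installed: python -m playwright install. On a bare Linux server the browsers may also need system libraries, which python -m playwright install --with-deps can install.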
Further Reading
- Playwright Python Documentation — Official API reference and tutorials.
- Playwright: Waiting for Elements — In-depth guide to wait strategies.
- Web Scraping with Playwright — Advanced patterns and best practices.
- Puppeteer (Node.js equivalent) — Reference for similar capabilities in JavaScript.