Web scraping with requests and BeautifulSoup

HTML pages are blobs of markup meant for browsers, while automation wants structured facts: titles, tables, nightly prices. Combine requests (HTTP transport) with BeautifulSoup (HTML parsing), but do so cautiously: respect robots.txt, terms of service, and the rate limits introduced earlier in **Automation intro**.

pip install requests beautifulsoup4

📚 Prerequisites

  • Basic understanding of URLs and HTTP verbs (GET).

🎯 What you'll master

  • Issue GET requests with timeouts and sensible headers (User-Agent identifying yourself).
  • Navigate DOM trees via CSS selectors for stability.
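
As a minimal sketch of the first goal, the helpers below attach an identifying User-Agent to a requests.Session and apply a timeout to every GET. The names make_session and fetch, the User-Agent string, and the contact address are illustrative assumptions, not part of any library.

```python
import requests

# Hypothetical identifier; replace with your own project name and contact address.
USER_AGENT = "example-scraper/0.1 (contact: you@example.org)"


def make_session(user_agent: str = USER_AGENT) -> requests.Session:
    """Return a Session that sends an identifying User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session: requests.Session, url: str, timeout: float = 10.0) -> requests.Response:
    """GET url, giving up after `timeout` seconds and raising on 4xx/5xx."""
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp
```

Calling fetch(make_session(), "https://example.org") would then return the response object used in the skeleton that follows. Reusing one session also reuses the underlying connection across requests to the same host.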

Fetch + parse skeleton

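import requests
from bs4 import BeautifulSoup

url = "https://example.org"
# A timeout stops the script from hanging on an unresponsive server.
resp = requests.get(url, timeout=10)
# Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
resp.raise_for_status()

# "html.parser" ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(resp.text, "html.parser")
# select_one returns the first element matching the CSS selector, or None.
headline = soup.select_one("h1")
print(headline.get_text(strip=True) if headline else "missing")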

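Never assume that a 200 status means you received the HTML you expected: a server can return 200 for a login wall, an error page, or an empty shell. Log any response whose length is unexpected.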
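
One way to act on this advice is a small sanity check that runs before parsing. The sketch below is an assumption-laden illustration: the function name looks_like_html, the logger name, and the 500-character threshold are all hypothetical and should be tuned to the pages you actually fetch.

```python
import logging

logger = logging.getLogger("scraper")

# Hypothetical threshold; pick a value based on the pages you normally receive.
MIN_EXPECTED_CHARS = 500


def looks_like_html(content_type: str, body: str,
                    min_chars: int = MIN_EXPECTED_CHARS) -> bool:
    """Return False (and log a warning) when a response looks suspicious."""
    ok = True
    if "html" not in content_type.lower():
        logger.warning("Unexpected Content-Type: %r", content_type)
        ok = False
    if len(body) < min_chars:
        logger.warning("Body is only %d characters long", len(body))
        ok = False
    return ok
```

With the skeleton above, the check could be called as looks_like_html(resp.headers.get("Content-Type", ""), resp.text) before handing resp.text to BeautifulSoup.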


Tables to rows

Many legacy intranet reports expose <table> blocks. Iterate over each <tr> and call .find_all("td") on it to collect the cells, or reach for pandas.read_html downstream when permissible.
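
The row-by-row approach can be sketched as follows. The sample markup and the function name table_rows are made up for illustration; the sketch assumes a single, non-nested table and skips header rows because they contain <th> cells rather than <td>.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a fetched report page.
SAMPLE_HTML = """
<table>
  <tr><th>Room</th><th>Nightly price</th></tr>
  <tr><td>Single</td><td>80</td></tr>
  <tr><td>Double</td><td>120</td></tr>
</table>
"""


def table_rows(html: str) -> list[list[str]]:
    """Return the text of every <td> cell, grouped by <tr>, for the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows made only of <th> cells yield an empty list
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)  # [['Single', '80'], ['Double', '120']]
```

Cell values come back as strings, so convert prices to numbers explicitly before doing arithmetic on them.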


Ethics checklist

  1. Prefer official APIs from Working with APIs.
  2. Throttle politely (time.sleep) and cache responses for development.
  3. Attribute sources when republishing scraped metrics.
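
Item 2 can be sketched with a small wrapper that sleeps before each real request and serves repeats from a cache. Everything here is an assumption for illustration: the name polite_get, the in-memory dict cache (lost when the script exits), and the one-second default delay. The fetch argument is any callable that takes a URL and returns the page text, for example lambda u: requests.get(u, timeout=10).text.

```python
import time
from typing import Callable, Dict


def polite_get(url: str,
               fetch: Callable[[str], str],
               cache: Dict[str, str],
               delay_seconds: float = 1.0) -> str:
    """Return the body for url, hitting the network at most once per URL."""
    if url in cache:
        return cache[url]          # cache hit: no request, no delay
    time.sleep(delay_seconds)      # throttle before every real request
    body = fetch(url)
    cache[url] = body
    return body
```

During development, keeping the cache dict alive between runs (for instance in a notebook session) lets you iterate on parsing code without re-requesting the same pages.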

💡 Key takeaways

  • Selectors that reference semantic classes degrade more slowly than brittle XPath expressions copied from DevTools snapshots.
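
A small illustration of this takeaway, under stated assumptions: BeautifulSoup does not evaluate XPath, so a positional CSS selector stands in for the kind of path DevTools generates. Both HTML snippets and the class name price are invented for the example.

```python
from bs4 import BeautifulSoup

# Invented page layout before a redesign.
BEFORE = """
<body>
  <div><p>Menu</p></div>
  <div><span class="price">42</span></div>
</body>
"""

# The same page after a banner is added above the menu.
AFTER = """
<body>
  <div class="banner">Sale</div>
  <div><p>Menu</p></div>
  <div><span class="price">42</span></div>
</body>
"""

POSITIONAL = "body > div:nth-of-type(2) > span"  # depends on element order
SEMANTIC = "span.price"                           # depends on meaning


def extract(html: str, selector: str):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else None


print(extract(BEFORE, POSITIONAL), extract(AFTER, POSITIONAL))  # 42 None
print(extract(BEFORE, SEMANTIC), extract(AFTER, SEMANTIC))      # 42 42
```

The positional selector silently stops matching once the layout shifts, while the class-based selector keeps working for as long as the site keeps that class name.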

➡️ Next steps

Call structured endpoints in Working with APIs using requests.