Web scraping with requests and BeautifulSoup
HTML pages are markup written for browsers, while automation wants structured facts: titles, tables, nightly prices. Pair requests (HTTP transport) with BeautifulSoup (HTML parsing), and scrape cautiously: respect robots.txt, terms of service, and the rate limits introduced earlier in **Automation intro**.
```shell
pip install requests beautifulsoup4
```
📚 Prerequisites
- Basic understanding of URLs and HTTP verbs (GET).
🎯 What you'll master
- Issue GET requests with timeouts and sensible headers (a User-Agent identifying yourself).
- Navigate DOM trees via CSS selectors for stability.
Fetch + parse skeleton
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.org"
resp = requests.get(url, timeout=10)  # never wait forever on a slow server
resp.raise_for_status()               # raise on 4xx/5xx responses

soup = BeautifulSoup(resp.text, "html.parser")
headline = soup.select_one("h1")      # first <h1>, or None if absent
print(headline.get_text(strip=True) if headline else "missing")
```
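A 200 status does not guarantee the HTML you expected: error pages, login walls, and empty bodies can all arrive with status 200, so log any response whose length is unexpected.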
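One way to extend the skeleton is to send an identifying User-Agent and to sanity-check the body before parsing it. The sketch below is a minimal illustration, not a definitive implementation: the `HEADERS` value, the `MIN_EXPECTED_BYTES` threshold, and the helper names `looks_like_html` and `fetch_html` are assumptions made up for this example.

```python
import logging
from typing import Optional

import requests

logger = logging.getLogger("scraper")

# Hypothetical identity string; replace with your own project and contact.
HEADERS = {"User-Agent": "example-scraper/0.1 (contact: you@example.org)"}
# Assumed threshold for a "suspiciously short" page; tune it per site.
MIN_EXPECTED_BYTES = 200


def looks_like_html(resp) -> bool:
    """Return True when the response plausibly carries an HTML page."""
    content_type = resp.headers.get("Content-Type", "")
    if "html" not in content_type.lower():
        logger.warning("Unexpected Content-Type %r for %s", content_type, resp.url)
        return False
    if len(resp.content) < MIN_EXPECTED_BYTES:
        logger.warning("Short body (%d bytes) for %s", len(resp.content), resp.url)
        return False
    return True


def fetch_html(url: str) -> Optional[str]:
    """GET a page with an identifying User-Agent; return its text or None."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text if looks_like_html(resp) else None
```

Calling `fetch_html("https://example.org")` would then return the page text, or `None` with a logged warning when the response does not look like a full HTML page.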
Tables to rows
Many legacy intranet reports expose their data in <table> blocks. Iterate over each <tr> and collect its cells with .find_all("td"), or hand the markup to pandas.read_html downstream when permissible.
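The row-by-row approach can be sketched as follows. The sample markup, the `table#prices` selector, and the `table_to_rows` helper are invented for illustration, and the sketch assumes the only <th> cells are the column headers in the first row.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a fetched report page.
SAMPLE_HTML = """
<table id="prices">
  <tr><th>Room</th><th>Nightly price</th></tr>
  <tr><td>Single</td><td>80</td></tr>
  <tr><td>Double</td><td>120</td></tr>
</table>
"""


def table_to_rows(html: str, selector: str = "table") -> list:
    """Turn the first table matching `selector` into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        return []
    # Assumes every <th> in the table is a column header.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


rows = table_to_rows(SAMPLE_HTML, "table#prices")
for row in rows:
    print(row)
```

Each row comes back as a dict keyed by column header, with every value still a string; converting prices to numbers is left to the caller.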
Ethics checklist
- Prefer an official API (see **Working with APIs**) over scraping whenever one exists.
- Throttle politely (time.sleep) and cache responses for development.
- Attribute sources when republishing scraped metrics.
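Throttling and caching can be combined in one small wrapper. The sketch below is one possible shape, not a prescribed design: `PoliteFetcher`, its `min_delay` parameter, and the `fake_fetch` stand-in are hypothetical names, and the cache lives only in memory for the duration of the process.

```python
import time
from typing import Callable, Dict, List, Optional


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and an in-memory cache."""

    def __init__(self, fetch: Callable[[str], str], min_delay: float = 1.0):
        self._fetch = fetch
        self._min_delay = min_delay
        self._cache: Dict[str, str] = {}
        # Monotonic timestamp of the last real (non-cached) fetch.
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        if url in self._cache:  # cached pages cost the server nothing
            return self._cache[url]
        if self._last_request is not None:
            wait = self._min_delay - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)  # space out consecutive real requests
        body = self._fetch(url)
        self._last_request = time.monotonic()
        self._cache[url] = body
        return body


# Stand-in for a real network call so the sketch runs offline.
calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(fake_fetch, min_delay=0.2)
fetcher.get("https://example.org/a")
fetcher.get("https://example.org/a")  # served from the cache, no fetch
fetcher.get("https://example.org/b")  # waits out the rest of the delay
print(len(calls), "real fetches")
```

In real use the fetch function could be something like `lambda u: requests.get(u, timeout=10).text`. Injecting it keeps the throttling logic testable without touching the network.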
💡 Key takeaways
- Selectors that target semantic class names break less often than long positional XPath expressions copied from a DevTools snapshot.
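A small experiment illustrates the point. BeautifulSoup does not evaluate XPath, so this sketch swaps in a positional CSS selector as a stand-in for a copied XPath; both page snippets and the `extract_price` helper are made up for the example.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Two made-up versions of one page: the redesign adds a wrapper <div>.
OLD_PAGE = '<body><div><span class="price">80</span></div></body>'
NEW_PAGE = (
    '<body><div class="wrap"><div>'
    '<span class="price">80</span>'
    "</div></div></body>"
)

SEMANTIC = "span.price"           # targets what the element means
POSITIONAL = "body > div > span"  # mirrors the exact nesting at copy time


def extract_price(html: str, selector: str) -> Optional[str]:
    """Return the text of the first match, or None when nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else None


for label, page in [("old", OLD_PAGE), ("new", NEW_PAGE)]:
    print(label, extract_price(page, SEMANTIC), extract_price(page, POSITIONAL))
```

The semantic selector finds the price in both layouts, while the positional one stops matching as soon as the nesting changes.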
➡️ Next steps
Continue to **Working with APIs** to call structured endpoints with requests.