Python Web Scraping Tutorial: Start Here
Web scraping is the automated extraction of structured data from websites by fetching their HTML and parsing it into usable information. Python dominates web scraping because of libraries like requests (HTTP), beautifulsoup4 (HTML parsing), and playwright (JavaScript rendering). In this introductory article, you will learn what scraping is and why it matters, then build your first working scraper, which downloads a web page and extracts meaningful data in under 50 lines of code.
I started building web scrapers in 2015 to aggregate real estate listings across multiple sites. What began as shell scripts evolved into a Python-based system that now processes millions of data points monthly. The fundamentals have remained unchanged: fetch a page, parse the HTML, extract fields, store the data. Mastering these basics is the foundation for everything from simple hobby projects to enterprise data pipelines.
What Is Web Scraping and Why Does It Matter?
Web scraping is the automated, programmatic extraction of data from websites. A scraper sends an HTTP request to a web server, receives the HTML response, parses it using tools like BeautifulSoup, and extracts structured data (titles, prices, links, text, etc.) into formats like CSV or JSON. Unlike using a website's official API (if one exists), scraping works on any HTML page, making it flexible but requiring more careful design.
Real-world use cases include:
- Price monitoring: tracking competitor prices across e-commerce sites.
- Lead generation: extracting contact information from business directories.
- Research: collecting data for academic papers, market analysis, or trend analysis.
- Content aggregation: gathering news articles, job postings, or property listings.
- SEO analysis: monitoring search rankings, backlinks, or on-page metadata.
Web scraping differs from APIs: APIs are designed for programmatic access and provide structured data; scraping relies on parsing HTML, which is fragile to site redesigns but works when APIs are unavailable or behind authentication.
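To make the contrast concrete, the sketch below reads the same record twice: once from a JSON payload, as an API might return it, and once from an HTML fragment, as a scraper would see it. Both inputs, the field names, and the CSS classes are made up for illustration.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical API response: the structure is already explicit.
api_payload = '{"title": "Blue Widget", "price": "19.99"}'

# Hypothetical HTML for the same product: the structure must be inferred from markup.
html_page = """
<div class="product">
  <h2 class="name">Blue Widget</h2>
  <span class="price">19.99</span>
</div>
"""


def from_api(payload: str) -> dict:
    """Decode the JSON payload; the keys are part of the API contract."""
    data = json.loads(payload)
    return {"title": data["title"], "price": data["price"]}


def from_html(html: str) -> dict:
    """Locate each field by CSS selector; this breaks if the class names change."""
    soup = BeautifulSoup(html, "html.parser")
    product = soup.select_one("div.product")
    return {
        "title": product.select_one("h2.name").get_text(strip=True),
        "price": product.select_one("span.price").get_text(strip=True),
    }


print(from_api(api_payload))
print(from_html(html_page))
```

Both functions return the same dictionary, but only the second one depends on class names that a site redesign can change without notice.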
HTTP Requests: How Your Scraper Fetches a Page
The requests library in Python makes HTTP communication simple. It handles the low-level socket details and gives you clean, Pythonic methods to GET, POST, and process responses. Here is how a basic fetch works:
import requests
# Step 1: Send a GET request to a URL (the timeout stops the call from hanging)
response = requests.get("https://example.com", timeout=10)
# Step 2: Check if the request succeeded
if response.status_code == 200:
    print("Success! Page fetched.")
    print(f"Content length: {len(response.text)} characters")
else:
    print(f"Error: Status code {response.status_code}")
# Step 3: Access the HTML as a string
html = response.text
print(html[:500])  # Print the first 500 characters
The status code 200 means success; 404 means not found; 403 means forbidden. Always check the status before processing the response. The response.text attribute gives you the HTML as a string.
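If you want log messages that are more readable than a bare number, the standard library already knows the official phrase for each code. The helper below is a small sketch; the name describe_status and the category labels are my own choices, not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. 404 -> "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not defined in the standard library

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 403, 404):
    print(describe_status(code))
```

In a scraper you could call it as describe_status(response.status_code) before deciding whether to parse the body.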
Parsing HTML with BeautifulSoup
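BeautifulSoup parses HTML into a traversable tree. You find elements with CSS selectors (select(), select_one()) or with navigation methods such as find() and find_all(), then extract their text or attributes. Here is an example that scrapes a list of articles; the URL and class names are placeholders for your target page's markup: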
from bs4 import BeautifulSoup
import requests
# Fetch the page and stop early on an error status
response = requests.get("https://example.com/blog", timeout=10)
response.raise_for_status()
html = response.text
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
# Find all article elements using a CSS selector
articles = soup.select("article.post")
# Extract data from each article
for article in articles:
    # Use .select_one() to find the first matching element
    title_elem = article.select_one("h2.post-title")
    date_elem = article.select_one("span.post-date")
    # Extract text from the element
    title = title_elem.get_text(strip=True) if title_elem else "No title"
    date = date_elem.get_text(strip=True) if date_elem else "No date"
    # Extract an attribute (e.g., the href from a link)
    link_elem = article.select_one("a.post-link")
    url = link_elem.get("href") if link_elem else "No URL"
    print(f"Title: {title}")
    print(f"Date: {date}")
    print(f"URL: {url}")
    print("---")
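The .select() method returns a list of all elements matching a CSS selector. The .select_one() method returns the first match, or None if nothing matches, which is why the example guards every lookup. Calling .get_text(strip=True) strips leading and trailing whitespace from each piece of text inside the element. The .get() method reads HTML attributes like href, class, or id and returns None when the attribute is absent.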
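The following sketch exercises these four methods on an inline HTML string, so you can run it without a network connection. The markup is invented for the demonstration. It also shows two details that are easy to miss: nested text fragments are joined without a separator unless you pass one, and class comes back as a list.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<article class="post featured">
  <h2 class="post-title">  Scraping <em>101</em> </h2>
  <a class="post-link" href="/posts/scraping-101">Read more</a>
</article>
<article class="post">
  <h2 class="post-title">No link here</h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

posts = soup.select("article.post")           # list of every matching element
first = soup.select_one("article.post")       # first matching element only
missing = posts[1].select_one("a.post-link")  # None: the second post has no link

title_elem = first.select_one("h2.post-title")
joined = title_elem.get_text(strip=True)       # fragments stripped, joined with ""
spaced = title_elem.get_text(" ", strip=True)  # fragments stripped, joined with " "

href = first.select_one("a.post-link").get("href")
classes = first.get("class")  # multi-valued attribute, returned as a list

print(len(posts), missing)
print(repr(joined), repr(spaced))
print(href, classes)
```

When an element contains nested tags, passing a separator such as " " to get_text() usually gives more readable output.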
Your First Scraper: A Complete Working Example
Let us build a scraper that downloads a real-world page (a public domain book listing) and extracts book titles, authors, and links. Here is a complete, runnable example:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
def scrape_books():
    """Scrape a list of books and save to CSV."""
    url = "https://www.gutenberg.org/ebooks/search/?query=shakespeare"
    # Fetch the page
    print(f"Fetching {url}...")
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise an HTTPError on 4xx/5xx responses
    # Parse HTML
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all book elements
    books = soup.select("li.booklink")
    print(f"Found {len(books)} books.")
    # Prepare data for CSV
    data = []
    for book in books:
        # Find elements within this book
        title_elem = book.select_one("span.title")
        author_elem = book.select_one("span.author")
        link_elem = book.select_one("a")
        title = title_elem.get_text(strip=True) if title_elem else "Unknown"
        author = author_elem.get_text(strip=True) if author_elem else "Unknown"
        link = link_elem.get("href") if link_elem else None
        data.append({
            "title": title,
            "author": author,
            "url": link,
            "scraped_at": datetime.now().isoformat()
        })
    # Save to CSV
    if data:
        with open("books.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["title", "author", "url", "scraped_at"])
            writer.writeheader()
            writer.writerows(data)
        print(f"Saved {len(data)} books to books.csv")
    else:
        print("No books found.")
if __name__ == "__main__":
    scrape_books()
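Run this and, provided the page's markup still matches the selectors, you will get a books.csv file with real book data. If it prints "No books found." or fills a column with "Unknown", inspect the page's HTML and adjust the selectors. The key patterns: fetch the page with requests, parse it with BeautifulSoup, navigate the DOM with .select() and .select_one(), extract text with .get_text(), and save the results.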
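Once the file exists, a quick way to sanity-check it is to read it back with the same csv module. This is a minimal sketch that assumes the column names used by the scraper above; load_books and count_by_author are hypothetical helper names.

```python
import csv
from collections import Counter


def load_books(path: str) -> list[dict]:
    """Read the CSV written by the scraper into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_by_author(rows: list[dict]) -> Counter:
    """Count how many rows carry each value in the author column."""
    return Counter(row["author"] for row in rows)
```

For example, count_by_author(load_books("books.csv")).most_common(3) lists the three most frequent author values. A large count for "Unknown" is a hint that the author selector did not match.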
Key Concepts Every Scraper Must Know
| Concept | Meaning | Example |
|---|---|---|
| HTTP Status Code | Server's response to your request (200 = success, 404 = not found, 403 = forbidden) | response.status_code == 200 |
| User-Agent Header | Identifies your program to the server; requests sends a default python-requests value that many sites block | headers={"User-Agent": "Mozilla/5.0"} |
| CSS Selector | Pattern to find HTML elements (e.g., div.article h2) | soup.select("div.article") |
| Timeout | Maximum seconds to wait for a response before giving up | requests.get(url, timeout=10) |
| robots.txt | File that tells scrapers which pages they may access | Check /robots.txt before scraping |
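The robots.txt check in the last row can be automated with the standard library's urllib.robotparser. The sketch below parses an inline, made-up robots.txt so that it runs offline; the user agent name tutorial-bot and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one is served at the site's /robots.txt path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "tutorial-bot"  # hypothetical name for your scraper


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch(USER_AGENT, "https://example.com/blog"))
print(parser.can_fetch(USER_AGENT, "https://example.com/private/report"))
```

Against a live site you would instead call parser.set_url() with the address of the site's robots.txt, then parser.read(), and use can_fetch() the same way.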
Key Takeaways
- Web scraping automates the extraction of data from websites by fetching HTML and parsing it into a usable format.
- The requests library handles HTTP communication; BeautifulSoup parses HTML into a navigable tree.
- Always check HTTP status codes and set timeouts to prevent hanging requests.
- CSS selectors are the primary tool for finding and isolating elements in the DOM.
- Start with simple, single-page scrapers before adding complexity like pagination, rate limiting, or dynamic content.
Frequently Asked Questions
Do I need to install any packages to start scraping?
Yes. Install requests and beautifulsoup4 using pip: pip install requests beautifulsoup4. These two packages cover most static HTML scraping. Later articles will introduce playwright for JavaScript-heavy sites.
Is web scraping legal?
Legality depends on the jurisdiction, the data, and how you use it. Scraping publicly available data is generally lower risk when you respect robots.txt and rate limits and stay within the site's terms of service. Avoid collecting personal data (email addresses, passwords, credit card numbers) and do not redistribute copyrighted content. Always review the site's ToS.
How do I avoid getting blocked by a website?
Use appropriate headers (especially User-Agent), respect rate limiting (wait between requests), and check robots.txt. Many sites block scrapers aggressively; slowing down and identifying yourself as a scraper improves chances of access. Use proxy services or session management if you are scraping large datasets.
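Waiting between requests is easy to get wrong when it is scattered through a script, so one option is to centralize it. The sketch below is one possible approach, not a standard API: Throttle and fetch_all are hypothetical names, and the fetch function is passed in so the example runs without a network connection.

```python
import time
from typing import Callable, Iterable


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


def fetch_all(urls: Iterable[str], fetch: Callable[[str], object], throttle: Throttle) -> list:
    """Call fetch(url) for each URL, pausing between calls as the throttle requires."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With requests, the fetch argument could be a small function that calls requests.get(url, headers={"User-Agent": "..."}, timeout=10), paired with something like Throttle(1.0) for a one-second gap. The right interval depends on the site.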
What is the difference between scraping and using an API?
APIs are official, structured interfaces designed for programmatic access. Scraping parses HTML, which is fragile to redesigns. Use an API if one is available; scrape only when the API is missing, behind a paywall, or requires authentication you lack.
How do I handle pages that load content with JavaScript?
BeautifulSoup only sees the HTML the server returns, so content that JavaScript injects after the page loads is missing from it. Use Playwright (covered in Article 5) to render the page in a headless browser, wait for elements to load, and then parse the fully rendered HTML.
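You can reproduce the problem offline. In the made-up page below, the list is empty in the HTML source and is filled in by a script only when a browser runs it, so a static parse finds the container but no items.

```python
from bs4 import BeautifulSoup

# Made-up page: the list is empty in the source and populated by a script at runtime.
html = """
<html><body>
  <ul id="listings"></ul>
  <script>
    document.getElementById("listings").innerHTML = "<li class='item'>Flat, 2 rooms</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.select_one("#listings")  # the empty <ul> is present
items = soup.select("#listings li.item")  # the script never ran, so no items

print(container is not None, len(items))
```

If a selector that works in your browser's developer tools returns nothing in your scraper, this is a common cause. Comparing response.text with the rendered page usually confirms it.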
Further Reading
- Requests Documentation — Official guide to the requests library for HTTP in Python.
- BeautifulSoup 4 Documentation — Full reference for HTML parsing and DOM navigation.
- Web Scraping with Python (Book) — Ryan Mitchell's comprehensive guide to scraping techniques and best practices.
- robots.txt Specification — Official documentation on respecting site crawling rules.