Robots.txt Compliance and Web Scraping Ethics
Web scraping carries ethical and legal responsibility. Most websites publish a robots.txt file (e.g., example.com/robots.txt) that specifies which paths may be crawled, and most also have Terms of Service that govern automated access. Ignoring either exposes you to legal action, IP bans, and reputational damage. This article teaches you how to parse and respect robots.txt, read Terms of Service, assess whether a site welcomes scraping, and make ethical decisions about data usage. The goal is to build scrapers that extract value while staying within the site's rules.
I once scraped a site's entire customer database without reading their ToS. Within hours, I received a cease-and-desist letter. Since then, I review robots.txt and ToS before every project. Compliance costs nothing and saves you from legal headaches.
Understanding robots.txt Format and Rules
The robots.txt file is a text file in the website root that tells crawlers (including your scraper) which paths they may access. Here is a typical example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Allow: /public/data/
User-agent: Googlebot
Disallow: /private/
Crawl-delay: 10
Request-rate: 30/1h
Key directives:
- User-agent: Which crawlers the rules that follow apply to. * means all; Googlebot is specific.
- Disallow: Paths the crawler should not access. /admin/ disallows everything under /admin/.
- Allow: Paths the crawler may access, even inside an otherwise disallowed directory.
- Crawl-delay: Minimum seconds between requests. This is a non-standard extension (it is not part of RFC 9309); some crawlers honor it and others ignore it.
- Request-rate: Requests per time unit (e.g., 30/1h = 30 requests per hour). Also a non-standard extension.
- Sitemap: URL of the XML sitemap. It is aimed at search engine crawlers, though a scraper can also use it to discover URLs.
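To see how these directives behave in practice, here is a minimal sketch that feeds the sample rules above to urllib.robotparser from a string, with no network access. The helper name load_rules and the crawler name MyBot are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from above, with a blank line between the two groups.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Allow: /public/data/

User-agent: Googlebot
Disallow: /private/
Crawl-delay: 10
Request-rate: 30/1h
"""

def load_rules(text):
    """Build a parser from robots.txt text (hypothetical helper)."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

rp = load_rules(SAMPLE_ROBOTS)

# "MyBot" matches no named group, so the "*" group applies to it.
print(rp.can_fetch("MyBot", "/admin/users"))        # False
print(rp.can_fetch("MyBot", "/public/data/a.csv"))  # True

# Googlebot has its own group, which replaces the "*" group for that crawler.
print(rp.can_fetch("Googlebot", "/admin/users"))    # True
print(rp.crawl_delay("Googlebot"))                  # 10

# urllib.robotparser only reads Request-rate written as two integers
# (requests/seconds), so the "30/1h" form yields no rate here.
print(rp.request_rate("Googlebot"))                 # None
```

Two things stand out: a crawler that has its own group is not bound by the * group, and Python's parser would need the rate written as 30/3600 to pick it up.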
Parsing robots.txt in Python
Python has a built-in urllib.robotparser module to parse and check rules:
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def check_robots_txt(domain, user_agent="MyBot/1.0"):
    """Load a site's robots.txt and report which sample paths may be fetched."""
    # Create a RobotFileParser instance
    rp = RobotFileParser()

    # Load robots.txt from the domain
    robots_url = urljoin(domain, "/robots.txt")
    print(f"Fetching {robots_url}")
    rp.set_url(robots_url)
    try:
        # read() handles HTTP errors itself (a 404 allows everything, a
        # 401/403 disallows everything) and raises on network failures
        rp.read()
    except Exception as e:
        print(f"Could not read robots.txt: {e}")
        # The rules are unknown, so return None and let the caller decide
        return None

    # Check specific paths
    paths_to_check = [
        "/",
        "/blog/",
        "/blog/post-1",
        "/admin/",
        "/api/users",
    ]
    for path in paths_to_check:
        url = urljoin(domain, path)
        can_fetch = rp.can_fetch(user_agent, url)
        status = "ALLOWED" if can_fetch else "DISALLOWED"
        print(f"{path}: {status}")

    # Report any Crawl-delay or Request-rate set for this user agent
    delay = rp.crawl_delay(user_agent)
    if delay:
        print(f"  Crawl delay: {delay} seconds")
    rate = rp.request_rate(user_agent)
    if rate:
        print(f"  Request rate: {rate.requests} requests per {rate.seconds} seconds")
    return rp

# Usage
domain = "https://example.com"
rp = check_robots_txt(domain, user_agent="MyBot/1.0")
Example output (illustrative; it assumes a site whose robots.txt disallows /admin/ and /api/):
Fetching https://example.com/robots.txt
/: ALLOWED
/blog/: ALLOWED
/blog/post-1: ALLOWED
/admin/: DISALLOWED
/api/users: DISALLOWED
Always parse robots.txt before scraping. Even if a path is not explicitly disallowed, the robots.txt may recommend a crawl delay.
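One way to turn a recommended delay into a number your scraper can use is sketched below. The function name effective_delay and the 2-second floor are assumptions, not part of urllib.robotparser; the sketch takes the strictest of Crawl-delay, Request-rate, and your own default.

```python
from urllib.robotparser import RobotFileParser

def effective_delay(rp, user_agent, default=2.0):
    """Return the seconds to wait between requests for this user agent.

    Hypothetical helper: picks the longest of the caller's default,
    the Crawl-delay value, and the interval implied by Request-rate.
    """
    candidates = [float(default)]

    delay = rp.crawl_delay(user_agent)
    if delay is not None:
        candidates.append(float(delay))

    rate = rp.request_rate(user_agent)
    if rate is not None and rate.requests > 0:
        # "N requests per S seconds" means one request every S/N seconds
        candidates.append(rate.seconds / rate.requests)

    return max(candidates)

# Example: rules given as text, so nothing is fetched from the network.
strict = RobotFileParser()
strict.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Request-rate: 1/10",
    "Disallow: /tmp/",
])
print(effective_delay(strict, "MyBot/1.0"))  # 10.0

relaxed = RobotFileParser()
relaxed.parse(["User-agent: *", "Disallow: /tmp/"])
print(effective_delay(relaxed, "MyBot/1.0"))  # 2.0
```

Taking the maximum means robots.txt can slow the scraper down but never speed it up past your own default.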
Implementing robots.txt Compliance in Your Scraper
Here is a scraper class that respects robots.txt:
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin
import time

import requests
from bs4 import BeautifulSoup

class EthicalScraper:
    def __init__(self, domain, user_agent="MyBot/1.0"):
        self.domain = domain
        self.user_agent = user_agent
        self.session = requests.Session()
        self.last_request_time = None
        self.min_delay = 2  # Default: 2 seconds between requests

        # Parse robots.txt
        self.rp = RobotFileParser()
        robots_url = urljoin(domain, "/robots.txt")
        self.rp.set_url(robots_url)
        try:
            self.rp.read()
            print(f"Successfully loaded robots.txt from {robots_url}")
        except Exception as e:
            print(f"Could not load robots.txt: {e}. Proceeding with caution.")
            # The rules are unknown, so can_fetch() will refuse every URL
            self.rp = None

        # Use the delay robots.txt asks for when it is longer than the default
        if self.rp:
            delay = self.rp.crawl_delay(user_agent)
            if delay:
                self.min_delay = max(self.min_delay, delay)
            rate = self.rp.request_rate(user_agent)
            if rate and rate.requests:
                # Convert "N requests per S seconds" to seconds per request
                self.min_delay = max(self.min_delay, rate.seconds / rate.requests)
            print(f"Waiting at least {self.min_delay:.2f}s between requests")

    def can_fetch(self, url):
        """Check if robots.txt allows fetching this URL."""
        if not self.rp:
            # robots.txt could not be loaded; refuse rather than guess
            print(f"robots.txt rules unknown, refusing: {url}")
            return False
        can = self.rp.can_fetch(self.user_agent, url)
        if not can:
            print(f"robots.txt disallows: {url}")
        return can

    def fetch(self, url):
        """Fetch a URL, respecting robots.txt and rate limits."""
        # Check robots.txt
        if not self.can_fetch(url):
            raise PermissionError(f"Not permitted to fetch: {url}")

        # Rate limiting (monotonic clock is unaffected by system time changes)
        if self.last_request_time is not None:
            elapsed = time.monotonic() - self.last_request_time
            sleep_time = self.min_delay - elapsed
            if sleep_time > 0:
                print(f"Waiting {sleep_time:.2f}s to respect rate limit...")
                time.sleep(sleep_time)
        self.last_request_time = time.monotonic()

        # Fetch
        headers = {"User-Agent": self.user_agent}
        print(f"Fetching {url}")
        response = self.session.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response

# Usage
scraper = EthicalScraper("https://example.com", user_agent="MyBot/1.0")
urls = [
    "https://example.com/blog",
    "https://example.com/blog/post-1",
    # "https://example.com/admin/",  # This would be blocked by robots.txt
]
for url in urls:
    try:
        response = scraper.fetch(url)
        soup = BeautifulSoup(response.text, "html.parser")
        print("  Extracted data\n")
    except PermissionError as e:
        print(f"  Blocked: {e}\n")
    except requests.RequestException as e:
        print(f"  Request failed: {e}\n")
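RFC 9309 distinguishes a robots.txt that does not exist from one that cannot be reached, and the two call for opposite defaults. The sketch below maps an HTTP status code to a rule set on that basis. The function name rules_for_response is an assumption, and fetching the file is left to the caller.

```python
from urllib.robotparser import RobotFileParser

def rules_for_response(status, body=""):
    """Map a robots.txt HTTP response to a parser (hypothetical helper).

    Follows the RFC 9309 defaults:
      2xx -> parse the body
      4xx -> "unavailable": no rules exist, so nothing is disallowed
      anything else (5xx, unexpected codes) -> assume everything is disallowed
    """
    rp = RobotFileParser()
    if 200 <= status < 300:
        rp.parse(body.splitlines())
    elif 400 <= status < 500:
        rp.parse([])  # empty rule set: every path is allowed
    else:
        rp.parse(["User-agent: *", "Disallow: /"])
    return rp

# A missing file places no restrictions; a failing server blocks everything.
print(rules_for_response(404).can_fetch("MyBot/1.0", "/blog/"))  # True
print(rules_for_response(503).can_fetch("MyBot/1.0", "/blog/"))  # False

ok = rules_for_response(200, "User-agent: *\nDisallow: /admin/")
print(ok.can_fetch("MyBot/1.0", "/admin/"))  # False
print(ok.can_fetch("MyBot/1.0", "/blog/"))   # True
```

Note that RobotFileParser.read() applies its own policy and is stricter in one case: it treats 401 and 403 as "disallow all" rather than "allow all".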
Understanding Terms of Service and Legal Boundaries
robots.txt covers technical crawling rules; the Terms of Service (ToS) governs legal rights. Always review the ToS before scraping:
def review_terms_of_service(domain):
    """Check common ToS clauses that restrict scraping."""
    print(f"CHECKLIST FOR {domain}")
    print("=" * 50)
    checks = {
        "1. Visit /terms or /legal page": "Look for ToS text",
        "2. Search for 'scraping' in ToS": "Does ToS prohibit scraping?",
        "3. Search for 'automated access' in ToS": "Are bots explicitly forbidden?",
        "4. Search for 'copy' or 'reproduce' restrictions": "Can you reproduce content?",
        "5. Search for 'commercial use' restrictions": "Can you use data commercially?",
        "6. Check for copyright notice": "Is content copyrighted?",
        "7. Look for GDPR/privacy terms": "Is personal data protected?",
        "8. Check /robots.txt": "Are crawling rules specified?",
        "9. Look for API documentation": "Does the site provide an API?",
        "10. Email admin@domain for permission": "Ask if uncertain",
    }
    for check, detail in checks.items():
        print(f"{check}")
        print(f"  -> {detail}")
        print()

# Usage
review_terms_of_service("example.com")
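The keyword searches in that checklist can be partly automated once you have the ToS text in hand. The sketch below is only a first-pass filter that surfaces sentences for a human to read; the pattern list and the name flag_tos_clauses are assumptions, and a clean result does not mean the ToS permits scraping.

```python
import re

# Hypothetical keyword patterns; extend them for the sites you work with.
TOS_PATTERNS = {
    "scraping": r"\bscrap(?:e|es|ed|ing|ers?)\b",
    "automated access": r"\b(?:automated|robots?|bots?|spiders?|crawlers?)\b",
    "reproduction": r"\b(?:copy|copies|reproduce|redistribut\w*)\b",
    "commercial use": r"\bcommercial\b",
}

def flag_tos_clauses(tos_text):
    """Return {topic: [sentences mentioning it]} for manual review."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", tos_text.strip())
    hits = {}
    for topic, pattern in TOS_PATTERNS.items():
        matched = [s for s in sentences if re.search(pattern, s, re.IGNORECASE)]
        if matched:
            hits[topic] = matched
    return hits

sample_tos = (
    "You may view pages for personal use. "
    "You may not use robots or scrapers to access the service. "
    "Commercial use requires a license."
)
for topic, sentences in flag_tos_clauses(sample_tos).items():
    print(f"{topic}: {sentences}")
```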
Key legal considerations:
- Personal data (GDPR, CCPA): Scraping email addresses, phone numbers, or other PII without consent or another lawful basis can violate privacy law.
- Copyrighted content: Reproducing full articles or creative works without permission can be copyright infringement.
- Commercial use: Some ToS allow scraping for personal use but prohibit commercial resale of data.
- Rate limits: Overwhelming a server with requests may violate the Computer Fraud and Abuse Act (CFAA) in the US.
- Authentication bypass: Scraping behind a login wall (without authorization) is illegal in most jurisdictions.
When to Scrape and When to Seek Permission
Use this decision tree:
def should_scrape(domain, data_type, use_case):
    """Assess whether scraping is appropriate."""
    print(f"SCRAPING ASSESSMENT FOR {domain}")
    print(f"Data type: {data_type}")
    print(f"Use case: {use_case}")
    print()

    # Check 1: Is there an API?
    print("1. Does the site have a public API?")
    print("  -> If yes, use the API instead. It's faster, more reliable, and legal.")
    print()

    # Check 2: Does robots.txt allow it?
    print("2. Does robots.txt allow crawling?")
    print("  -> If no, you should not scrape (unless you have explicit permission).")
    print()

    # Check 3: Does ToS allow it?
    print("3. Does ToS allow automated access?")
    print("  -> If no or unclear, request permission from the site owner.")
    print()

    # Check 4: Is the data public?
    print("4. Is the data publicly visible to non-authenticated users?")
    print("  -> If no, scraping is likely illegal.")
    print()

    # Check 5: What will you do with the data?
    print(f"5. Will you {use_case}?")
    if "commercial" in use_case.lower():
        print("  -> Commercial use may violate ToS. Request permission.")
    elif "personal" in use_case.lower():
        print("  -> Personal use is often tolerated if you respect rate limits.")
    else:
        print("  -> Clarify your intent before proceeding.")
    print()

    # Final recommendation
    print("RECOMMENDATION:")
    print("- If data has an API: use API.")
    print("- If robots.txt disallows: request permission or don't scrape.")
    print("- If ToS unclear: email the admin for permission.")
    print("- If scraping: respect rate limits, use User-Agent header, follow robots.txt.")

# Usage
should_scrape(
    "example.com",
    data_type="Product listings",
    use_case="Commercial price comparison site"
)
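The same checks can also be expressed as a function that returns a recommendation instead of printing prompts, which makes the order of the checks explicit. This is a sketch under assumptions: the name scraping_decision, its boolean inputs, and the returned strings are all illustrative, and you still have to answer each question yourself.

```python
def scraping_decision(has_api, robots_allows, tos_allows, is_public, commercial):
    """Return a short recommendation from the five checks, in order.

    tos_allows may be True, False, or None (None means the ToS is unclear).
    """
    if has_api:
        return "use the API"
    if not robots_allows:
        return "request permission or do not scrape"
    if tos_allows is not True:
        return "request permission"
    if not is_public:
        return "do not scrape"
    if commercial:
        return "request permission"
    return "scrape, respecting rate limits and robots.txt"

# The commercial price comparison case from the usage example above,
# assuming no API, permissive robots.txt and ToS, and public data.
print(scraping_decision(
    has_api=False,
    robots_allows=True,
    tos_allows=True,
    is_public=True,
    commercial=True,
))  # request permission
```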
Ethical Scraping Checklist
Before launching any scraper, verify:
- robots.txt does not explicitly disallow the paths.
- ToS does not prohibit automated access or your intended use case.
- You are not scraping personal data (PII, emails, phone numbers).
- You are not scraping copyrighted content for redistribution.
- You are respecting rate limits (minimum 1-2 second delays).
- You are setting a descriptive User-Agent header (not lying about your identity).
- You are not bypassing authentication or paywalls.
- You have considered whether an API exists.
- You are prepared to stop if the site sends a cease-and-desist letter.
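For the User-Agent item, a common convention is a product token and version followed by a way to reach you. The sketch below builds such a string and shows that urllib.robotparser matches rules against the token before the slash. The helper build_user_agent, the info URL, and the contact address are placeholders, not real endpoints.

```python
from urllib.robotparser import RobotFileParser

def build_user_agent(name, version, info_url, contact):
    """Build a descriptive User-Agent string (hypothetical helper)."""
    return f"{name}/{version} (+{info_url}; {contact})"

ua = build_user_agent(
    "MyBot", "1.0",
    "https://example.com/bot",   # placeholder page describing the bot
    "bot-admin@example.com",     # placeholder contact address
)
print(ua)  # MyBot/1.0 (+https://example.com/bot; bot-admin@example.com)

# A site can target the bot by name; the parser compares the rule's
# User-agent value with the part of the string before the first "/".
rp = RobotFileParser()
rp.parse(["User-agent: MyBot", "Disallow: /drafts/"])
print(rp.can_fetch(ua, "/drafts/post"))  # False
print(rp.can_fetch(ua, "/blog/"))        # True
```

A stable, honest name gives site owners a way to write rules for your scraper or contact you instead of blocking your IP range.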
Key Takeaways
- Always parse and respect robots.txt before scraping; use urllib.robotparser.
- Review the site's Terms of Service to understand what scraping is permitted.
- Do not scrape personal data (PII) without explicit consent.
- Do not reproduce copyrighted content without permission.
- When in doubt, email the site admin and ask for permission.
- Ethical scraping is sustainable scraping; respect the site's resources and rules.
Frequently Asked Questions
What if a site does not have robots.txt?
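A missing robots.txt (a 404 response) places no crawl restrictions on you: both RFC 9309 and urllib.robotparser treat it as "allow all". It is still not an invitation to scrape freely. Keep a conservative 2-3 second delay between requests, and check the ToS before you start.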
Can I scrape a site that has "data scrapers beware" in robots.txt?
Technically yes, but the site is explicitly warning you. If you proceed, you accept the legal and technical risk. Better approach: email the admin for permission or use their API.
Is scraping behind a login wall illegal?
Yes, in most jurisdictions. If a site requires authentication, scraping the authenticated content without permission can violate the Computer Fraud and Abuse Act (US) and similar laws elsewhere.
What should I do if I receive a cease-and-desist letter?
Stop immediately. Contact a lawyer. Do not continue scraping, even from a different IP. Defying a cease-and-desist can result in civil or criminal liability.
Can I scrape for academic research?
Academic research is a gray area. Check the site's ToS, any data-access program it offers to researchers, and your own institution's requirements (e.g., Institutional Review Board approval). When in doubt, ask the site admin. Many sites grant permission for academic use.
Further Reading
- robots.txt Specification — Official standard and best practices for robots.txt.
- urllib.robotparser Documentation — Python reference for parsing robots.txt.
- Web Scraping and the Law (Stanford Internet Observatory) — Legal analysis of scraping in US and international law.
- GDPR and Web Scraping — European Union personal data protection rules.