Skip to main content

Store Scraped Data: CSV, JSON, and Databases

Scraping data is only half the battle; storing it reliably is the other half. You must choose a storage format (CSV, JSON, database), handle duplicates, validate schema, and prepare for failures mid-scrape. This article covers writing scraped data to CSV and JSON files, persisting to SQLite for structured queries, implementing deduplication to prevent duplicate records, and batch operations for efficient writes. You will build storage patterns that survive crashes, handle schema changes, and scale from hundreds to millions of records. By the end, your scraper will save data reliably with no data loss.

I once lost 10,000 records when my scraper crashed before writing to disk. Since then, I flush data to disk regularly, use transactions for database writes, and deduplicate aggressively. Reliable storage is as important as extraction.

Writing to CSV Files

CSV (Comma-Separated Values) is the simplest and most portable format. Python's csv module handles encoding and quoting:

import csv
from datetime import datetime

# Sample data (list of dictionaries)
data = [
{"title": "Laptop", "price": "$999", "url": "https://example.com/1"},
{"title": "Phone", "price": "$799", "url": "https://example.com/2"},
{"title": "Tablet", "price": "$599", "url": "https://example.com/3"},
]

# Write to CSV
def write_csv(filename, data, fieldnames=None):
"""Write data to CSV file."""

if not data:
print("No data to write")
return

# Infer fieldnames from first record if not provided
if not fieldnames:
fieldnames = list(data[0].keys())

try:
with open(filename, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader() # Write column names
writer.writerows(data) # Write rows

print(f"Wrote {len(data)} records to {filename}")

except IOError as e:
print(f"Error writing to {filename}: {e}")

# Append to existing CSV
def append_csv(filename, data, fieldnames=None):
"""Append data to existing CSV file."""

import os

if not data:
return

if not fieldnames:
fieldnames = list(data[0].keys())

file_exists = os.path.exists(filename)

try:
with open(filename, "a", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)

# Write header only if file is new
if not file_exists:
writer.writeheader()

writer.writerows(data)

print(f"Appended {len(data)} records to {filename}")

except IOError as e:
print(f"Error appending to {filename}: {e}")

# Usage
write_csv("products.csv", data)
append_csv("products.csv", [{"title": "Monitor", "price": "$399", "url": "https://example.com/4"}])

CSV pros and cons:

ProsCons
Human-readableNo schema enforcement
Portable (Excel, Google Sheets)Fragile to special characters (quotes, commas)
Fast to writeNo relational queries
Small file sizeDifficult to update individual records

Writing to JSON Files

JSON preserves data types and handles nesting. Use json module for single writes or streaming for large datasets:

import json
import os

# Sample data
data = [
{"id": 1, "title": "Laptop", "price": 999, "tags": ["electronics", "computers"]},
{"id": 2, "title": "Phone", "price": 799, "tags": ["electronics", "mobile"]},
{"id": 3, "title": "Tablet", "price": 599, "tags": ["electronics", "mobile"]},
]

# Write to JSON
def write_json(filename, data, pretty=True):
"""Write data to JSON file."""

try:
with open(filename, "w", encoding="utf-8") as f:
if pretty:
json.dump(data, f, indent=2, ensure_ascii=False)
else:
json.dump(data, f, ensure_ascii=False)

print(f"Wrote {len(data)} records to {filename}")

except IOError as e:
print(f"Error writing to {filename}: {e}")

# Read from JSON
def read_json(filename):
"""Read data from JSON file."""

try:
with open(filename, "r", encoding="utf-8") as f:
data = json.load(f)

print(f"Read {len(data)} records from {filename}")
return data

except (IOError, json.JSONDecodeError) as e:
print(f"Error reading {filename}: {e}")
return []

# Append to JSON (read, append, write)
def append_json(filename, new_records):
"""Append records to JSON file."""

# Read existing data
existing = read_json(filename) if os.path.exists(filename) else []

# Append new records
existing.extend(new_records)

# Write back
write_json(filename, existing)

# Usage
write_json("products.json", data, pretty=True)
new_products = [{"id": 4, "title": "Monitor", "price": 399, "tags": ["electronics"]}]
append_json("products.json", new_products)

# Read back
products = read_json("products.json")
for product in products:
print(f"{product['title']}: ${product['price']}")

JSON pros and cons:

ProsCons
Human-readableSlower than CSV for large files
Preserves data typesRequires valid UTF-8
Supports nestingAppending requires read-modify-write
API-friendlyLarge JSON files can be memory-intensive

SQLite: Structured Storage with Queries

SQLite is a lightweight database (single file, no server) perfect for scrapers. Use it for deduplication, filtering, and complex queries:

import sqlite3
from datetime import datetime

class ScraperDatabase:
def __init__(self, db_file):
self.db_file = db_file
self.conn = sqlite3.connect(db_file)
self.conn.row_factory = sqlite3.Row # Access columns by name
self.create_table()

def create_table(self):
"""Create the products table if it doesn't exist."""

cursor = self.conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
title TEXT NOT NULL,
price REAL,
tags TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
self.conn.commit()
print(f"Database table ready: {self.db_file}")

def insert_or_update(self, records):
"""Insert records, updating if URL already exists (upsert)."""

cursor = self.conn.cursor()
inserted = 0
updated = 0

for record in records:
try:
cursor.execute("""
INSERT INTO products (url, title, price, tags)
VALUES (?, ?, ?, ?)
""", (
record["url"],
record["title"],
record.get("price"),
record.get("tags")
))
inserted += 1

except sqlite3.IntegrityError:
# URL already exists; update it
cursor.execute("""
UPDATE products
SET title = ?, price = ?, tags = ?, updated_at = CURRENT_TIMESTAMP
WHERE url = ?
""", (
record["title"],
record.get("price"),
record.get("tags"),
record["url"]
))
updated += 1

self.conn.commit()
print(f"Inserted {inserted}, updated {updated} records")

def get_all(self):
"""Retrieve all products."""

cursor = self.conn.cursor()
cursor.execute("SELECT * FROM products ORDER BY scraped_at DESC")
return [dict(row) for row in cursor.fetchall()]

def get_recent(self, days=1):
"""Get products scraped in the last N days."""

cursor = self.conn.cursor()
cursor.execute("""
SELECT * FROM products
WHERE scraped_at > datetime('now', '-' || ? || ' days')
ORDER BY scraped_at DESC
""", (days,))
return [dict(row) for row in cursor.fetchall()]

def get_by_price_range(self, min_price, max_price):
"""Find products within a price range."""

cursor = self.conn.cursor()
cursor.execute("""
SELECT * FROM products
WHERE price BETWEEN ? AND ?
ORDER BY price ASC
""", (min_price, max_price))
return [dict(row) for row in cursor.fetchall()]

def count(self):
"""Get total product count."""

cursor = self.conn.cursor()
cursor.execute("SELECT COUNT(*) as count FROM products")
return cursor.fetchone()["count"]

def close(self):
"""Close the database connection."""
self.conn.close()

# Usage
db = ScraperDatabase("products.db")

# Insert new records
records = [
{"url": "https://example.com/1", "title": "Laptop", "price": 999, "tags": "electronics,computers"},
{"url": "https://example.com/2", "title": "Phone", "price": 799, "tags": "electronics,mobile"},
{"url": "https://example.com/3", "title": "Tablet", "price": 599, "tags": "electronics,mobile"},
]
db.insert_or_update(records)

# Query
print(f"Total products: {db.count()}")

print("\nAll products:")
for product in db.get_all():
print(f" {product['title']}: ${product['price']}")

print("\nProducts $500-$999:")
for product in db.get_by_price_range(500, 999):
print(f" {product['title']}: ${product['price']}")

db.close()

SQLite pros and cons:

ProsCons
Schema enforcementMore setup than CSV/JSON
ACID transactionsSingle-file format less portable
Efficient deduplicationRequires SQL knowledge
Complex queries (JOINs, aggregations)Not suitable for hierarchical data
Scales to millions of recordsSlower reads than in-memory data

Deduplication Strategies

Prevent duplicate records using unique constraints or checks:

import sqlite3

def deduplicate_by_url(db_file):
"""Remove duplicate records by keeping newest."""

conn = sqlite3.connect(db_file)
cursor = conn.cursor()

# Find and remove duplicates (keep the most recently updated)
cursor.execute("""
DELETE FROM products
WHERE id NOT IN (
SELECT MAX(id)
FROM products
GROUP BY url
)
""")

deleted = cursor.rowcount
conn.commit()
print(f"Deleted {deleted} duplicate records")
conn.close()

def deduplicate_json(filename):
"""Remove duplicates from JSON file by URL."""

import json

with open(filename, "r", encoding="utf-8") as f:
data = json.load(f)

# Track seen URLs
seen_urls = set()
unique_data = []

for record in data:
url = record.get("url")
if url not in seen_urls:
seen_urls.add(url)
unique_data.append(record)

# Write deduplicated data
with open(filename, "w", encoding="utf-8") as f:
json.dump(unique_data, f, indent=2, ensure_ascii=False)

print(f"Removed {len(data) - len(unique_data)} duplicates. Kept {len(unique_data)} records.")

# Usage
deduplicate_by_url("products.db")

Batch Operations for Efficient Writes

Write in batches rather than one record at a time to improve performance:

import sqlite3

def scrape_with_batching(start_url, batch_size=100):
"""Scrape and write to database in batches."""

db = sqlite3.connect("scraped_data.db")
cursor = db.cursor()

batch = []
page = 1

while page <= 10: # Example: 10 pages
# Scrape a page (pseudo-code)
records = scrape_page(start_url, page)

batch.extend(records)

# Write batch when size reached
if len(batch) >= batch_size:
cursor.executemany("""
INSERT OR REPLACE INTO products (url, title, price)
VALUES (?, ?, ?)
""", [(r["url"], r["title"], r["price"]) for r in batch])
db.commit()
print(f"Wrote batch of {len(batch)} records")
batch = []

page += 1

# Write remaining records
if batch:
cursor.executemany("""
INSERT OR REPLACE INTO products (url, title, price)
VALUES (?, ?, ?)
""", [(r["url"], r["title"], r["price"]) for r in batch])
db.commit()
print(f"Wrote final batch of {len(batch)} records")

db.close()

def scrape_page(url, page):
"""Placeholder for page scraping."""
return [
{"url": f"https://example.com/{i}", "title": f"Product {i}", "price": 100 + i * 10}
for i in range(20)
]

Batching reduces disk I/O and speeds up writes by 10-100x compared to per-record writes.

Key Takeaways

  • CSV is simplest for small datasets; JSON preserves data types; SQLite handles deduplication and complex queries.
  • Always validate data before writing (check for required fields, data types).
  • Use SQLite for production scrapers; implement UNIQUE constraints to prevent duplicates.
  • Batch writes (100-1000 records at a time) for efficiency.
  • Write to disk regularly (every batch) to prevent data loss if scraper crashes.

Frequently Asked Questions

What format should I choose for my scraper output?

CSV for spreadsheets and exports; JSON for APIs and structured data; SQLite for deduplication, queries, and production systems. Most projects use SQLite for scraping and export to CSV/JSON for sharing.

How do I handle schema changes (adding a new field)?

In SQLite, add a column: ALTER TABLE products ADD COLUMN new_field TEXT DEFAULT ''. In CSV/JSON, just start including the field in new records; re-process old records to backfill if needed.

Can I append to a SQLite database safely?

Yes. Use transactions (BEGIN; ... COMMIT;) to ensure atomicity. If the scraper crashes mid-transaction, the database rolls back automatically.

How large can a SQLite database get?

SQLite can handle databases up to 281 TB on systems with 64-bit long integers. For most scrapers, 10GB+ is more than enough. Index frequently-queried columns for speed.

Should I deduplicate before or after writing?

Both. Use UNIQUE constraints on URLs (preventing inserts) and periodically run deduplication to remove historical duplicates. Deduplicate before exporting to CSV/JSON.

Further Reading