Advanced SQLAlchemy Queries: Joins, Aggregates & Subqueries
Complex queries combining multiple tables, aggregations, and subqueries are essential for data analysis and reporting. SQLAlchemy provides a powerful API to build these queries without writing raw SQL, offering both readability and safety.
Understanding Joins
A join combines rows from two or more tables based on related columns. SQLAlchemy supports INNER JOIN (default), LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. Use .join() to combine tables and query through relationships:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from models import Author, Book
from sqlalchemy.orm import joinedload
engine = create_engine("sqlite:///library.db")
with Session(engine) as session:
# INNER JOIN: only authors with at least one book
authors_with_books = session.query(Author).join(Author.books).all()
# LEFT OUTER JOIN: all authors, with books if they have any
all_authors = session.query(Author).outerjoin(Author.books).all()
# Join through a foreign key column
books_by_austen = session.query(Book).join(Author).filter(
Author.name == "Jane Austen"
).all()
The .join() method (default INNER JOIN) fetches only authors with books. The .outerjoin() method (LEFT OUTER JOIN) fetches all authors, even those without books. After a join, you can filter on either table's columns.
Aggregate Functions
Aggregates like COUNT, SUM, AVG, MIN, and MAX summarize data across rows. Use func from SQLAlchemy to call database functions:
from sqlalchemy import func
with Session(engine) as session:
# Count total books
total_books = session.query(func.count(Book.id)).scalar()
# Sum a numeric column
total_sales = session.query(func.sum(Book.price)).scalar()
# Average rating
avg_rating = session.query(func.avg(Book.rating)).scalar()
# Min and max values
min_price = session.query(func.min(Book.price)).scalar()
max_price = session.query(func.max(Book.price)).scalar()
# Get both count and average in one query
stats = session.query(
func.count(Book.id).label("total"),
func.avg(Book.price).label("avg_price")
).one()
print(f"Total books: {stats.total}, Average price: ${stats.avg_price:.2f}")
The func object provides database functions (COUNT, SUM, etc.). Use .scalar() to fetch a single aggregated value. Use .label() to name the result for easier access.
GROUP BY and Aggregation
Group rows by a column and aggregate within each group:
# Count books per author
author_book_counts = session.query(
Author.name,
func.count(Book.id).label("book_count")
).join(Author.books).group_by(Author.id).all()
for name, count in author_book_counts:
print(f"{name}: {count} books")
# Get authors with more than 5 books
prolific = session.query(
Author.name,
func.count(Book.id).label("book_count")
).join(Author.books).group_by(Author.id).having(
func.count(Book.id) > 5
).all()
# Average price per genre
genre_prices = session.query(
Book.genre,
func.avg(Book.price).label("avg_price")
).group_by(Book.genre).all()
The .group_by() method collects rows by a column. The .having() method filters grouped results (like WHERE for groups). Without .group_by(), aggregates operate on the entire result set.
Subqueries
A subquery is a query nested inside another query. Use subqueries for complex filtering, such as finding records matching a derived value:
from sqlalchemy.orm import aliased
# Find books with above-average price
average_price_subquery = session.query(func.avg(Book.price)).scalar_subquery()
expensive_books = session.query(Book).filter(
Book.price > average_price_subquery
).all()
# Find authors with book count above the median
# (Subquery to get the median)
book_counts = session.query(
Author.id,
func.count(Book.id).label("count")
).join(Author.books).group_by(Author.id).subquery()
median_subquery = session.query(
func.avg(book_counts.c.count)
).scalar_subquery()
prolific_authors = session.query(Author).join(
book_counts, Author.id == book_counts.c.id
).filter(book_counts.c.count > median_subquery).all()
The .subquery() method converts a query into a subquery object. .scalar_subquery() wraps a single-value subquery for use in comparisons. Subqueries are slower than equivalent joins but are more readable for complex logic.
Multiple Joins and Complex Filters
Combine multiple joins to query through multiple relationships:
# Find books published by authors from a specific country
# Assuming Author has a country column
uk_author_books = session.query(Book).join(
Author, Book.author_id == Author.id
).filter(Author.country == "UK").all()
# Many-to-many join: find authors who have written in multiple genres
author_genres = session.query(
Author.name,
func.count(func.distinct(Book.genre)).label("genre_count")
).join(Book).group_by(Author.id).having(
func.count(func.distinct(Book.genre)) > 1
).all()
Chain .join() calls to traverse multiple tables. Use func.distinct() inside aggregates to count unique values.
Window Functions (Advanced)
Window functions compute values across a set of rows without grouping them. Use them for ranking, running totals, and comparisons:
from sqlalchemy import func, over
# Rank books by price within each genre
rank_in_genre = session.query(
Book.title,
Book.genre,
Book.price,
func.rank().over(
partition_by=Book.genre,
order_by=Book.price.desc()
).label("rank")
).all()
for title, genre, price, rank in rank_in_genre:
print(f"{title} ({genre}): ${price} - Rank {rank}")
Window functions are powerful but require database support (PostgreSQL, MySQL 8.0+, not SQLite). See your database docs for available functions.
UNION Queries
Combine results from multiple queries:
from sqlalchemy import union
# Find all books authored by "Austen" OR published before 1950
old_or_austen = union(
session.query(Book).join(Author).filter(Author.name == "Jane Austen"),
session.query(Book).filter(Book.published_year < 1950)
).all()
.union() removes duplicates; use .union_all() to keep them.
Performance Considerations
Complex queries can be slow. Use EXPLAIN to inspect the generated SQL:
# See the SQL generated by SQLAlchemy (with echo=True in engine)
query = session.query(Author).join(Author.books).filter(...)
print(query) # Shows the SQL
# Use only() to select specific columns (reduces memory/network)
lightweight_authors = session.query(Author.id, Author.name).all()
# Use exists() for existence checks instead of fetching full rows
from sqlalchemy import exists
has_books = session.query(
exists().where(Book.author_id == Author.id)
).filter(Author.id == 1).scalar()
Key Takeaways
.join()(INNER) and.outerjoin()(LEFT OUTER) combine tables through relationships.- Aggregate functions (
func.count(),func.sum(),func.avg()) summarize data. .group_by()groups rows;.having()filters groups.- Subqueries nest one query inside another for complex filtering.
- Window functions compute values across rows without grouping (advanced, database-specific).
Frequently Asked Questions
When should I use a join vs. a subquery?
Joins are faster and more straightforward when combining tables on a relationship. Subqueries are useful for derived values (averages, medians, complex logic). Joins are generally preferred when possible.
What's the difference between .join() and .outerjoin()?
.join() (INNER) returns only rows with matches on both sides. .outerjoin() (LEFT OUTER) returns all rows from the left table, even without matches. Use INNER when both tables must have related records; use LEFT OUTER when you want all left records regardless.
How do I count distinct values?
Use func.count(func.distinct(column)):
unique_authors = session.query(
func.count(func.distinct(Book.author_id))
).scalar()
Can I aggregate without grouping?
Yes. Aggregates without .group_by() operate on the entire result set:
total_books = session.query(func.count(Book.id)).scalar()
What's the difference between .one(), .first(), and .scalar()?
.one() expects exactly one result and raises an error otherwise. .first() returns one result or None. .scalar() returns one column of one row (for single aggregates).