Pandas (Part 5): Grouping and aggregation
After you have cleaned rows and engineered columns (Part 4: Data cleaning), analysis usually turns toward patterns within groups: sales by region, error rates by server, averages by cohort. Pandas expresses that workflow with groupby, which splits data, applies a function, then combines results.
📚 Prerequisites
- Comfortable with Pandas Series and DataFrames.
- Completed the earlier lessons on indexing and preparation in this series.
🎯 What you'll master
- Use
groupbyto split a DataFrame using one or more key columns. - Apply aggregations (
sum,mean,count) and inspect the resulting shape. - Work with
aggfor multiple summaries per column andtransformwhen you need one value per row. - Pivot-style summaries with
pivot_tableat a beginner-friendly depth.
🧠 The split–apply–combine mental model
- Split rows into buckets keyed by categorical columns (
region,product_id). - Apply a reducing function per bucket—totals, means, custom lambdas.
- Combine bucket outputs into a new table or series.
import pandas as pd
df = pd.DataFrame(
{
"region": ["East", "East", "West", "West"],
"product": ["A", "B", "A", "B"],
"revenue": [10, 20, 15, 5],
}
)
df.groupby("region")["revenue"].sum()
The result is a Series indexed by region with summed revenue.
Multiple keys and multiple columns
df.groupby(["region", "product"])["revenue"].mean()
When you aggregate several value columns, select them before or after groupby:
df.groupby("region")[["revenue"]].agg(["sum", "mean"])
Named aggregations with agg
agg keeps intent readable when each column needs different reductions:
df.groupby("region").agg(total=("revenue", "sum"), avg=("revenue", "mean"))
You can also pass a dict of column → list of functions when you prefer compact syntax.
transform for per-row values
Sometimes you need a column aligned with the original index—for example, each row’s z-score within its group:
g = df.groupby("region")["revenue"]
df["rev_z"] = (df["revenue"] - g.transform("mean")) / g.transform("std")
This is how you avoid manual merges after aggregation when the result must stay row-aligned.
pivot_table for spreadsheet-style layouts
pd.pivot_table(df, values="revenue", index="region", columns="product", aggfunc="sum")
Useful when stakeholders expect a wide matrix instead of a long tidy table.
💡 Key takeaways
groupbyis the workhorse for “metric by dimension” questions in tabular data.- Choose between
aggregate/aggsummaries andtransformwhen every row needs a derived group statistic. - Name your aggregates explicitly (
total=…) so wide result frames stay understandable in code review.
➡️ Next steps
Visualization turns aggregates into narratives people can skim—jump to Introduction to data visualization.
Glossary
- Aggregation: A function that summarizes many rows (e.g., sum, median).
- Group key: Column(s) whose unique combinations define buckets.