Skip to main content

Pandas (Part 5): Grouping and aggregation

After you have cleaned rows and engineered columns (Part 4: Data cleaning), analysis usually turns toward patterns within groups: sales by region, error rates by server, averages by cohort. Pandas expresses that workflow with groupby, which splits data, applies a function, then combines results.


📚 Prerequisites

  • Comfortable with Pandas Series and DataFrames.
  • Completed the earlier lessons on indexing and preparation in this series.

🎯 What you'll master

  • Use groupby to split a DataFrame using one or more key columns.
  • Apply aggregations (sum, mean, count) and inspect the resulting shape.
  • Work with agg for multiple summaries per column and transform when you need one value per row.
  • Pivot-style summaries with pivot_table at a beginner-friendly depth.

🧠 The split–apply–combine mental model

  1. Split rows into buckets keyed by categorical columns (region, product_id).
  2. Apply a reducing function per bucket—totals, means, custom lambdas.
  3. Combine bucket outputs into a new table or series.
import pandas as pd

df = pd.DataFrame(
{
"region": ["East", "East", "West", "West"],
"product": ["A", "B", "A", "B"],
"revenue": [10, 20, 15, 5],
}
)

df.groupby("region")["revenue"].sum()

The result is a Series indexed by region with summed revenue.


Multiple keys and multiple columns

df.groupby(["region", "product"])["revenue"].mean()

When you aggregate several value columns, select them before or after groupby:

df.groupby("region")[["revenue"]].agg(["sum", "mean"])

Named aggregations with agg

agg keeps intent readable when each column needs different reductions:

df.groupby("region").agg(total=("revenue", "sum"), avg=("revenue", "mean"))

You can also pass a dict of column → list of functions when you prefer compact syntax.


transform for per-row values

Sometimes you need a column aligned with the original index—for example, each row’s z-score within its group:

g = df.groupby("region")["revenue"]
df["rev_z"] = (df["revenue"] - g.transform("mean")) / g.transform("std")

This is how you avoid manual merges after aggregation when the result must stay row-aligned.


pivot_table for spreadsheet-style layouts

pd.pivot_table(df, values="revenue", index="region", columns="product", aggfunc="sum")

Useful when stakeholders expect a wide matrix instead of a long tidy table.


💡 Key takeaways

  • groupby is the workhorse for “metric by dimension” questions in tabular data.
  • Choose between aggregate/agg summaries and transform when every row needs a derived group statistic.
  • Name your aggregates explicitly (total=…) so wide result frames stay understandable in code review.

➡️ Next steps

Visualization turns aggregates into narratives people can skim—jump to Introduction to data visualization.


Glossary

  • Aggregation: A function that summarizes many rows (e.g., sum, median).
  • Group key: Column(s) whose unique combinations define buckets.