Syllabus Lesson 80 of 239 · Data Foundations: numpy & pandas
Data Foundations: numpy & pandas

Grouping and Merging

Two operations unlock most data analysis: groupby (split rows into groups and summarise each) and merge (join two tables on a shared key).

groupby follows a split-apply-combine pattern. Group by a column, then aggregate:

import pandas as pd
sales = pd.DataFrame({
    "region": ["N", "S", "N", "S"],
    "amount": [100, 200, 150, 50],
})
print(sales.groupby("region")["amount"].sum())
# region
# N    250
# S    250

Use .agg to compute several summaries at once:

print(sales.groupby("region")["amount"].agg(["sum", "mean", "count"]))

merge joins tables that share a column. Here both tables have region:

info = pd.DataFrame({
    "region":  ["N", "S"],
    "manager": ["Ada", "Bo"],
})
joined = sales.merge(info, on="region")
print(joined)   # each sales row gains its region's manager

The default is an inner join: only keys present in both tables survive. Pass how="left" to keep every row of the left table even when there is no match.

Your turn

Two DataFrames are given. First, group orders by customer and sum the total column, storing the result Series in by_customer. Second, merge orders with customers on the customer column (inner join) and store the merged DataFrame in merged.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output