Data Foundations: numpy & pandas

Cleaning Messy Data

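Real-world tables often have missing values and columns stored with the wrong type. pandas marks a missing value as NaN (Not a Number). Because NaN is a float, a numeric column that contains one is stored as float64, even if every other value is a whole number. You have two basic remedies: fill the gap with a substitute value, or drop the row that contains it.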

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],
    "age":  [30, np.nan, 35],
})
print(df["age"].fillna(0))    # replace NaN with 0
print(df.dropna())            # drop any row that has a NaN
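
Before choosing between filling and dropping, it can help to count how many values are actually missing. The sketch below rebuilds the same small table and also shows that fillna and dropna return new objects rather than changing df. The variable names missing_per_column, filled, and dropped are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],
    "age":  [30, np.nan, 35],
})

# Count the missing values in each column before choosing a strategy.
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna and dropna return new objects; df itself is left unchanged.
filled = df["age"].fillna(0)
dropped = df.dropna()
print(len(dropped))             # number of rows that survived the drop
print(df["age"].isna().sum())   # the original column still has its NaN
```

To keep a result, assign it back (for example df = df.dropna()), as the rest of this lesson does.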

A common trick is to fill missing numbers with the column average:

mean_age = df["age"].mean()        # NaN is ignored in the mean
df["age"] = df["age"].fillna(mean_age)
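
As a quick check on the comment above, this small sketch compares the default mean with one computed using skipna=False. It uses a standalone Series called ages (a name chosen for illustration) holding the same three values.

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, np.nan, 35])

pandas_mean = ages.mean()              # skips NaN by default (skipna=True)
strict_mean = ages.mean(skipna=False)  # NaN propagates into the result
filled = ages.fillna(pandas_mean)

print(pandas_mean)      # 32.5, the average of 30 and 35
print(strict_mean)      # nan
print(filled.tolist())  # [30.0, 32.5, 35.0]
```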

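Change a column's type with astype, and rename columns with rename. Converting to int only works once the column has no NaN left, and it truncates any fractional part, so the filled-in 32.5 becomes 32: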

df["age"] = df["age"].astype(int)
df = df.rename(columns={"name": "first_name"})
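
The order of the steps matters here. The sketch below, which uses an illustrative standalone Series named ages, shows what happens if astype(int) runs while a NaN is still present, and what the conversion does to a fractional value once the gap is filled.

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, np.nan, 35])

# Converting while a NaN is still present fails: int has no way to store NaN.
try:
    ages.astype(int)
    failed = False
except ValueError as err:
    failed = True
    print("astype failed:", err)

# After filling, the conversion works, but the fraction is dropped.
as_int = ages.fillna(ages.mean()).astype(int)
print(as_int.tolist())   # [30, 32, 35]
```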

To transform every value in a column, use apply with a function. It calls the function once per value and returns a new Series; the original column is not modified:

def initial(s):
    return s[0]
df["letter"] = df["first_name"].apply(initial)
# you can also use a lambda:
df["letter"] = df["first_name"].apply(lambda s: s[0])
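
Any function that takes one value and returns one value can be passed to apply. As a further sketch, the hypothetical helper shout below is applied to a standalone Series, and the original Series is printed afterwards to show it has not changed.

```python
import pandas as pd

names = pd.Series(["Ada", "Bo", "Cy"])

def shout(s):
    """Return the name in upper case with an exclamation mark."""
    return s.upper() + "!"

shouted = names.apply(shout)                 # named function
reversed_names = names.apply(lambda s: s[::-1])  # lambda, same idea

print(shouted.tolist())         # ['ADA!', 'BO!', 'CY!']
print(reversed_names.tolist())  # ['adA', 'oB', 'yC']
print(names.tolist())           # ['Ada', 'Bo', 'Cy'], unchanged
```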

Your turn

The DataFrame below has a missing age (NaN). Fill the missing value in the age column with the column mean (ignoring NaN), then convert the whole age column to int. Finally add a new column name_len that holds the length of each name (use .apply with len).
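
If you want something to compare your attempt against, here is one possible sketch. The exercise's own DataFrame is not reproduced in this text, so the data below is made up for illustration; only the column names age and name_len come from the task, and the name column is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the exercise's actual table.
df = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy", "Dee"],
    "age":  [30, np.nan, 35, 40],
})

# 1. Fill the missing age with the column mean (NaN is skipped by default).
df["age"] = df["age"].fillna(df["age"].mean())

# 2. Convert the whole column to int now that no NaN remains.
df["age"] = df["age"].astype(int)

# 3. Add the length of each name using apply with the built-in len.
df["name_len"] = df["name"].apply(len)

print(df)
```

With this made-up data the mean of 30, 35, and 40 is exactly 35, so nothing is lost when the column is converted to int.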
