Cleaning Messy Data
Real tables have gaps and wrong types. pandas marks a missing value as NaN (Not a Number). You either fill it or drop it.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],
    "age": [30, np.nan, 35],
})
print(df["age"].fillna(0)) # replace NaN with 0
print(df.dropna()) # drop any row that has a NaN
A common trick is to fill missing numbers with the column average:
mean_age = df["age"].mean() # NaN is ignored in the mean
df["age"] = df["age"].fillna(mean_age)
Change a column's type with astype, and rename columns with rename:
df["age"] = df["age"].astype(int) # truncates, so the filled 32.5 becomes 32
df = df.rename(columns={"name": "first_name"})
To transform every value in a column, use apply with a function. It runs the function on each cell and returns a new Series:
def initial(s):
    return s[0]
df["letter"] = df["first_name"].apply(initial)
# you can also use a lambda:
df["letter"] = df["first_name"].apply(lambda s: s[0])
The DataFrame below has a missing age (NaN). Fill the missing value in the age column with the column mean (ignoring NaN), then convert the whole age column to int. Finally add a new column name_len that holds the length of each name (use .apply with len).
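One way the three steps of the exercise could fit together is sketched here. The DataFrame, including the variable name people and its names and ages, is made up for illustration and is not the exercise's own data, so treat this as a sketch rather than the expected answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "people" and its values are assumptions,
# not the DataFrame supplied by the exercise.
people = pd.DataFrame({
    "name": ["Dee", "Eli", "Fay"],
    "age": [20.0, np.nan, 41.0],
})

# Step 1: fill the gap with the mean of the non-missing ages.
mean_age = people["age"].mean()  # (20 + 41) / 2 = 30.5, NaN is skipped
people["age"] = people["age"].fillna(mean_age)

# Step 2: convert to int. The cast truncates, so 30.5 becomes 30.
people["age"] = people["age"].astype(int)

# Step 3: apply the built-in len to each name.
people["name_len"] = people["name"].apply(len)

print(people)
```

The order matters in this sketch: the gap is filled before the conversion because astype(int) raises an error while the column still contains NaN. If rounding is preferred over truncation, calling round() on the column before astype(int) is one option.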