A firm grasp of the NaNs in your dataset helps you avoid drawing wrong conclusions from incomplete data. In this blog post I show how to count the number of NaNs per column, per row, and per group.
First, let’s create some dummy data, and add some NaNs.
import pandas as pd
import numpy as np
from random import randint, seed
seed(1988)

# Build a dummy dataset of 4,000 rows
taste = ['sweet', 'sour', 'sweet', 'bitter'] * 1000
color = ['red', 'green', 'yellow', 'red'] * 1000
fruit = ['apple', 'pear', 'banana', 'cherry'] * 1000
data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)

# Sprinkle NaNs over 100 random (row, column) positions
xy_rand = [(randint(0, 3999), randint(0, 2)) for _ in range(100)]
for xy in xy_rand:
    df.iloc[xy] = np.nan
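A quick sanity check (my own addition, not strictly needed): count the total number of NaN cells in the whole frame. It can come out slightly below 100, since the random positions may contain duplicates.
df.isnull().values.sum()  # total number of NaN cells in the frame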
Getting the number of NaN values per column is straightforward: isnull() marks every cell as True or False, and sum() adds up the Trues. Wondering about the difference between isnull() and isna()? There is none: isnull() is an alias of isna(), so you can use either one.
df.isnull().sum()  # NaN count per column
df.isna().sum()    # identical result
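If you prefer proportions over raw counts, a small variation works too (my own extra, not part of the original recipe): because the booleans from isnull() behave as 0/1, mean() gives the fraction of missing values per column.
df.isnull().mean() * 100  # percentage of NaNs per column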
Finding which rows (or index labels) contain missing values works analogously: add axis=1 to sum across columns instead. This returns a count for every row, though, so on a large dataset you'll want to filter it down to just the indices that actually contain NaNs.
df.isnull().sum(axis=1) # all indices
df.index[df.isnull().sum(axis=1) > 0] # only indices with NaN
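And if you want to inspect the offending rows themselves rather than just their labels, a boolean mask with any(axis=1) does the trick (a handy variant I'm adding here):
df[df.isnull().any(axis=1)]  # rows containing at least one NaN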
Finally, you can count the NaNs per group (or class). In this example I want the number of missing values per color, both for every column at once and for one column specifically.
df.isnull().groupby(df.color, sort=False).sum() # all columns
df.taste.isnull().groupby(df.color, sort=False).sum()  # 'taste' column only
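One caveat worth adding: rows whose color is itself NaN are silently excluded here, because pandas drops NaN group keys by default. If you're on pandas 1.1 or newer, you can keep them with dropna=False:
df.taste.isnull().groupby(df.color, sort=False, dropna=False).sum()  # keeps the NaN-color group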
Great success!