
Working with NaNs (nulls/NAs) in pandas: per column, per row and per group


Getting a firm grip on the NaNs in your dataset ensures you don’t draw wrong conclusions from incomplete data. In this blog post I show how to count the NaNs per column, per row, and per group.

First, let’s create some dummy data, and add some NaNs.

import pandas as pd
from random import randint, seed
import numpy as np

seed(1988)

taste = ['sweet','sour','sweet','bitter'] * 1000
color = ['red','green','yellow','red'] * 1000
fruit = ['apple','pear','banana','cherry'] * 1000

data = {'taste': taste, 'color': color, 'fruit': fruit}

df = pd.DataFrame(data)

# overwrite 100 random (row, column) positions with NaN
xy_rand = [(randint(0, 3999), randint(0, 2)) for _ in range(100)]
for xy in xy_rand:
    df.iloc[xy] = np.nan

Getting the number of NaN values per column is straightforward: isnull() returns a boolean DataFrame, and summing it counts the True values in each column. The difference between isnull() and isna()? There is none — isna() is simply an alias of isnull(), so you can use either.

df.isnull().sum()
df.isna().sum()
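If you only care about the columns that actually contain missing values, you can filter the resulting Series. A minimal sketch, using a small hypothetical DataFrame rather than the dummy data above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'taste': ['sweet', np.nan, 'sour'],
                   'color': ['red', 'green', 'yellow']})

# per-column NaN counts as a Series
nan_counts = df.isnull().sum()

# keep only the columns that actually contain NaNs
cols_with_nan = nan_counts[nan_counts > 0]
print(cols_with_nan)  # only 'taste' remains, with a count of 1
```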

Finding which rows (or index labels) contain missing values works analogously to the previous example — simply add axis=1 to sum across columns instead. This returns a count for every row, however, so for a large dataset you can filter it down to just the indices with missing values using .index.

df.isnull().sum(axis=1) # all indices
df.index[df.isnull().sum(axis=1) > 0] # only indices with NaN
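An equivalent and slightly more idiomatic variant uses any(axis=1) as a boolean mask, which lets you pull out either the incomplete rows themselves or just their index labels. A small sketch with a hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'taste': ['sweet', np.nan, 'sour'],
                   'color': ['red', 'green', np.nan]})

# True for every row that contains at least one NaN
mask = df.isnull().any(axis=1)

incomplete_rows = df[mask]         # the incomplete rows themselves
incomplete_index = df.index[mask]  # just their index labels
```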

Finally, you can count the NaNs per group (or class). In this example I want the number of missing values per color; it can be done for all columns at once or for one column specifically. Note that rows where the grouping column itself is NaN are excluded by groupby by default.

df.isnull().groupby(df.color, sort=False).sum() # all columns
df.taste.isnull().groupby(df.color, sort=False).sum() # 'taste' column only
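The same pattern also gives you the *fraction* of missing values per group: since the mask is boolean, mean() computes the share of True values. A sketch with a small hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'taste': ['sweet', np.nan, 'sour', 'bitter'],
                   'color': ['red', 'red', 'green', 'green']})

# mean() of the boolean mask = fraction of NaNs per group
frac_missing = df.taste.isnull().groupby(df.color, sort=False).mean()
print(frac_missing)  # red: 0.5, green: 0.0
```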

Great success!
