A firm grasp of the NaNs in your dataset helps you avoid drawing wrong conclusions from incomplete data. In this blog post I show how to count the number of NaNs per column, per row, and per group.
First, let’s create some dummy data, and add some NaNs.
import pandas as pd
import numpy as np
from random import randint, seed
seed(1988)

# Build a dummy dataset of 4,000 rows
taste = ['sweet', 'sour', 'sweet', 'bitter'] * 1000
color = ['red', 'green', 'yellow', 'red'] * 1000
fruit = ['apple', 'pear', 'banana', 'cherry'] * 1000
data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)

# Sprinkle NaNs over 100 random (row, column) positions
xy_rand = [(randint(0, 3999), randint(0, 2)) for _ in range(100)]
for xy in xy_rand:
    df.iloc[xy] = np.nan
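A quick sanity check (my own addition, not strictly needed): count the total number of NaN cells in the whole frame. It can come out slightly below 100, since the random positions may contain duplicates.
df.isnull().values.sum()  # total number of NaN cells in the frame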
Getting the number of NaN values per column is straightforward: isnull() marks every cell as True or False, and sum() adds up the Trues. Wondering about the difference between isnull() and isna()? There is none: isnull() is an alias of isna(), so you can use either one.
df.isnull().sum()  # NaN count per column
df.isna().sum()    # identical result
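If you prefer proportions over raw counts, a small variation works too (my own extra, not part of the original recipe): because the booleans from isnull() behave as 0/1, mean() gives the fraction of missing values per column.
df.isnull().mean() * 100  # percentage of NaNs per column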
Finding which rows (or index labels) contain missing values works analogously: add axis=1 to sum across columns instead. This returns a count for every row, though, so on a large dataset you'll want to filter it down to just the indices that actually contain NaNs.
df.isnull().sum(axis=1) # all indices
df.index[df.isnull().sum(axis=1) > 0] # only indices with NaN
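And if you want to inspect the offending rows themselves rather than just their labels, a boolean mask with any(axis=1) does the trick (a handy variant I'm adding here):
df[df.isnull().any(axis=1)]  # rows containing at least one NaN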
Finally, you can count the NaNs per group (or class). In this example I want the number of missing values per color, both for every column at once and for one column specifically.
df.isnull().groupby(df.color, sort=False).sum() # all columns
df.taste.isnull().groupby(df.color, sort=False).sum()  # 'taste' column only
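One caveat worth adding: rows whose color is itself NaN are silently excluded here, because pandas drops NaN group keys by default. If you're on pandas 1.1 or newer, you can keep them with dropna=False:
df.taste.isnull().groupby(df.color, sort=False, dropna=False).sum()  # keeps the NaN-color group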
Great success!