Replacing NaN/Missing in Julia DataFrames


Replacing, excluding or imputing missing values is a basic operation that’s done in nearly all data cleaning processes. In my third blog post on Julia I give an overview of common solutions for replacing missing values.

First, let’s create a dummy DataFrame as an example. Both columns, a and b, contain NaNs as well as missings. Column a has element type Union{Missing, Float64} (the integers are promoted to Float64 because of the NaN) and column b has element type Any.

using DataFrames
using BenchmarkTools

df = DataFrame(
    a = [0,1,2,missing,4,5,NaN,7,missing,9], 
    b = ["a","b","c","d",missing,"f",NaN,"g","h","i"]
)
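To check what Julia inferred for each column, it can help to look at the element types directly. The following is a minimal sketch that uses plain vectors with the same contents as the two columns, so it runs without any package; the variable names are only for illustration.

```julia
# Plain vectors with the same contents as columns a and b.
a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9]
b = ["a", "b", "c", "d", missing, "f", NaN, "g", "h", "i"]

# Element types Julia infers for each vector.
a_type = eltype(a)   # the numbers are promoted to Float64, missing is kept
b_type = eltype(b)   # strings and floats share no concrete type

# Count the missing and NaN entries in a.
# The ismissing guard is needed because isnan(missing) returns missing.
n_missing = count(ismissing, a)
n_nan = count(x -> !ismissing(x) && isnan(x), a)

println((a_type, b_type, n_missing, n_nan))
```

With a DataFrame, the same check would be eltype(df.a) and eltype(df.b).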

Replace missing values in a DataFrame

Replace missing values in one column

I present five ways to replace missing values.

  1. The first solution feels very R-ish, but is rather slow.
  2. The second solution uses the replace method from the (pre-installed) Missings package. It doesn’t improve the speed a lot.
  3. The third solution broadcasts the coalesce function, which is part of Base Julia, over the column. It doesn’t appear to have an in-place version.
  4. The fourth solution uses the replace function from Base and is fast as lightning.
  5. But we can go even faster. By using the in-place version, replace!, we can double the speed once again.

@benchmark df[ismissing.(df.a), :a] .= 0 # Median time: 38.7 µs
@benchmark collect(Missings.replace(df.a, 0)) # Median time: 32.6 µs
@benchmark df.a = coalesce.(df.a, 0) # Median time: 5.4 µs
@benchmark df.a = replace(df.a, missing => 0) # Median time: 0.2 µs
@benchmark replace!(df.a, missing => 0) # Median time: 0.08 µs (!)
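Apart from speed, these options differ in what they return. The sketch below uses a plain vector standing in for column a and the Base functions only, to show that replace and coalesce build a new vector while replace! mutates the existing one. The variable names are assumptions for illustration, and 0.0 is used as the fill value so every element stays a Float64.

```julia
a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9]

# replace returns a new vector. With `missing => value` it also narrows the
# element type, because no missing can remain in the result.
a_copy = replace(a, missing => 0.0)

# coalesce is broadcast element-wise and also returns a new vector.
a_coalesced = coalesce.(a, 0.0)

# replace! mutates a itself, so its element type still allows missing.
replace!(a, missing => 0.0)

println(eltype(a_copy))
println(eltype(a))
```

The narrowing matters later on: a vector whose element type no longer includes Missing will refuse a missing assigned to it.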

It’s interesting to know that you can prevent missings from entering your DataFrame. Once column a no longer contains any missings, the first line of the code below disallows missing values in it.

disallowmissing!(df, :a)
df[2, :a] = missing

In the second line, we try to assign a missing to the second item in column a. This fails and throws the following error:

MethodError: Cannot `convert` an object of type Missing to an object of type Float64

Replace missing values in all columns

The following example uses the fastest solution (see above) in a for loop. I wish there were a more elegant solution that does it in one line of code, but I haven’t been able to find one. Unsurprisingly, a list comprehension and map() return the two column vectors instead of a DataFrame.

for col in eachcol(df)
    replace!(col,missing => 0)
end
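For what it’s worth, the loop can be condensed if that is preferred. This is a sketch under the assumption of a reasonably recent DataFrames version: foreach comes from Base and mapcols from DataFrames, while df_clean and n_before are names made up for this example.

```julia
using DataFrames

df = DataFrame(
    a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9],
    b = ["a", "b", "c", "d", missing, "f", NaN, "g", "h", "i"]
)

# Non-mutating variant: mapcols applies the function to every column and
# returns a new DataFrame, leaving df untouched.
df_clean = mapcols(col -> replace(col, missing => 0), df)
n_before = count(ismissing, df.a)   # df itself still holds its missings

# Mutating variant: foreach calls replace! on every column for its side
# effect, so df is modified in place, just like the for loop.
foreach(col -> replace!(col, missing => 0), eachcol(df))
```

The mutating variant does the same work as the loop; it only saves two lines.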

Replace NaN in a DataFrame

Using the R-ish solution and the isnan function, we can easily do it as follows:

df[isnan.(df.a), :a] .= 0

However, there is a drawback. The isnan function only accepts numbers, not strings. If you apply it to a column that contains strings, it will throw an error:

MethodError: no method matching isnan(::String)

That’s why I prefer the following solution. In the next chunk of code I give the solution for both a specific column and for all columns.

replace!(df.a, NaN => 0)

for col in eachcol(df)
    replace!(col,NaN => 0)
end
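The reason replace! works on any column is that it compares values with isequal, and isequal(NaN, NaN) is true even though NaN == NaN is false. The sketch below shows this on plain vectors; is_nan_value is a hypothetical helper written for this example, not part of any package, and shows how the isnan approach could be guarded instead.

```julia
# A mixed vector like column b: strings, a missing and a NaN.
b = Any["a", "b", missing, NaN, "g"]

# replace! matches with isequal, so NaN is found without ever
# calling isnan on a string.
replace!(b, NaN => 0)

# Hypothetical helper: only call isnan on floating-point values, so
# strings and missings simply count as "not NaN".
is_nan_value(x) = x isa AbstractFloat && isnan(x)

a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9]
nan_mask = is_nan_value.(a)   # Boolean mask, false for the missings
a[nan_mask] .= 0              # the missings in a are left alone
```

Note that the guarded mask also works while the column still contains missings, whereas a plain isnan.(df.a) would put missings into the mask.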

Great success!
