Replacing, excluding or imputing missing values is a basic operation that’s done in nearly all data cleaning processes. In my third blog post on Julia I give an overview of common solutions for replacing missing values.
First, let’s create a dummy DataFrame as an example. Columns a and b each contain both NaNs and missings. Column a has element type Union{Missing, Float64} and column b has element type Any.
using DataFrames
using BenchmarkTools

df = DataFrame(
    a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9],
    b = ["a", "b", "c", "d", missing, "f", NaN, "g", "h", "i"]
)
Replace missing values in a DataFrame
Replace missing values in one column
Here are five ways to replace missing values.
- The first solution feels very R-ish, but is rather slow.
- The second solution uses the replace method from the (pre-installed) Missings package. It doesn’t improve the speed a lot.
- The third solution broadcasts the coalesce function, which is part of Base Julia rather than the DataFrames package. It doesn’t appear to have an in-place version.
- The fourth solution uses the replace function and is fast as lightning.
- But we can even go faster. By using the in-place version of replace, we can double the speed once again.
@benchmark df[ismissing.(df.a), :a] .= 0 # Median time: 38.7 µs
@benchmark collect(Missings.replace(df.a, 0)) # Median time: 32.6 µs
@benchmark df.a = coalesce.(df.a, 0) # Median time: 5.4 µs
@benchmark df.a = replace(df.a, missing => 0) # Median time: 0.2 µs
@benchmark replace!(df.a, missing => 0) # Median time: 0.08 µs (!)
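One detail the timings do not show is how the two fastest options treat the column's element type. The sketch below uses a plain vector with the same values as column a, so it runs without DataFrames. The names filled and inplace are only for illustration.

```julia
# Plain vector mirroring column a from the example DataFrame.
a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9]

# replace returns a new vector. Because every missing is swapped for a number,
# the element type of the result no longer includes Missing.
filled = replace(a, missing => 0)

# replace! mutates the vector it is given. The values change,
# but the element type still allows missing.
inplace = copy(a)
replace!(inplace, missing => 0)

println(eltype(a))
println(eltype(filled))
println(eltype(inplace))
```

If this holds on your Julia version, assigning the result of replace back to df.a leaves a column that no longer accepts missing, while replace! keeps the column open to new missings.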
It’s interesting to know that you can prevent missings from entering your dataframe. The first line of the code disallows missing values in column a.
disallowmissing!(df, :a)
df[2, :a] = missing
In the second line, we try to assign a missing to the second item in column a. This fails and throws the following error:
MethodError: Cannot `convert` an object of type Missing to an object of type Float64
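The same error can be reproduced without DataFrames. The sketch below assumes that a column without missing support behaves like a plain Vector{Float64}; the names v and err are hypothetical.

```julia
# A vector whose element type has no room for missing.
v = Float64[0, 1, 2]

# Assigning missing requires converting it to Float64, which is not defined.
err = try
    v[2] = missing
    nothing
catch e
    e
end

println(err isa MethodError)
println(v)
```

The vector is left unchanged, because the conversion fails before anything is written.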
Replace missing values in all columns
The following example uses the fastest solution (see above) in a for loop. I wish there was some more elegant solution that can do it in one line of code, but I haven’t been able to find one. Unsurprisingly, list comprehension and map() return two vectors, instead of a DataFrame.
for col in eachcol(df)
    replace!(col, missing => 0)
end
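One candidate for a one-liner is Base's foreach, which applies a function to every element of a collection purely for its side effects. The sketch below uses a plain vector of vectors called cols (a stand-in, not part of the example above) so that it runs without DataFrames. With a DataFrame, the assumption is that eachcol(df) can take the place of cols.

```julia
# Two vectors standing in for columns a and b.
a = [0, 1, 2, missing, 4, 5, NaN, 7, missing, 9]
b = ["a", "b", "c", "d", missing, "f", NaN, "g", "h", "i"]
cols = [a, b]

# Apply the in-place replacement to every column in one line.
foreach(col -> replace!(col, missing => 0), cols)

println(a)
println(b)
```

Because replace! mutates each vector, foreach is used instead of map, which would collect the returned vectors into a new array.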
Replace NaN in a DataFrame
Using the R-ish solution and the isnan function, we can easily do it as follows:
df[isnan.(df.a), :a] .= 0
However, there is a drawback. The isnan function is only defined for numeric values, so it works on a Float-type column but not on a String column. If you pass it a string column, it will generate an error:
MethodError: no method matching isnan(::String)
That’s why I prefer the following solution. In the next chunk of code I give the solution for both a specific column and for all columns.
replace!(df.a, NaN => 0)

for col in eachcol(df)
    replace!(col, NaN => 0)
end
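The reason replace! copes with mixed columns is that it matches values with isequal instead of calling isnan on each element. The sketch below shows this on a plain vector with element type Any; the names mixed and err are hypothetical.

```julia
# A mixed vector similar to column b: strings, a missing and a NaN.
mixed = Any["a", "b", missing, NaN, "g"]

# isnan has no method for strings, so calling it on a string throws.
err = try
    isnan("a")
    nothing
catch e
    e
end

# NaN is not == to itself, but isequal treats two NaNs as equal,
# which is what lets replace! find it among non-numeric values.
replace!(mixed, NaN => 0)

println(err isa MethodError)
println(mixed)
```

The strings and the missing are left alone, since isequal simply returns false for them.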