Personally, Random Forest is one of my favorite algorithms for supervised learning. It’s quick and dirty and still allows for some interpretation. However, R and the randomForest package are somewhat cryptic when your data does not meet the requirements for training the algorithm. I ran into this error message a lot:
Error in randomForest.default(m,y,...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x): NAs introduced by coercion
2: In data.matrix(x): NAs introduced by coercion
In this blog post I would like to show you how to fix it. There is more than one possible cause, though. Here’s what could have gone wrong:
- Your data contains NAs
- Your data contains NaNs
- Your data contains Infs
- Your data contains columns of type ‘character’
In the following paragraphs, I explain how you can check your data table for these issues. Let’s create a sample data set: a 10 by 10 data table filled with draws from a standard normal distribution.
set.seed(19880303) # Setting the seed to my birthday
library(data.table) # install.packages('data.table') if necessary
norms <- list() # Create an empty list
for (i in 1:10) {
norms[[i]] <- data.table(t(rnorm(10,0,1))) # 10 data tables with 10 norms
}
dt <- rbindlist(norms) # binding it all together in a data table
rm(norms) # remove the list
# Now, let's add some issues to our data
dt[5,5] <- Inf # Add an Inf to the data set
dt[4,10] <- NA # Add an NA to the data set
dt[8,3] <- NaN # Add an NaN to the data set
dt[,V9 := as.character(V9)] # turn V9 into a character column
dt$V9 <- sample(c('a','b','c'), 10, replace = TRUE) # fill it with random letters
dt[,V6 := as.character(V6)] # turn V6 into a character column
dt$V6 <- sample(c('a','b','c'), 10, replace = TRUE) # fill it with random letters
This gives me the following data set:
We start off by checking for NAs and NaNs. Once you find them, you can use one of several techniques to impute the data. The following code prints all the rows that contain an NA or a NaN:
dt[!complete.cases(dt)]
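As for imputing, one of the simplest options is to fill the gaps in every numeric column with that column's median. The lines below are only a minimal sketch of that idea in base R; the data frame `toy` and the function name `impute_median` are made up for this illustration, and median imputation is just one choice among many.

```r
# Hypothetical toy data with one NA and one NaN
toy <- data.frame(
  a = c(1, NA, 3, 10),
  b = c(NaN, 2, 4, 6)
)

# Replace NA and NaN in every numeric column by the column median.
# is.na() is TRUE for both NA and NaN, so one test covers both.
impute_median <- function(df) {
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
      df[[j]] <- col
    }
  }
  df
}

imputed <- impute_median(toy)
print(imputed)
print(anyNA(imputed)) # FALSE once all gaps are filled
```

The randomForest package also ships `na.roughfix()`, which does a comparable median (or most frequent level) fill, and `rfImpute()` for a proximity-based approach, so you may not need to write this yourself.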
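Next up are the Infs. For me personally, these show up a lot when I create features that are ratios and there are some divisions by zero in there, because in R, dividing a nonzero number by zero returns Inf or -Inf (and 0/0 returns NaN). If your data table only contains numeric columns, you can simply run colSums on it. However, if your data does not exclusively contain numeric data, the following lines of code print the sum of every numeric column, which is Inf (or -Inf) if the column contains infinite values. Keep in mind that the sum comes out as NA or NaN instead if the same column also contains missing values, or both Inf and -Inf.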
for (i in 1:ncol(dt)) { # For all columns...
if (is.numeric(dt[[i]])) { # if the column is numeric...
print(sum(dt[[i]])) # print the sum of the column.
}
}
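If you would rather count the infinite values directly than read them off a sum, `is.infinite()` does the job. This is a small sketch and not the only way to do it; the data frame `ratios` and the helper `count_inf` are names I made up for the example.

```r
# Hypothetical ratio feature with divisions by zero
num <- c(1, 0, -2, 5)
den <- c(2, 0, 0, 0)
ratios <- data.frame(
  ratio = num / den,               # 0.5, NaN (0/0), -Inf, Inf
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Count Inf and -Inf per column; non-numeric columns count as zero
count_inf <- function(df) {
  vapply(df, function(col) {
    if (is.numeric(col)) sum(is.infinite(col)) else 0L
  }, integer(1))
}

print(count_inf(ratios))
# The plain sum is not Inf here, because Inf, -Inf and NaN are mixed
print(sum(ratios$ratio))
```

The counts stay correct even when a column mixes Inf, -Inf and missing values, which is exactly the case where the sum is hard to interpret.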
Finally, to find all the character columns and automatically convert them to factor columns, the following lines of code should do the trick.
for (i in 1:ncol(dt)) { # For every column...
if (typeof(dt[[i]]) == 'character') { # if the column type is character...
dt[[i]] <- as.factor(dt[[i]]) # Convert it to factor.
}
}
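To run all of these checks in one go before calling randomForest, you could wrap them in a small helper. The function `check_rf_input` and the data frame `messy` below are hypothetical names for this sketch; it only uses base R, and since a data table is also a data frame, it should work on both.

```r
# Report, per column, the issues discussed in this post
check_rf_input <- function(df) {
  data.frame(
    column    = names(df),
    class     = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1)), # NA and NaN
    n_inf     = vapply(df, function(col) {
      if (is.numeric(col)) sum(is.infinite(col)) else 0L
    }, integer(1)),
    is_char   = vapply(df, is.character, logical(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical data with one issue of each kind
messy <- data.frame(
  x = c(1, Inf, 3),
  y = c(NA, 2, NaN),
  z = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

report <- check_rf_input(messy)
print(report)

# Columns that need attention before training
flagged <- report$column[report$n_missing > 0 | report$n_inf > 0 | report$is_char]
print(flagged)
```

An empty `flagged` vector means none of the four issues from the list at the top were found.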
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Good luck!