
How to only select categorical or numerical columns in R


Let’s say you want to run a principal component analysis on the numerical columns in your data set, to reduce the number of features in your model and get rid of multicollinearity. For that, you need to select the numerical columns only. How can you do that properly?

In the following piece of code, I assume a data table named dt. sapply(dt, class) returns the class of each column, and grepl turns that into a logical vector: TRUE when the column class matches factor, logical or character and FALSE when it doesn’t. The ! operator inverts this vector, so TRUE now marks the numerical columns.

!grepl('factor|logical|character',sapply(dt,class))
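To make the intermediate results visible, here is a minimal sketch on a small toy table. The table and its column names (id, price, brand, size, sold) are made up for illustration and are not part of the original example.

```r
library(data.table)

# Hypothetical toy table, made up for illustration
dt <- data.table(
  id    = 1:3,                       # integer
  price = c(9.99, 14.5, 3.25),       # numeric (double)
  brand = c("a", "b", "c"),          # character
  size  = factor(c("S", "M", "L")),  # factor
  sold  = c(TRUE, FALSE, TRUE)       # logical
)

# One class name per column
col_classes <- sapply(dt, class)
print(col_classes)

# TRUE for columns whose class matches factor, logical or character
is_categorical <- grepl('factor|logical|character', col_classes)
print(is_categorical)   # FALSE FALSE  TRUE  TRUE  TRUE

# Inverted: TRUE for the remaining (numerical) columns
print(!is_categorical)  #  TRUE  TRUE FALSE FALSE FALSE
```

Note that any class that is not in the pattern, a Date column for example, would also end up on the "numerical" side, so it is worth checking the printed classes before relying on the result.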

We can use this inverted vector to select the column names of the columns that are of a numerical class.

colnames(dt)[!grepl('factor|logical|character',sapply(dt,class))]

Finally, we can put this expression again in the original data table to actually select the data from the columns we are after.

dt[,colnames(dt)[!grepl('factor|logical|character',sapply(dt,class))],with=FALSE]

For this, we use the ‘with’ parameter, so we can refer to the column names using the vector of strings. From the data.table documentation:

“The argument is named with after the R function with() because of similar functionality. […] Setting with = FALSE disables the ability to refer to columns as if they are variables, thereby restoring the ‘data.frame mode’.”
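As a small illustration of what with = FALSE does, the sketch below selects columns through a character vector stored in a variable. The toy table and the variable name wanted are hypothetical. The .. prefix shown next to it is an alternative notation offered by data.table for the same lookup.

```r
library(data.table)

# Hypothetical toy table, made up for illustration
dt <- data.table(
  id    = 1:3,
  price = c(9.99, 14.5, 3.25),
  brand = c("a", "b", "c")
)

# Column names stored as strings in an ordinary R variable
wanted <- c("id", "price")

# with = FALSE: treat `wanted` as a vector of column names
by_with <- dt[, wanted, with = FALSE]

# The ".." prefix: look up `wanted` in the calling scope instead of in dt
by_dots <- dt[, ..wanted]

print(names(by_with))  # "id" "price"
print(names(by_dots))  # "id" "price"
```

Both forms return a data table with the requested columns. Which one to use is mostly a matter of taste, and with = FALSE is handy when the column names come from an inline expression, as in this post.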

Putting it all together:

library(data.table)
dt <- fread('XXX.csv')

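dt_categorical <- dt[,colnames(dt)[grepl('factor|logical|character',sapply(dt,class))],with=FALSE]
dt_numerical <- dt[,colnames(dt)[!grepl('factor|logical|character',sapply(dt,class))],with=FALSE]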

Great success!
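If matching on class names feels too loose, one possible variation is to test each column with a predicate such as is.numeric instead. This is only a sketch of an alternative, not the method used above. The toy table, its column names and the helper is_cat are made up for illustration.

```r
library(data.table)

# Hypothetical toy table, made up for illustration
dt <- data.table(
  id    = 1:3,
  price = c(9.99, 14.5, 3.25),
  brand = c("a", "b", "c"),
  size  = factor(c("S", "M", "L")),
  sold  = c(TRUE, FALSE, TRUE),
  day   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"))
)

# TRUE for integer and double columns; FALSE for factors and dates
is_num <- sapply(dt, is.numeric)

# Hypothetical helper: TRUE for factor, character or logical columns
is_cat <- sapply(dt, function(x) is.factor(x) || is.character(x) || is.logical(x))

dt_numerical   <- dt[, names(dt)[is_num], with = FALSE]
dt_categorical <- dt[, names(dt)[is_cat], with = FALSE]

print(names(dt_numerical))    # "id" "price"
print(names(dt_categorical))  # "brand" "size" "sold"
```

In this sketch the day column lands in neither group, whereas the inverted grepl pattern would count it as numerical because "Date" is not in the pattern. Whether that is the behaviour you want depends on what you plan to do with the columns afterwards.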
