Let’s say you want to use principal component analysis on the numerical columns in your data set to reduce the number of features in your model and get rid of multicollinearity. For that, you’d need to select the numerical columns only. Now how could you do that properly?
In the following piece of code I assume I have a data table dt. I look up the class of every column with sapply(dt, class) and run grepl on the result, which returns a logical vector: TRUE when the column class matches factor, logical or character and FALSE when it doesn’t. I invert this using the ! operator.
We can use this inverted vector to select the column names of the columns that are of a numerical class.
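To make the intermediate steps concrete, here is a minimal sketch on a made-up table. The column names and values are assumptions chosen for illustration; only the grepl pattern and the ! inversion come from the text above.

```r
library(data.table)

# Hypothetical toy table, for illustration only
dt <- data.table(
  id     = 1:3,                      # integer
  height = c(1.70, 1.82, 1.65),      # numeric
  name   = c("a", "b", "c"),         # character
  smoker = c(TRUE, FALSE, TRUE),     # logical
  group  = factor(c("x", "y", "x"))  # factor
)

# Class of every column, as a named character vector
col_classes <- sapply(dt, class)

# TRUE for factor, logical or character columns
is_categorical <- grepl('factor|logical|character', col_classes)

# Invert the flags and keep the names of the remaining (numerical) columns
numerical_cols <- colnames(dt)[!is_categorical]
print(numerical_cols)
```

Note that a column with more than one class (a date-time column, for example) would make sapply return a list rather than a character vector, so this sketch assumes every column has a single class.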
Finally, we can put this expression again in the original data table to actually select the data from the columns we are after.
For this, we set the ‘with’ parameter to FALSE, so we can refer to the columns using a vector of strings that holds their names. From the data.table documentation:
“The argument is named with after the R function with() because of similar functionality. […] Setting with = FALSE disables the ability to refer to columns as if they are variables […]”
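A small sketch of what with = FALSE does in practice, on a hypothetical three-column table (the table and the variable name cols are assumptions for illustration):

```r
library(data.table)

# Hypothetical toy table, for illustration only
dt <- data.table(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5))

# Column names stored as strings in an ordinary R variable
cols <- c("a", "c")

# with = FALSE: cols is treated as a character vector of column names,
# not as a column called "cols"
subset_dt <- dt[, cols, with = FALSE]
print(names(subset_dt))

# Recent data.table versions also accept the .. prefix for the same purpose
subset_dt2 <- dt[, ..cols]
```

Both forms return a data.table with just the selected columns; which one reads better is mostly a matter of taste.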
Putting it all together:
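library(data.table)
dt <- fread('XXX.csv')
# grepl() flags factor, logical and character columns; ! inverts the flags to keep the numerical ones
dt_numerical <- dt[, colnames(dt)[!grepl('factor|logical|character', sapply(dt, class))], with = FALSE]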
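As an alternative sketch (not the approach used above), you could test each column with is.numeric() instead of matching class names with a regular expression, and then pass the result to prcomp() from base R for the PCA step mentioned at the start. The toy table below is an assumption made up for illustration.

```r
library(data.table)

# Hypothetical toy table, for illustration only
dt <- data.table(
  x     = c(1, 2, 3, 4, 5),
  y     = c(2.1, 3.9, 6.2, 8.1, 9.8),
  z     = c(5, 3, 4, 1, 2),
  label = c("a", "b", "c", "d", "e")
)

# is.numeric() is TRUE for integer and double columns
numerical_cols <- names(dt)[sapply(dt, is.numeric)]
dt_numerical <- dt[, numerical_cols, with = FALSE]

# prcomp() ships with base R (stats package); centering and scaling
# puts the columns on a comparable footing before the decomposition
pca <- prcomp(dt_numerical, center = TRUE, scale. = TRUE)
print(summary(pca))
```

One difference to keep in mind: is.numeric() keeps integer and double columns and drops everything else, whereas the regex approach keeps every column whose class is not factor, logical or character, which can include dates.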
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!