Here’s something I used to bump into a lot when working with external files that I receive from clients: some gibberish prepended to the first column name of a data frame when using read.csv. There’s a good reason why this happens.
The first character is a special character, invisible to the human eye but readable by a computer. It is the byte order mark (or BOM), and it tells the computer that the characters that follow are encoded in Unicode (here, UTF-8).
However, software that doesn’t expect a BOM might interpret it as something else, namely ï»¿, which is what its three bytes look like when they are read as Latin-1 or Windows-1252. There are two ways to solve it. The first one, just changing the fileEncoding parameter, doesn’t seem to work for everyone.
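If you want to check whether a file really starts with a BOM, you can look at its first bytes. In UTF-8 the BOM is stored as the three bytes EF BB BF. Here is a minimal sketch; the temporary file and its contents are made up purely for illustration.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The file and its contents are invented for this example.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(bom, con)
writeBin(charToRaw("id,name\n1,a\n2,b\n"), con)
close(con)

# Read the first three bytes back and compare them with the BOM.
first_bytes <- readBin(path, what = "raw", n = 3)
has_bom <- identical(first_bytes, bom)
print(first_bytes)
print(has_bom)
```

If has_bom is TRUE, the gibberish in the first column name is most likely the BOM and not part of the actual header.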
df <- read.csv('file.csv', fileEncoding = 'UTF-8-BOM')
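To try this without a client file at hand, here is a small self-contained sketch. It writes a made-up CSV with a BOM to a temporary file and reads it back with the fileEncoding setting from above; the file contents and column names are assumptions for the example only.

```r
# Create a temporary CSV that starts with the UTF-8 BOM (EF BB BF).
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id,name\n1,a\n2,b\n"), con)
close(con)

# 'UTF-8-BOM' tells R to drop a leading byte order mark while reading.
df <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(colnames(df))
```

With this setting the first column name should come back as plain id, without any prefix.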
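So here’s how I always solved it. I simply removed the first three characters of the first column name. (In R the prefix typically shows up as ï.. because read.csv replaces characters that aren’t valid in a name with dots.)
# Only touch the first column name; the other names stay as they are.
colnames(df)[1] <- sub('^...', '', colnames(df)[1])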
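Blindly cutting three characters will also damage a clean file that has no BOM. A slightly safer variant only strips the prefix when it is actually there. This is just a sketch: strip_bom_prefix is a hypothetical helper name, and the prefixes it looks for (the BOM character itself, ï»¿ and ï..) are the forms I assume can show up, so adjust the pattern to whatever you see in your own data.

```r
# Hypothetical helper: remove a BOM-like prefix from the first name only.
# The three alternatives are the BOM character (U+FEFF), the BOM bytes
# shown as Latin-1 text, and the dotted form that read.csv can produce.
strip_bom_prefix <- function(names) {
  pattern <- "^(\ufeff|\u00ef\u00bb\u00bf|\u00ef\\.\\.)"
  names[1] <- sub(pattern, "", names[1])
  names
}

# A name with the mangled prefix is cleaned ...
cleaned <- strip_bom_prefix(c("\u00ef..id", "name"))
# ... while names without the prefix are left alone.
untouched <- strip_bom_prefix(c("id", "name"))
print(cleaned)
print(untouched)
```

You would use it as colnames(df) <- strip_bom_prefix(colnames(df)).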
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!