Removing Ï.., I and two dots or umlaut, when using read.csv in R

roelpi

5 years ago

Here’s something I used to bump into a lot when working with external files that I received from clients: some gibberish prepended to the first column name of a data frame when using read.csv. However, there’s a good reason why this happens.

The first character is a magical character, invisible to the human eye, but readable by a computer. It is the byte order mark (or BOM), and it tells the computer that the characters that follow are encoded in Unicode. In a UTF-8 file, it is stored as the three bytes EF BB BF.
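To make the BOM visible, here is a minimal sketch that writes a small CSV starting with a BOM and then looks at the raw bytes. The file contents and the variable names are made up for illustration. It also decodes those three bytes as Latin-1, which is one way the gibberish can arise.

```r
# The UTF-8 byte order mark: three bytes, EF BB BF.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Write a small CSV that starts with a BOM (hypothetical example data).
path <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("id,name\n1,Ann\n")), path)

# Read back the first three bytes of the file as raw values.
first_bytes <- readBin(path, what = "raw", n = 3)
print(first_bytes)

# Decode those same bytes as Latin-1 instead of UTF-8:
# each byte becomes its own visible character.
misread <- iconv(rawToChar(first_bytes), from = "latin1", to = "UTF-8")
print(misread)
```

Checking the first bytes like this is a quick way to confirm that a file really has a BOM before you start renaming columns.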

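However, software that assumes a different encoding, such as Latin-1, might interpret this character as something else: namely ï»¿. In a column name, read.csv then replaces the two characters that aren’t allowed in a name with dots, which leaves you with ï.. in front of the real name. There are two ways to solve it. The first one, just changing the fileEncoding parameter, doesn’t seem to work for everyone.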

read.csv('file.csv', fileEncoding = 'UTF-8-BOM')
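If you want to try this without a client file at hand, here is a small self-contained sketch. It builds a temporary CSV with a BOM (the data is invented) and reads it back with the fileEncoding setting from above. Whether it behaves the same on your setup is something to check for yourself.

```r
# Build a small CSV that starts with a UTF-8 BOM (hypothetical example data).
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("id,name\n1,Ann\n")), path)

# Read it while telling R to expect a BOM and drop it.
df <- read.csv(path, fileEncoding = "UTF-8-BOM")

# Inspect the column names to see whether the prefix is gone.
print(colnames(df))
```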

So here’s how I always solved it. I simply removed the first three characters of the first column name.

colnames(df)[1] <- gsub('^...','',colnames(df)[1])
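One caveat with the line above: it removes three characters even when the file had no BOM. A slightly more careful sketch only strips the prefix when it is actually there. The helper name strip_bom_prefix is my own invention, and it assumes the gibberish shows up as ï.. like in this post.

```r
# Hypothetical helper: remove the "ï.." prefix from the first column name,
# and leave the name untouched when that prefix is not present.
strip_bom_prefix <- function(df) {
  # "\u00ef" is the letter ï, written as an escape so the script
  # does not depend on the encoding of the source file.
  colnames(df)[1] <- sub("^\u00ef\\.\\.", "", colnames(df)[1])
  df
}

# Usage with a made-up data frame whose first column name is affected.
df <- data.frame(a = 1, name = "Ann")
colnames(df)[1] <- "\u00ef..id"
df <- strip_bom_prefix(df)
print(colnames(df))
```

Running it on a clean data frame is harmless, so you can apply it to every file you read in.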

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Great success!
