Skip to content
Home ยป Fixing CSVs and edit text files using R

Fixing CSVs and edit text files using R

  • by
  • 2 min read

In this blog post I explain how you might be able to fix your CSV files if some of their values contain the same character that is used as delimiter (separator).

A problem I recently ran into is that I received a CSV file that contained fields that contained the character that was used as the delimiter. This error that was generated was the following:

Stopped early on line …. Expected … fields but found …. Consider fill=TRUE and comment.char=. First discarded non-empty line:

One can fix this manually, by opening the text file and replacing the character everywhere. However, there might be a faster way.

If you have only a couple of values that keep returning throughout your file, you can do a search and replace. If your file is enormous, you cannot do that through Notepad(++). Furthermore, if this file will be updated recurrently, without fixing the problem, you might want to make this reproducible.

Here’s some example data. As you can see, when something gets sold to a customer in Greenland, you can see that it is stored as “Greenland;Denmark”, which is problematic because it contains a semicolon, which is also the file’s delimiter.

abc123;France;The Croissant Factory;200
abc124;Germany;The Sausage Factory;300
abc125;Greenland;Denmark;The Ice Cream Factory;150
abc126;Turkey;The Baklava Factory;200
abc127;Greenland;Denmark;The Whipcream Factory;120
abc128;Germany;The Potato Company;250

Instead of reading in this file as a data frame through fread() or read.csv(), you can just read this in as flat text using the readLines function. You can inspect your file, or read chunks of it into R to find out what the patterns are that you need to replac.

Next, you can make a series of gsub()‘s to replace the patterns that cause issues. You can even chain/pipe the gsub function with the magrittr library.

And finally, you write the text away to a new CSV file.

tx <- readLines('file.csv')
tx <- gsub('Greenland\\;Denmark','Greenland', tx)
writeLines(tx, 'newfile.csv')

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Good job!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *