In this blog post I explain how you might be able to fix your CSV files if some of their values contain the same character that is used as delimiter (separator).
A problem I recently ran into is that I received a CSV file that contained fields that contained the character that was used as the delimiter. This error that was generated was the following:
Stopped early on line …. Expected … fields but found …. Consider fill=TRUE and comment.char=. First discarded non-empty line:
One can fix this manually, by opening the text file and replacing the character everywhere. However, there might be a faster way.
If you have only a couple of values that keep returning throughout your file, you can do a search and replace. If your file is enormous, you cannot do that through Notepad(++). Furthermore, if this file will be updated recurrently, without fixing the problem, you might want to make this reproducible.
Here’s some example data. As you can see, when something gets sold to a customer in Greenland, you can see that it is stored as “Greenland;Denmark”, which is problematic because it contains a semicolon, which is also the file’s delimiter.
ordernr;country;client;total
abc123;France;The Croissant Factory;200
abc124;Germany;The Sausage Factory;300
abc125;Greenland;Denmark;The Ice Cream Factory;150
abc126;Turkey;The Baklava Factory;200
abc127;Greenland;Denmark;The Whipcream Factory;120
abc128;Germany;The Potato Company;250
Instead of reading in this file as a data frame through fread() or read.csv(), you can just read this in as flat text using the readLines function. You can inspect your file, or read chunks of it into R to find out what the patterns are that you need to replac.
Next, you can make a series of gsub()‘s to replace the patterns that cause issues. You can even chain/pipe the gsub function with the magrittr library.
And finally, you write the text away to a new CSV file.
tx <- readLines('file.csv')
tx <- gsub('Greenland\\;Denmark','Greenland', tx)
writeLines(tx, 'newfile.csv')
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Good job!
Najlepsza aplikacja do kontroli rodzicielskiej, aby chronić swoje dzieci – potajemnie tajny monitor GPS, SMS-y, połączenia, WhatsApp, Facebook, lokalizacja. Możesz zdalnie monitorować aktywność telefonu komórkowego po pobraniu i zainstalowaniu apk na telefonie docelowym. https://www.mycellspy.com/pl/