In this blog post I discuss how you can load compressed CSV files, such as .zip and .tar.gz archives, directly into R. Nowadays many packages support this, and we’ll go over the different methods.
When data sets are ping-ponged across an organization, in order to limit network and storage usage, they often come in a compressed format. Instead of losing time unzipping the file manually, it’s perfectly fine to load these files directly into R.
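Using base R, loading a CSV file from a ZIP archive that contains one or more files can be done with the unz function. It takes the path of the archive and the name of the file inside it, and returns a connection that read.csv can read from. You can even load files that sit within a folder inside that ZIP file.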
read.csv(unz('twofiles.zip', 'second_file.csv'), header = TRUE)
read.csv(unz('onefile.zip', 'only_file.csv'), header = TRUE)
read.csv(unz('twofiles_in_folder.zip', 'twofiles/mtcars2.csv'), header = TRUE)
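If you don’t know the exact file names inside an archive, you can list them first. The sketch below is a minimal example, assuming the archive is a regular ZIP file. Note that read_zipped_csv() is a hypothetical helper name of my own, not part of base R. It relies on unzip(list = TRUE), which only lists the contents without extracting anything.

```r
# Hypothetical helper (not part of base R): read one CSV from a ZIP archive.
read_zipped_csv <- function(zip_path, member = NULL, ...) {
  # List the archive contents without extracting anything;
  # the result is a data frame with the columns Name, Length and Date
  contents <- utils::unzip(zip_path, list = TRUE)
  csv_names <- grep("\\.csv$", contents$Name, value = TRUE, ignore.case = TRUE)

  if (is.null(member)) {
    # Only guess the member when the choice is unambiguous
    if (length(csv_names) != 1) {
      stop("Please specify 'member'. CSV files found: ",
           paste(csv_names, collapse = ", "))
    }
    member <- csv_names
  } else if (!member %in% contents$Name) {
    stop("'", member, "' was not found in ", zip_path)
  }

  # read.csv() opens the unz() connection and closes it again when done
  utils::read.csv(unz(zip_path, member), ...)
}
```

With the files from above, read_zipped_csv('onefile.zip') would read the only CSV in the archive, while read_zipped_csv('twofiles_in_folder.zip', member = 'twofiles/mtcars2.csv') picks a file inside a folder.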
Reading a zipped file with data.table’s fread() can be done by passing a command-line call through its cmd argument. You need to have unzip (or gunzip) in your PATH variable, or in your project folder. By the way, you can achieve the same with 7-Zip.
fread(cmd = 'unzip -p onefile.zip') # Windows
fread(cmd = 'gunzip -cq onefile.zip') # Linux
fread(cmd = '7z e -so onefile.zip') # 7-zip
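When paths contain spaces, or when the archive holds more than one file, the command string needs a bit more care. The sketch below assumes that unzip is available on your PATH. Note that fread_zip() is a hypothetical wrapper of my own, not a data.table function. It quotes the arguments with shQuote() and optionally passes a member name to unzip -p.

```r
library(data.table)

# Hypothetical wrapper (not part of data.table): stream a ZIP member into fread().
fread_zip <- function(zip_path, member = NULL, ...) {
  # Fail early with a readable message if the command-line tool is missing
  if (!nzchar(Sys.which("unzip"))) {
    stop("'unzip' was not found on the PATH")
  }
  # shQuote() protects paths that contain spaces or special characters
  cmd <- paste("unzip -p", shQuote(zip_path))
  if (!is.null(member)) {
    # Without a member name, unzip -p writes every file in the archive to stdout
    cmd <- paste(cmd, shQuote(member))
  }
  fread(cmd = cmd, ...)
}
```

For the archive with two files, a call could look like fread_zip('twofiles.zip', member = 'second_file.csv').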
Using vroom, loading a single zipped file is even easier because you don’t need to specify any command at all: you pass the path of the compressed file straight to vroom().
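To make that concrete, here is a minimal, self-contained sketch. It first writes a small gzip-compressed CSV to a temporary file and then reads it back. The mtcars data and the temporary path are stand-ins of my own, not files from this post. For the ZIP file used earlier, the call would simply be vroom('onefile.zip').

```r
library(vroom)

# Stand-in data: write mtcars as a gzip-compressed CSV to a temporary file.
# vroom_write() picks the compression from the file extension.
path <- tempfile(fileext = ".csv.gz")
vroom_write(mtcars, path, delim = ",")

# Reading works the same way: vroom() detects the compression by itself,
# so no shell command or connection is needed.
df <- vroom(path, delim = ",")
dim(df)
```

Setting delim explicitly is optional, since vroom can guess the delimiter, but it avoids surprises with unusual files.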
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!