Apparently, this is something that many (even experienced) data scientists still google. Sometimes you’re dealing with a comma-separated value file that has no header. In this blog post I explain how to deal with this when you’re loading these files with pandas in Python.
The read_csv function in pandas is quite powerful. Compared to many other CSV-loading functions in Python and R, it offers many out-of-the-box parameters to clean the data while loading it.
When you’re dealing with a file that has no header, you can simply set the header parameter to None. Otherwise, pandas treats the first row of data as the column names.
pd.read_csv('file.csv', header = None)
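To see what header = None changes, here is a minimal sketch that reads a small in-memory CSV instead of a file on disk. The sample data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: three columns, two rows, no header row.
csv_text = "1,apple,0.5\n2,banana,0.25\n"

# Default behaviour: the first row is consumed as the column names.
df_wrong = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data and labels the columns 0, 1, 2.
df_right = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(df_wrong), len(df_right))
print(list(df_right.columns))
```

Without header = None you silently lose the first row of data, which is easy to miss in a large file.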
Even better, although the file has no column names of its own, you can specify them manually by passing a list to the names parameter.
pd.read_csv('file.csv', header = None, names = ['Column 1', 'Column 2', 'Column 3'])
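A small runnable sketch of the names parameter, again using made-up in-memory data rather than a real file. It also shows that, once names is passed, pandas treats the file as headerless even if you leave out header = None.

```python
import io

import pandas as pd

# Hypothetical sample data: three columns, two rows, no header row.
csv_text = "1,apple,0.5\n2,banana,0.25\n"
names = ['Column 1', 'Column 2', 'Column 3']

# Explicit: no header row in the file, use our own column names.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# Implicit: passing names alone has the same effect as header=None.
df_implicit = pd.read_csv(io.StringIO(csv_text), names=names)

print(list(df.columns))
print(df.equals(df_implicit))
```

Spelling out header = None anyway is a reasonable habit, since it makes the intent obvious to whoever reads the code next.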
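However, typing out every name by hand is not very efficient. Older pandas versions let you pass a prefix to read_csv so that the columns were numbered automatically, but that parameter was deprecated in pandas 1.4 and removed in 2.0. Did you know you can get the same result by chaining add_prefix? The default integer labels then become Column 0, Column 1, and so on.
pd.read_csv('file.csv', header = None).add_prefix('Column ')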
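On pandas 2.0 and later, read_csv no longer accepts prefix, so here is a sketch of automatic numbering that relies only on DataFrame.add_prefix and DataFrame.rename. The sample data is made up, and the one-based variant is just one way to match the "Column 1, Column 2, ..." naming used earlier.

```python
import io

import pandas as pd

# Hypothetical sample data: three columns, two rows, no header row.
csv_text = "1,apple,0.5\n2,banana,0.25\n"

raw = pd.read_csv(io.StringIO(csv_text), header=None)

# Zero-based: the integer labels 0, 1, 2 become 'Column 0', 'Column 1', ...
df_zero_based = raw.add_prefix('Column ')

# One-based: shift each integer label by one while renaming.
df_one_based = raw.rename(columns=lambda i: f'Column {i + 1}')

print(list(df_zero_based.columns))
print(list(df_one_based.columns))
```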
In huge CSV files, it’s often beneficial to only load specific columns into memory. In most situations, you’d pass a list of column names to the usecols parameter, yet it can also process a list of integers. To get the first and the third column, this is how you’d do it. Remember that Python uses zero-based indexing.
pd.read_csv('file.csv', header = None, usecols = [0, 2], names = ['Column 1', 'Column 3'])
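The sketch below runs the same idea on made-up in-memory data. It also shows an alternative that I find easier to read: name all columns first, then select by name. Both variants assume the file really has three columns.

```python
import io

import pandas as pd

# Hypothetical sample data: three columns, two rows, no header row.
csv_text = "1,apple,0.5\n2,banana,0.25\n"

# Select by position; names lists only the columns that are kept.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    usecols=[0, 2],
    names=['Column 1', 'Column 3'],
)

# Alternative: name every column, then select by name.
df_by_name = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['Column 1', 'Column 2', 'Column 3'],
    usecols=['Column 1', 'Column 3'],
)

print(df.shape)
print(df.equals(df_by_name))
```

Either way, the skipped column is never materialised in the resulting DataFrame, which is the point when the file is huge.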
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) was updated in 2020 and is an excellent primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.