How to Load Multiple CSV Files into a Pandas DataFrame
CSV (Comma-Separated Values) is a popular file format used to store and exchange tabular data. In practice, CSV files typically hold relatively small volumes of data.
pandas is a commonly used Python package that lets developers process and transform data as part of analytical and data science tasks. However, before performing any task, pandas needs to load all the data into memory. This means that the package can only be used for relatively small volumes of data. Admittedly, this depends on the capabilities of the host machine, but on an average machine we would only be able to load a few GBs of data into memory.
Therefore, loading CSV files into memory and then processing the data with pandas is a very common task, given that such files usually hold data small enough to fit in memory.
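As a quick refresher, reading a single CSV file into a DataFrame takes just one call. The snippet below is a minimal sketch, assuming a file called data.csv exists in the current working directory:
import pandas as pd

# Load a single CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head())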
In this article, we will demonstrate how to load multiple CSV files into a single pandas DataFrame. Furthermore, we'll showcase how to record the source file of every row, so that we can tell which file each data point came from.
Now let's assume that we have data points collected in three separate CSV files, namely data_1.csv, data_2.csv and data_3.csv:
data_1.csv:

colA,colB,colC
'A',1412,True
'B',1252,False
'C',1536,True
'D',1508,False

data_2.csv:

colA,colB,colC
'E',1115,False
'F',6416,True
'G',6241,True

data_3.csv:

colA,colB,colC
'H',1267,False
'I',1252,False
'J',2753,False
'K',7346,True
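If you want to follow along, the snippet below is one quick way to recreate these dummy files locally (the contents match the listings above):
# Recreate the three dummy CSV files used throughout this article
files = {
    'data_1.csv': "colA,colB,colC\n'A',1412,True\n'B',1252,False\n'C',1536,True\n'D',1508,False\n",
    'data_2.csv': "colA,colB,colC\n'E',1115,False\n'F',6416,True\n'G',6241,True\n",
    'data_3.csv': "colA,colB,colC\n'H',1267,False\n'I',1252,False\n'J',2753,False\n'K',7346,True\n",
}

for filename, content in files.items():
    with open(filename, 'w') as f:
        f.write(content)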
Concatenating multiple CSV files in a single pandas DataFrame
Now that we have some dummy data in three separate CSV files, we can go ahead and import them in one DataFrame.
The first option we have is to read every individual CSV file using the pandas.read_csv() function and concatenate all loaded files into a single DataFrame using the pandas.concat() function.
Python">import pandas as pd
data_files = ['data_1.csv', 'data_2.csv', 'data_3.csv']
df = pd.concat((pd.read_csv(filename) for filename in data_files))
Now the newly constructed DataFrame contains all the data points found in the three input CSV files:
>>> df
colA colB colC
0 'A' 1412 True
1 'B' 1252 False
2 'C' 1536 True
3 'D' 1508 False
0 'E' 1115 False
1 'F' 6416 True
2 'G' 6241 True
0 'H' 1267 False
1 'I' 1252 False
2 'J' 2753 False
3 'K' 7346 True
Notice how the original index of each file is carried over when loading data from multiple files, so the resulting DataFrame contains repeated index values. If instead you'd like a fresh index for the newly created DataFrame, all you need to do is ignore the individual indices while concatenating the files:
import pandas as pd

data_files = ['data_1.csv', 'data_2.csv', 'data_3.csv']
df = pd.concat(
    (pd.read_csv(filename) for filename in data_files),
    ignore_index=True
)
>>> df
colA colB colC
0 'A' 1412 True
1 'B' 1252 False
2 'C' 1536 True
3 'D' 1508 False
4 'E' 1115 False
5 'F' 6416 True
6 'G' 6241 True
7 'H' 1267 False
8 'I' 1252 False
9 'J' 2753 False
10 'K' 7346 True
Avoid specifying explicit file names
Now let's assume that we have hundreds of different CSV files that we would like to concatenate into a single DataFrame. Instead of wasting time and lines of code explicitly writing down every individual filename, we can use a wildcard pattern instead.
We can do so by taking advantage of the glob module, which is part of the standard library and offers functionality for Unix-style pathname pattern expansion. For example, in order to create a list of all the files with a .csv extension under the current directory, we could use the following code snippet:
>>> import glob
>>>
>>> data_files = glob.glob('*.csv')
>>> data_files
['data_2.csv', 'data_3.csv', 'data_1.csv']
The full code that reads all CSV files under the current directory and concatenates them into a single pandas DataFrame is shared below:
import glob
import pandas as pd

data_files = glob.glob('*.csv')
df = pd.concat(
    (pd.read_csv(filename) for filename in data_files),
    ignore_index=True
)
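One small caveat worth keeping in mind: glob.glob() returns matching paths in arbitrary order, so if the order in which files get concatenated matters to you, a simple workaround is to sort the matches first:
import glob

# Sort the matched filenames so files are concatenated in a deterministic order
data_files = sorted(glob.glob('*.csv'))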
Another approach – and perhaps a more Pythonic one – is to take advantage of the map built-in function, which lets us run a function over an iterable (like a Python list) without having to write an explicit for loop:
import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob('*.csv')))
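As with the earlier examples, you can pass ignore_index=True to pd.concat() if you prefer the resulting DataFrame to get a fresh, continuous index:
import glob
import pandas as pd

# Same one-liner, but with a freshly generated index
df = pd.concat(map(pd.read_csv, glob.glob('*.csv')), ignore_index=True)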
Identifying records coming from different files
In some other use cases, knowing the original source file of a given record is a piece of information we may also need to track. We can capture it by assigning the filename to a new column as each file is read:
import glob
import pandas as pd

df = pd.concat(
    [
        pd.read_csv(filename).assign(source=filename)
        for filename in glob.glob('*.csv')
    ],
    ignore_index=True
)
Now our DataFrame holds an additional column that specifies the filename from which every record was loaded:
>>> df
colA colB colC source
0 'E' 1115 False data_2.csv
1 'F' 6416 True data_2.csv
2 'G' 6241 True data_2.csv
3 'H' 1267 False data_3.csv
4 'I' 1252 False data_3.csv
5 'J' 2753 False data_3.csv
6 'K' 7346 True data_3.csv
7 'A' 1412 True data_1.csv
8 'B' 1252 False data_1.csv
9 'C' 1536 True data_1.csv
10 'D' 1508 False data_1.csv
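With the source column in place, it becomes trivial to slice or aggregate the data per file. For instance, using standard pandas functionality:
# Count how many records each source file contributed
print(df.groupby('source').size())

# Keep only the records that originated from data_2.csv
df_2 = df[df['source'] == 'data_2.csv']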
Final Thoughts
In this article, we demonstrated how to load multiple CSV files and concatenate them into a single DataFrame in pandas. Additionally, we showcased how you can actually perform this import without having to explicitly specify the file names to be loaded. Finally, we discussed how you can even create a new column in the loaded DataFrame such that you can then identify the source file for each record.
Now that you have loaded your data into pandas, you can take advantage of the rich API the package offers, which lets you perform analyses, transformations and any other kind of processing you might need. If you are looking to write a pandas DataFrame back to a CSV file, make sure to follow the guide below
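As a quick pointer before you head to that guide, writing a DataFrame back out is a single call. A minimal sketch, assuming you want a file named output.csv without the index column:
# Write the combined DataFrame back to a single CSV file
df.to_csv('output.csv', index=False)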