How to Load Multiple CSV Files into a Pandas DataFrame

Author:Murphy  |  View: 29909  |  Time: 2025-03-23 19:57:40

CSV (Common Separated Values) is a popular file format used to store and exchange data. In fact, this type of source is commonly used for relatively small volumes of data.

pandas is a commonly used Python package that lets developers process and transform data as part of analytical and data science tasks. However, beform perforning any task pandas needs to load all the data into memory. This means that the package can only be used for relatively small volumes of data – well this depends on the capabilities of the host machine but for an average machine, we would only be able to load a few GBs of data within memory.

Therefore, loading CSV files in memory and then processing the data with pandas, is a very common task given that such files usually consist of data whose size can be loaded in-memory.


Subscribe to Data Pipeline, a newsletter dedicated to Data Engineering


In this article, we will demonstrate how to load multiple CSV files in a single pandas DataFrame. Furthermore, we'll also showcase how to identify the source file in every record so that we can tell which data point belongs to a certain data file.

Now let's assume that we have data points collected in three separate CSV files namely data_1.csv, data_2.csv and data_3.csv :

colA,colB,colC
'A',1412,True
'B',1252,False
'C',1536,True
'D',1508,False
colA,colB,colC
'E',1115,False
'F',6416,True
'G',6241,True
colA,colB,colC
'H',1267,False
'I',1252,False
'J',2753,False
'K',7346,True

Concatenating multiple CSV files in a single pandas DataFrame

Now that we have some dummy data in three separate CSV files, we can go ahead and import them in one DataFrame.

The first option we have is to read every individual CSV file using pandas.read_csv() function and concatenate all loaded files into a single DataFrame using pandas.concatenate() function.

Python">import pandas as pd 

data_files = ['data_1.csv', 'data_2.csv', 'data_3.csv']
df = pd.concat((pd.read_csv(filename) for filename in data_files))

Now the newly constructed DataFrame contains all the data points found in the three input CSV files:

>>> df
  colA  colB   colC
0  'A'  1412   True
1  'B'  1252  False
2  'C'  1536   True
3  'D'  1508  False
0  'E'  1115  False
1  'F'  6416   True
2  'G'  6241   True
0  'H'  1267  False
1  'I'  1252  False
2  'J'  2753  False
3  'K'  7346   True

Notice how the index of our DataFrame is reset when loading data from multiple files. If instead you'd want to create a new index for the newly created DataFrame, all you need to do is ignore the index whilst concatenating the files:

import pandas as pd 

data_files = ['data_1.csv', 'data_2.csv', 'data_3.csv']
df = pd.concat(
  (pd.read_csv(filename) for filename in data_files), 
  ignore_index=True
)

>>> df
   colA  colB   colC
0   'A'  1412   True
1   'B'  1252  False
2   'C'  1536   True
3   'D'  1508  False
4   'E'  1115  False
5   'F'  6416   True
6   'G'  6241   True
7   'H'  1267  False
8   'I'  1252  False
9   'J'  2753  False
10  'K'  7346   True

Avoid specifying explicit file names

Now let's assume that we have hundreds of different CSV files that we would want to concateante into a single DataFrame. Instead of wasting time and lines of code to explicitly write all the individual filenames down, we could instead use a whitecard.

We can do so, by taking advantage of glob module, which is part of the standard libary and offers functionality to support unix style pathname pattern expansion. For example, in order to create a list containing of all the files ending with .csv filename under the current directory, we could use the following code snippet:

>>> import glob
>>>
>>> data_files = glob.glob('*.csv')
>>> data_files
['data_2.csv', 'data_3.csv', 'data_1.csv']

The full code that reads all csv files under current directory and concatenates them in a single pandas DataFrame is shared below:

import glob

import pandas as pd 

data_files = glob.glob('*.csv')
df = pd.concat(
  (pd.read_csv(filename) for filename in data_files), 
  ignore_index=True
)

Another approach – and perhaps a more Pythonic one – is one that takes advantage map built-in function that lets us run a method or function over an Iterable (like a Python List) without having to explicitly call a for loop:

import glob

import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob('*.csv')))

Identifying records coming from different files

In some other use-cases, knowing the original source file of a given record could be a piece of information that we would also need to track.

import glob
import pandas as pd

df = pd.concat(
  [
    pd.read_csv(filename).assign(source=filename)
    for filename in glob.glob('*.csv')
  ], 
  ignore_index=True
)

Now we expect that our DataFrame will hold an additional column that specifies the corresponding filename from which every record was added into the DataFrame:

>>> df
   colA  colB   colC      source
0   'E'  1115  False  data_2.csv
1   'F'  6416   True  data_2.csv
2   'G'  6241   True  data_2.csv
3   'H'  1267  False  data_3.csv
4   'I'  1252  False  data_3.csv
5   'J'  2753  False  data_3.csv
6   'K'  7346   True  data_3.csv
7   'A'  1412   True  data_1.csv
8   'B'  1252  False  data_1.csv
9   'C'  1536   True  data_1.csv
10  'D'  1508  False  data_1.csv

Final Thoughts

In this article, we demonstrated how to load multiple CSV files and concatenate them into a single DataFrame in pandas. Additionally, we showcased how you can actually perform this import without having to explicitly specify the file names to be loaded. Finally, we discussed how you can even create a new column in the loaded DataFrame such that you can then identify the source file for each record.

Now that you have loaded your data into pandas, you can take advantage of the rich API the package offers and lets you peform analyses, transformations and any kind of processing you might have to do. If you are looking to write a pandas DataFrame back to a CSV file make sure to follow the guide below

Tags: Data Engineering Data Science Programming Python Technology

Comment