How to: Handle Missing Data for Time Series

There is no such thing as a perfect dataset. Every data scientist knows that feeling during data exploration when they call:
df.info()
and see that one or more columns have far fewer non-null values than the dataframe has rows.

Most ML models cannot process NaN or null values, so it is important that if your features or target contain them, they are dealt with appropriately before attempting to fit a model to the data.
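Beyond df.info(), a quick way to quantify the problem (a minimal sketch, assuming your data is already loaded into a dataframe called df) is to count the nulls per column directly:
# Number of missing values in each column
print(df.isnull().sum())
# The same counts as a percentage of total rows
print(df.isnull().mean() * 100)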
In this article, I will explore 3 simple ways to handle nulls/missing data in time series datasets.
1. Drop nulls
This is probably the simplest and most straightforward way to handle missing data: Just get rid of it.
# Drop any and all nulls across all columns
df.dropna(inplace=True)
By default, pandas' dropna function searches for nulls across the board (in all columns) and drops any row where there is a null in any column. However, this can be modified using various parameters.
In the case of this dataset, notice how column NMHC(GT) has only 914 non-null values. Therefore, if we drop any and all nulls, we will only end up with at most 914 (probably less) rows of data for our model. This is a drastic drop from the original 9,357!
By specifying a subset of columns, pandas will only drop rows in the dataframe where there are nulls in those specific columns.
df.dropna(subset=['CO(GT)','PT08.S1(CO)'], inplace=True)
This way we can sort of mix and match methods, where we drop nulls in some columns and deal with other columns differently.
You can also specify that you only want to drop rows where all columns are null by setting the how argument to "all". The default value for how is "any".
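For example, sticking with this dataset, dropping only the rows where every column is null looks like this:
# Drop a row only when every column in it is null
df.dropna(how='all', inplace=True)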
2. Interpolate nulls
Another simple way to fill in null values is through interpolation. Pandas' interpolate method uses linear interpolation as a default.
Linear interpolation takes the two known values that come before and after the null and draws a straight line between them. It then uses this line to estimate the value of the missing data point. Pandas' interpolate method assumes that each data point is equally spaced. If you do not have a row for every possible timestamp, you can set the interpolation method to "time" as long as you have a datetime index. That way, if two consecutive rows are more than one interval apart (e.g., more than 1 hour or 1 day), the interpolation will take that distance into account.
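A minimal sketch of both variants, using the CO(GT) column as an example and assuming the dataframe has a datetime index:
# Default linear interpolation, which treats consecutive rows as equally spaced
df['CO(GT)'] = df['CO(GT)'].interpolate()
# Alternatively, with a datetime index, weight the estimate by the actual time gap
df['CO(GT)'] = df['CO(GT)'].interpolate(method='time')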
If the null is at the very first index, there are no values preceding it, so it will not be interpolated.

In this case, the interpolation is simple since there is a single null right in the middle of two known values, and all values are equally spaced by 1 hour. The null value at index 10 will simply be the average of the values before and after (0.65).
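To illustrate with made-up numbers (the 0.6 and 0.7 readings and the timestamps below are assumptions purely for the sake of the example):
import pandas as pd
import numpy as np

# A tiny hourly series with one missing reading between two known values
s = pd.Series(
    [0.6, np.nan, 0.7],
    index=pd.date_range('2004-03-10 09:00', periods=3, freq='h'),
)
# Linear interpolation fills the NaN with 0.65, the average of its neighbors
print(s.interpolate())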
If there are 2 or more consecutive NaNs, they will be interpolated taking the distance between them and the known values into account.
See here for more detailed information on how linear interpolation works.
You can set a limit on how many consecutive NaNs to interpolate via the limit keyword argument. If there is a long run of consecutive NaNs, you may want to drop the rest after a certain point of interpolation, because each interpolated value adds uncertainty to the data. More interpolation = more uncertainty, especially in the case of time series.
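As a sketch of that drop-after-a-threshold idea (the limit of 3 here is arbitrary):
# Fill at most 3 consecutive NaNs; anything beyond that in a longer gap stays NaN
df['CO(GT)'] = df['CO(GT)'].interpolate(limit=3)
# Then drop the rows that are still missing in that column
df.dropna(subset=['CO(GT)'], inplace=True)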
3. Impute nulls
The last method I want to cover is imputation. Imputation essentially means filling in nulls with a summary statistic, typically the mean or median of your data.
The simplest way to do this is using pandas' fillna and taking the median of the entire column.
df['CO(GT)'] = df['CO(GT)'].fillna(df['CO(GT)'].median())
But when it comes to time series, often the median of the entire dataset will not be accurate. Time series data often has seasonal patterns, where usage varies based on the hour of the day, the day of week, month, etc.
For this example I decided to impute the CO(GT) column using the median for that hour.
In order to impute with the hourly median, I came up with my own solution, since there isn't a straightforward method or library to do this (that I'm aware of). I first had to create a dataframe containing the median of each column for each hour.
# Create a dataframe containing the median of each column, grouped by hour
hour_df = df.groupby(df.index.hour).median()
hour_df.reset_index(inplace=True)

Next, I created a function called get_hour_median. Although I'm only showcasing this for the CO(GT) column, I made this function flexible enough so that it can handle any column name.
def get_hour_median(hour, col_name):
    median = hour_df[hour_df['Datetime'] == hour][col_name].values[0]
    return median
I then applied this function to the CO(GT) column using apply and one other custom function.
# Reset datetime index for easier processing in the functions below
df.reset_index(inplace=True)
# Takes in a dataframe row and returns the hourly median if the value in
# col_name is null, otherwise returns the original value.
def fill_with_hourly_median(row, col_name):
    if pd.isnull(row[col_name]):
        return get_hour_median(row['Datetime'].hour, col_name)
    else:
        return row[col_name]
# Apply fill_with_hourly_median to CO(GT) column
df['CO(GT)'] = df.apply(fill_with_hourly_median, axis=1, col_name='CO(GT)')
The CO(GT) column should now have median values filled in for the corresponding hours instead of NaNs.
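As a quick sanity check (just a sketch), you can confirm that no nulls remain in the column:
# Should print 0 if every NaN in CO(GT) was replaced by its hourly median
print(df['CO(GT)'].isnull().sum())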
Which to choose?
A lot of times you will use a combination of different methods for different columns. For example, since linear interpolation won't fill in the first values in a column, if there are null rows at the beginning of the dataframe, these can be dropped after those in the middle of the dataframe have been interpolated.
If you have a lot of data, and not that many nulls, dropping a few rows won't make a huge difference. In this case, dropping is usually my preferred method since all the data I'll be feeding into the model is actual data.
For small gaps (1–2 missing rows) that show up sporadically across the dataset, I will typically use interpolation. If the gaps are larger, though, with lots of consecutive nulls, I would consider imputing with the median up to a certain threshold (somewhere around 6–10 consecutive nulls, depending on the granularity of the data and how consistent its patterns are), after which I'd start dropping rows.
As you can see, though dealing with missing data is a common occurrence, there are a lot of considerations that go into how to handle it. The methods I mentioned are by no means the only ones but there is a lot you can do with just these 3 options.
I recommend thoroughly exploring your time series data by plotting it out and identifying where the nulls are, whether the gaps are small or large, and what kinds of seasonal patterns are present. With time and practice, you'll develop a better intuition for how to best handle gaps in your data.
References
- Vito, Saverio. (2016). Air Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C59K5F.