5 Must-Know Techniques for Mastering Time-Series Analysis


(Yes, I tried to generate time-series plots with an AI tool. I'm actually surprised by the result.)

Why You Need to be Careful with Time-Series Analysis

Analyzing time-series data is, most of the time, not straightforward.

This kind of data has unique particularities and challenges that aren't typically found with other datasets.

For example, the temporal order of observations must be respected, and when data scientists do not take that into account, it leads to poor model performance or, worse, entirely misleading predictions.

We will address these challenges using a real dataset, ensuring that the results are reproducible through the provided code examples in this article.

Without properly handling time-series data, you risk creating a model that appears to work during training but falls apart in real-world applications.

This is because time-series data is fundamentally different – it evolves over time, so patterns such as seasonality and trends, and properties such as stationarity, must not be forgotten!

Neglecting key aspects – how to split data while preserving temporal integrity, how to account for seasonality, and how to handle missing values – can result in data leakage or biased model evaluations.

Let's explore these important aspects together.


Hello there!

My name is Sara, and I am a Data Scientist specializing in AI Engineering. I hold a Master's degree in Physics and later transitioned into the exciting world of Data Science.

I write about data science, artificial intelligence, and data science career advice. Make sure to follow me and subscribe to receive updates when the next article is published!

Sara's Free Data Science Resources


In this article we will cover:

  1. Seasonality and Trends: Spotting the Patterns in Your Data
  2. Feature Engineering for Time-Series
  3. Time-Series Data Splitting – Avoid Data Leakage
  4. How to do Cross-Validation with Time-Series Data
  5. Stationarity and Transformations
  6. Bonus: Detecting and Handling Outliers

Loading the Data

We are going to use Kaggle's Electric Production time-series dataset that can be found and downloaded here.

import pandas as pd

# Preprocess the data: convert 'DATE' to datetime and set it as the index
file_path = 'Electric_Production.csv'
df = pd.read_csv(file_path)
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)
df.rename(columns={'IPG2211A2N': 'value'}, inplace=True)

Let's take a look at the data:

import matplotlib.pyplot as plt

# Plot the data ('DATE' is now the index, so we plot against df.index)
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['value'], label='Electric Production')

# Add titles and labels
plt.title('Electric Production Over Time')
plt.xlabel('Date')
plt.ylabel('Electric Production Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()

# Show the plot
plt.show()
Figure 1: Line plot showing electric production values over time. | Image by author

There are many EDA (exploratory data analysis) steps you can take to explore this data, but they are beyond the scope of this article.

Seasonality and Trends: Spotting the Patterns in Your Data

Understanding seasonality and trends is so important in time-series analysis.

What is Seasonality?

Seasonality refers to periodic fluctuations in your data that occur at regular intervals, such as daily, weekly, monthly, or yearly.

For example, retail sales often peak during the holiday season, while electricity consumption may spike during summer months due to air conditioning use.

You need to recognize these seasonal patterns because they can significantly impact your model's predictive power. Ignoring seasonality could lead to poor forecasts and missed opportunities.

Identifying Seasonality

One way to spot seasonality in your data is through visual inspection. Plotting your time-series data can reveal obvious cycles.

However, for a more rigorous approach, you can utilize decomposition techniques. Here's an example using STL decomposition:

from statsmodels.tsa.seasonal import STL
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame with a 'value' column
# 'period=12' sets the cycle length (monthly data, yearly seasonality);
# 'seasonal' controls the seasonal smoother and must be an odd integer >= 3
stl = STL(df['value'], period=12, seasonal=13)
result = stl.fit()
# Extract components
trend = result.trend
seasonal = result.seasonal
residual = result.resid

# Plot the decomposition
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(df['value'], label='Original Time Series')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend Component')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonal Component')
plt.legend(loc='upper right')
plt.subplot(414)
plt.plot(residual, label='Residual Component')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
Figure 2: Seasonal, trend, and residual components from STL decomposition of the time-series data. | Image by author.

From the image, we can see that the data has a yearly seasonality, with an upward trend. This is common for energy production, which often varies by season.
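If you want to put a number on what the plot shows, one rough check – my own addition, using the trend, seasonal, and residual components extracted above – is the strength-of-seasonality measure: values close to 1 mean the component dominates the residual noise.

import numpy as np

# Strength = 1 - Var(residual) / Var(component + residual), clipped at 0
seasonal_strength = max(0, 1 - np.var(residual) / np.var(seasonal + residual))
trend_strength = max(0, 1 - np.var(residual) / np.var(trend + residual))
print(f"Seasonal strength: {seasonal_strength:.2f}")
print(f"Trend strength: {trend_strength:.2f}")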

The period parameter tells STL how long one seasonal cycle is (12 for monthly data with yearly seasonality), while the seasonal parameter controls the smoothing of the seasonal component and must be an odd integer. Adjust both depending on your data's frequency (e.g., monthly, quarterly).

Note: STL assumes the data points are sequenced in time – the decomposition expects a series whose observations follow their natural temporal order.
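If you are not sure your file is already in chronological order, a one-line safeguard (my own addition) before fitting is to sort the index:

# Make sure observations are in temporal order before decomposing
df = df.sort_index()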

Understanding Trends

While seasonality captures predictable fluctuations, trends represent the long-term movement in your data.

A trend can be upward, downward, or flat. Identifying trends is vital because they can influence the overall direction of your forecasts.

For this, you can apply rolling averages to smooth out short-term fluctuations. Here's a quick way to implement that:

# Calculating a 12-month rolling average
df['rolling_avg'] = df['value'].rolling(window=12).mean()

plt.figure(figsize=(10, 5))
plt.plot(df['value'], label='Original Data')
plt.plot(df['rolling_avg'], label='12-Month Rolling Average', color='orange')
plt.title('Original Data vs. Rolling Average')
plt.legend()
plt.show()
Figure 3: Plot comparing original electric production data with a 12-month rolling average trend. | Image by author.

This graph lets you see the trend through the seasonal variations, giving a clearer view of the data's overall direction.

A 12-month rolling average will smooth out short-term fluctuations and reveal longer-term trends by averaging the data over a full year (common in time series with seasonality, like this one).

If you want more sensitivity to shorter-term trends without too much smoothing, you can try a 6-month window.

For long-term planning with less noise, a 24-month window may be better.
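If you want to see how the window size changes the picture, here is a small sketch (my own addition) comparing the three windows mentioned above on the same plot:

# Compare 6-, 12-, and 24-month rolling averages against the original series
plt.figure(figsize=(10, 5))
plt.plot(df['value'], alpha=0.3, label='Original Data')
for window in [6, 12, 24]:
    plt.plot(df['value'].rolling(window=window).mean(),
             label=f'{window}-Month Rolling Average')
plt.title('Rolling Averages with Different Window Sizes')
plt.legend()
plt.show()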

The Importance of Seasonal and Trend Awareness

Integrating seasonal and trend analysis into your time-series modeling can drastically improve your predictive accuracy.

Additionally, recognizing these patterns helps in feature engineering. You can create features that capture seasonal effects – like month or quarter – as well as lagged variables that account for trends.

This can enhance your model's understanding of the data, ultimately leading to better predictions!
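As a rough illustration of what such features can look like – a minimal sketch of my own, assuming the monthly DatetimeIndex we set up earlier; the column names are purely illustrative:

# Calendar features capture seasonal effects; lagged values carry trend information
df_feat = df.copy()
df_feat['month'] = df_feat.index.month          # within-year seasonality
df_feat['quarter'] = df_feat.index.quarter      # coarser seasonal signal
df_feat['lag_1'] = df_feat['value'].shift(1)    # previous observation
df_feat['lag_12'] = df_feat['value'].shift(12)  # same month last year
df_feat = df_feat.dropna()                      # the first rows have no lag values
print(df_feat.head())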


Understanding Stationarity in Time-Series Data: Key Transformations Explained

Let's explore another crucial aspect of time-series analysis that often trips up even seasoned data scientists: stationarity.

What's the Big Deal with Stationarity?

Stationarity is like the zen state of your time-series data.

It means that the statistical properties of your data – like mean and variance – don't change over time.

Why is this important?

Well, many time-series models assume that your data is stationary. If it's not, your model might not be reliable at all.

Here is an example of how you could check for stationarity:

from statsmodels.tsa.stattools import adfuller

def check_stationarity(timeseries):
    # Drop NaNs (e.g. from rolling-average columns) before testing
    result = adfuller(timeseries.dropna(), autolag='AIC')
    return result[1]  # Return the p-value

# Assuming 'df' is your DataFrame with one or more columns
for column in df.columns:
    p_value = check_stationarity(df[column])
    print(f"Column '{column}': p-value = {p_value}")
    if p_value <= 0.05:
        print(f"  The series '{column}' is likely stationary")
    else:
        print(f"  The series '{column}' is likely non-stationary")
    print()

It printed:

Column 'value': p-value = 0.18621469116586592
The series 'value' is likely non-stationary

Here we are employing the ADF (Augmented Dickey-Fuller) test.

Very broadly, the ADF test examines whether a time-series is stationary by testing the null hypothesis that a unit root is present in the series, which would indicate non-stationarity.

In this case, the dataset is non-stationary.

If your dataframe has multiple columns that you want to test, you'd need to apply the test to each column separately.

If the p-value is less than 0.05, great! Your data is likely stationary. If not, don't panic – there are solutions.

Transforming Non-Stationary Data

If your data isn't stationary, here are some common transformations you can try:

  1. Differencing: Subtract each observation from the previous one.
df['diff'] = df['value'].diff()

Differencing removes the trend, so the resulting diff column is much closer to stationary – which is what many forecasting models need.

  2. Log Transformation: Great for data with exponential trends.
import numpy as np
df['log'] = np.log(df['value'])
  3. Moving Average: Smooth out short-term fluctuations (see the small detrending sketch right after this list).
df['MA'] = df['value'].rolling(window=12).mean()
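Note that smoothing on its own does not make the series stationary. A common follow-up – a small sketch of my own, reusing the MA column and the check_stationarity function defined earlier – is to subtract the rolling mean and test the detrended result:

# Subtract the 12-month rolling mean to remove the trend, then re-test
df['detrended'] = df['value'] - df['MA']
print(check_stationarity(df['detrended'].dropna()))  # p-value of the detrended series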

Rechecking for Stationarity After Transformations

After applying transformations to your time-series data, it's crucial to verify if you've achieved stationarity!

print("Stationarity check for original data:")
check_stationarity(df['value'])
print("nStationarity check after differencing:")
# Drop NaN values before checking for stationarity
check_stationarity(df['diff'].dropna())

Reversing Transformations for Predictions

Now, let's tackle one last pro tip: reversing transformations.

When you make predictions using a model trained on transformed data, you need to "undo" these transformations to get meaningful results.

Here's how you might do this (with a hypothetical model and dataset, not the one used in this article):

import numpy as np

# Let's say we've trained a model on log-differenced data,
# i.e. d_t = log(x_t) - log(x_{t-1}), and made some predictions
log_diff_predictions = model.predict(X_test)

# We need the last actual value to start the reconstruction
last_actual_value = df['value'].iloc[-1]
original_scale_predictions = []
for diff in log_diff_predictions:
    # Reverse both transformations at once:
    # log(x_t) = log(x_{t-1}) + d_t  =>  x_t = x_{t-1} * exp(d_t)
    prediction = last_actual_value * np.exp(diff)
    original_scale_predictions.append(prediction)
    last_actual_value = prediction

# Now 'original_scale_predictions' contains our forecasts
# in the original scale of our data
print("Original scale predictions:")
print(original_scale_predictions)

In this example, we reverse the log-differencing by multiplying the last known actual value by np.exp() of each predicted log-difference, carrying each new prediction forward as the starting point for the next step.

Why is this important? Imagine presenting a sales forecast to your CEO:

"Our model predicts next month's sales will be 1.2."

CEO: "1.2 what? Millions? Thousands?"

You: "Oh, that's the natural log of the difference from last month's sales."

CEO:
