Time Series Regression and Cross-Validation: A Tidy Approach

Author:Murphy  |  View: 29966  |  Time: 2025-03-22 21:14:44
Photo by Aron Visuals on Unsplash

Time series Forecasting approaches are always evolving and while ARIMA has been the foundation for eons, machine learning models also show great promise. There are scenarios across industries that can benefit from more accurately modeling their data over time. In this article, I am going to tackle one such scenario – forecasting sales. Let's get straight into it to save time (get it?).


Code

Code for recreating everything in this article can be found at my GitHub repo.

Dataset

For this exercise, I use a Time Series Practice dataset by Samuel Cortinhas, available under a CC0: Public domain from Kaggle. This dataset contains simulated time series data covering 10 years (2010–2019) and includes the features date, store id, product id and sales. For this analysis, I picked a single store and product to focus on the regression components.


Time series analysis

Step 0: The set up

I will be using the following packages to build my data exploration and regression. Prior to loading, they can be installed using the command install.packages("package_name").

# Load required libraries
library(tidyverse)
library(lubridate)
library(tidymodels)
library(modeltime)
library(timetk)
library(viridis)

Step 1: Dates and everything they can be!!

Dates are my favorite variable, because a single series of dates is packed with so much information. In this scenario, I will create new features from the date column to explore their relationship with sales later. But first, I will clean the date column using as.Date(), because it is not of much use as a character string.

data <- data |> 
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

Next, I will create new features from this date column for regression. Lubridate package provides a series of handy functions to make this process simpler.

# Preprocess data to include time-related factors
df <- data |> 
  mutate(
    year = year(date),  
    semester = factor(semester(date)),  
    quarter = factor(quarter(date)),  
    day_in_week = factor(wday(date, label = TRUE)),  
    week_in_year = factor(week(date)),  
    day_in_year = factor(yday(date)), 
    month = factor(month(date, label = TRUE)) 
  )

This is what the data now looks like:

Step 2: Exploratory Data Analysis

With all these new interesting features generated from a single date column, I will explore their relationship with sales. I will start with yearly seasonality.

df %>%
  ggplot(aes(date, sales)) +
  geom_line(alpha = 1, size = 1, color = "darkblue") +  
  theme_bw() +
  labs(title = "Daily Sales Distribution Over Time", x = "Date", y = "Sales") +
  scale_x_date(date_labels = "%Y", date_breaks = "2 years") +  
  scale_y_continuous(labels = scales::comma) 

There is clear seasonality and certain trend in the sales data over time. I will now look at its relationship with the day of week.

df |>
  ggplot(aes(day_in_week, sales, color = day_in_week)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.1) +
  theme_bw() +
  scale_colour_viridis_d() +
  labs(title = "Daily Sales Distribution by Day of Week", x = "Day of the Week", y = "Sales") +
  scale_x_discrete() +
  scale_y_continuous(labels = scales::comma)

Now, I will look at distribution by month.

df |> 
  ggplot(aes(month, sales)) +
  geom_violin(color = "darkgreen") +  
  geom_jitter(alpha = 0.2, aes(color = sales)) +  
  theme_light() +
  geom_smooth(method = "loess", se = FALSE) +  
  scale_colour_viridis_c() +
  labs(title = "Daily Sales Distribution by Month", x = "", y = "Sales", color= "Sales")

Now, let's move on to the modeling.

Step 3: Split the data and set up cross-validation

Train and test set

I will use the initial_time_split() function from tidymodels to split my data into training and testing set.


df_split <- df |> 
  initial_time_split(prop=0.9)  # 90% of data for training, 10% for testing

df_train <- training(df_split)
df_test <- testing(df_split)

This is what the split looks like:

Setting up cross-validation

Cross-validation can be used to evaluate models by training them on subsets of input data and evaluating them on the remaining data. Regular cross-validation methods randomly pick data points to assign them to training dataset. However, this method is not suitable for time series because the data needs to be in sequence. Timetk package has a great function called time_series_cv() that can be used specifically to create cross-validation folds for time series datasets and visualize those splits.

# Create the cross-validation folds
df_folds <- 
  time_series_cv(
    df_train, 
    initial = "3 years", 
    assess = "1 year", 
    skip = "6 months",
    slice_limit = 5)  

# Visualize the splits
plot_time_series_cv_plan(df_folds, date, sales)

Step 4: Data preparation for modeling

A recipe is an object that can be used to create new predictors and conduct some preprocessing required by the model. I will create two recipe objects because I want to try two models, one for auto ARIMA and another for random forest. The first step in a recipe() is the formula for the regression.

recipe_autoarima <- 
  recipe(sales ~ date,
         data = df_train)

For the second random forest recipe, I will add a feature using step_holiday(), which converts date data into one or more binary indicator variables for common holidays. I will remove the date column using step_rm() and turn all nominal predictors to dummy variables using step_dummy().

recipe_rf <- 
  recipe(sales ~ ., data = df_train) |>
  step_holiday(date, holidays = timeDate::listHolidays("US")) |>  
  step_rm(date) |>  
  step_dummy(all_nominal_predictors())

This is what a recipe object looks like:

Recipe objects are great because they help create a sequence of steps that are replicable across training and testing datasets without having to write long feature engineering code. I think of it as feature engineering shorthand!

The command below can be run to understand how the recipe object updates a dataset and all the new and edited columns it creates.

recipe_rf |> prep() |> bake(new_data = NULL)

Step 5: Model specifications and Resampling

Once the recipes are created, I will create a set of specifications for each of the two models. I will use Modeltime for ARIMA and Tidymodels for random forest. These provide a great way to create specifications for multiple models with a consistent flow of steps.

# Model specification for auto arima
auto_arima_spec <- arima_boost() |> 
  set_mode("regression") |> 
  set_engine('auto_arima_xgboost')

# Model specification for random forest
rf_spec <- 
  rand_forest(trees = 500) |>  
  set_mode("regression") |> 
  set_engine("ranger")

Step 6: Workflow sets

A workflow() is an object that can bundle together pre-processing, modeling, and post-processing requests. When using multiple models and resampling, I found it more convenient to use workflow sets.

I will now combine the recipe objects with the corresponding model specifications using workflow_set(). By default the cross parameter is set to TRUE, which makes each recipe object match with each model specification. In this case I will set it to FALSE because I want to match each recipe object with its respective model specification.

workflowset_df <- 
  workflow_set(
    list(recipe_autoarima, recipe_rf),
    list(auto_arima_spec, rf_spec),
    cross=FALSE
  )

This is what the workflow object looks like. Note that it does not currently have any results, but placeholders to store once the cross-validation is done.

Step 7: Time to fit the models

So far, I have created new features, preprocessed the data, created model specifications, and built workflow sets. Now I will fit the models using training set and resample folds. I will store the results in df_results to analyze later.

df_results <-
  workflow_map(
    workflowset_df,
    "fit_resamples",
    resamples = df_folds
  )

This will loop over all the workflows in workflowset_df, applying the fit_resamples function to each one, using the df_folds cross-validation objects. The results of each of these runs is stored under the respective model row in df_results. Note that the result column is now populated with the results of cross-validation. These results can be extracted using unnest() if needed.

The results can also be can be quickly visualized using theautoplot() command, which is a really cool feature of tidymodels.

autoplot(df_results)  

It looks like across all the resamples, random forest has better RMSE (Root Mean Squared Error) and R-Squared as compared to ARIMA.

Step 8: Fit the testing data

I will now identify the best performing model based on lowest RMSE by ranking the results from df_results. I will retrieve its id and parameters and fit this optimized model to the training data. Then, I will use this model to evaluate on the test dataset and check the new metrics.

# Get the id of the workflow that has the highest rmse
best_workflow_id <- df_results %>%
  rank_results(rank_metric = "rmse") %>%
  head(1) %>%
  pull(wflow_id)  

## Get the parameters associated with the best_workflow_id.
best_params <- df_results %>%
  extract_workflow_set_result(id = best_workflow_id) %>%
  select_best(metric = "rmse")  

## Get the workflow associated with the best_workflow_id.
best_workflow <- df_results %>%
  extract_workflow(id = best_workflow_id)  

# Finalize the workflow with the best parameters
finalized_workflow <- finalize_workflow(best_workflow, best_params) 

finalized_workflow

This is what the final workflow looks like:

I will now use this workflow to predict on the test dataset using fit() and augment(). I will measure the metrics based on these predictions.

predictions <- finalized_workflow %>%
  fit(df_train) %>%
  augment(df_test)  

## Calculate evaluation metrics
evaluation_metrics <- metric_set(rmse, mae, rsq)
results <- evaluation_metrics(predictions, truth = sales, estimate = .pred) 
print(results)  

This is what the final metrics look like:

Pretty good!


You made it!

I hope this article made it clear how tidymodels, modeltime and timetk provide a powerful framework to build time series regression models and that you will give it a go! While I created this for a single store and product id, this could be a great usecase for building a Shiny web application that can replicate this for any store and product combination. Happy coding.


Code

Code for recreating everything in this article can be found at my GitHub repo.


If you'd like, find me on Linkedin.


Unless otherwise mentioned, all images in the article are by the author.

Tags: Crossvalidation Data Science Forecasting Machine Learning Time Series Analysis

Comment