The Return of the Fallen: Transformers for Forecasting


Lately, there has been a significant surge in the adoption of Transformer-based approaches. The remarkable achievements of models like BERT and ChatGPT have encouraged researchers to explore this architecture in various areas, including time series forecasting. However, recent work by researchers at the Chinese University of Hong Kong and the International Digital Economy Academy (IDEA) showed that the Transformer implementations developed for this task were less than optimal and could be beaten by a simple linear model on various benchmarks [1].

In response, researchers at Princeton and IBM proposed PatchTST (Patched Time Series Transformer) in their paper A Time Series is Worth 64 Words [2]. In this paper, Nie et al. introduce two key mechanisms that bring Transformers back to the forecasting arena:

  1. Patching: instead of point-wise attention over individual timestamps, the model attends over fixed-length segments (patches) of the time series, each treated as a single token.
  2. Channel independence: different target series in a multivariate dataset are processed independently of each other, each producing its own attention patterns.

In this post, I aim to summarize how these two mechanisms work and discuss the implications of the results reported by Nie et al. [2].

Background: Are Transformers Effective?

Before we dive into PatchTST, we first need to understand the problems that Zeng et al. [1] identified with self-attention in the forecasting space. For those interested in a detailed summary, I highly encourage reading the original paper or the summary I have written on their work:

Are Transformers Effective for Time Series Forecasting?

Do Transformers Lose to Linear Models?

In summary, self-attention has a few key problems when applied to forecasting. More specifically, prior time-series Transformers used point-wise self-attention, where each individual timestamp is treated as a token. This has two main issues. First, it makes the attention permutation-invariant: reordering the input points produces essentially the same attention values. Second, a single timestamp carries little information by itself and derives its importance from the timestamps around it. The language-processing parallel would be attending to individual characters instead of whole words.
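To see the permutation issue concretely, here is a minimal PyTorch sketch (my own illustration, not code from either paper): with no positional information, self-attention treats the input as an unordered set, so shuffling the time steps simply shuffles the outputs without changing their values.

```python
import torch
from torch import nn

# Plain self-attention over a univariate window, with no positional encoding.
torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.randn(1, 10, 16)   # (batch, time steps, embedding dim)
perm = torch.randperm(10)    # a random reordering of the time steps

out_original, _ = attn(x, x, x)
out_permuted, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# The permuted input yields the same outputs, just in permuted order:
# the attention itself cannot tell the two orderings apart.
print(torch.allclose(out_original[:, perm], out_permuted, atol=1e-5))  # True
```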

These problems led to a few interesting results when testing forecasting transformers:

  1. Transformers seemed highly prone to fitting random patterns: adding noise to the input data did not significantly degrade their performance.
  2. Longer lookback periods did not improve accuracy, indicating that the Transformers were unable to pick up on significant temporal patterns.

Here Comes PatchTST

In an attempt to address the issues that transformers have in this space, Nie et. al [2] introduced two main mechanisms that differentiate PatchTST from prior models: Channel Independence and Patching.

In previous works, all target time series would be concatenated into a matrix where each row is a single series and the columns are the input tokens (one for each timestamp). These input tokens would then be projected into the embedding space, and the resulting embeddings were passed into a single attention layer.

PatchTST instead opts for channel independence, where each series is passed independently through the Transformer backbone (Figure 1.b). Every series therefore produces its own attention patterns (the backbone weights themselves are shared across channels), allowing the model to specialize to each series. A similar approach is commonly used in convolutional networks and has been shown to significantly improve accuracy [2]. Channel independence also enables the use of the second mechanism: patching.
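As a rough sketch of what channel independence looks like in code (the wrapper and names below are my own illustration, not the official PatchTST implementation), a multivariate batch of shape (batch, series, lookback) can be flattened so that each series runs through a shared backbone as if it were its own sample:

```python
import torch
from torch import nn

class ChannelIndependentWrapper(nn.Module):
    """Run each series of a multivariate input through the backbone independently."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # any model mapping (batch, lookback) -> (batch, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, n_series, seq_len = x.shape
        x = x.reshape(batch * n_series, seq_len)  # treat each series as its own sample
        y = self.backbone(x)                      # shared weights; each series processed on its own
        return y.reshape(batch, n_series, -1)     # back to (batch, n_series, horizon)

# Example with a stand-in linear backbone: 3 series, lookback 336, horizon 96.
wrapper = ChannelIndependentWrapper(nn.Linear(336, 96))
print(wrapper(torch.randn(8, 3, 336)).shape)      # torch.Size([8, 3, 96])
```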

Figure 1: PatchTST Model Overview (Figure from Nie et. al 2023 [2])

As mentioned before, attending to single time steps is like attending to single characters. You lose a sense of order ("dog" and "god" would have the same attention values with character-wise attention) and also increase the memory usage of the model at hand. So what's the solution? We need to attend to words, obviously!

Or more accurately, the authors propose splitting each input time series into fixed-length patches [2]. These patches are then passed through dedicated channels as the input tokens to the main model, so the patch length is the token size [2]. The model adds a positional encoding to each patch and runs the resulting sequence through a vanilla Transformer [3] encoder.
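Here is a hedged sketch of the patching step, assuming the hyperparameters reported in the paper (patch length 16, stride 8) and using plain PyTorch layers in place of the official implementation:

```python
import torch
from torch import nn

patch_len, stride, d_model = 16, 8, 128

series = torch.randn(32, 336)   # a batch of univariate channels, lookback window of 336

# Pad the end by repeating the last value `stride` times, then slice the series
# into overlapping windows -> 42 patches of length 16 per channel.
padded = torch.cat([series, series[:, -1:].repeat(1, stride)], dim=-1)
patches = padded.unfold(-1, patch_len, stride)        # (32, 42, 16)

# Each patch becomes one token: project it into the embedding space,
# add a learnable positional encoding, and feed a vanilla Transformer encoder.
embed = nn.Linear(patch_len, d_model)
pos = nn.Parameter(torch.randn(patches.shape[1], d_model))
tokens = embed(patches) + pos                         # (32, 42, 128)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=3,
)
print(encoder(tokens).shape)                          # torch.Size([32, 42, 128])
```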

This approach naturally allows PatchTST to capture local semantic information that is lost with point-wise tokens. Segmenting the time series also significantly reduces the number of input tokens needed, allowing the model to capture information from longer sequences while dramatically reducing the memory needed to train and predict [2]. Finally, the patching mechanism makes representation learning on time series viable, making PatchTST an even more flexible model [2].
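To make the token reduction concrete (a back-of-the-envelope calculation using the settings reported in the paper: patch length 16, stride 8, with the end of the series padded): a lookback window of 336 points would need 336 point-wise tokens, but patching turns it into (336 - 16)/8 + 2 = 42 tokens. Since attention cost grows quadratically with the number of tokens, that is roughly a (336/42)^2 = 64x reduction in the attention computation.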

Forecasting Experimental Results

Figure 2: MSE & MAE results on benchmarks. First place is bolded, second place is underlined. (Figure from Nie et. al 2023)

In their paper, the authors test two variations of PatchTST: one with 64 patches fed into the model (hence the title) and one with 42 patches. The 42-patch variant has the same lookback window as the other models, so it can be thought of as a fair comparison to them. For both variants, a patch length of 16 and a stride of 8 were used to construct the input tokens [2]. As seen in Figure 2, the PatchTST variants dominate the results, with DLinear [1] winning in only a handful of cases. On average, compared with the best prior Transformer-based models, PatchTST/64 achieved a 21% reduction in MSE and a 16.7% reduction in MAE, while PatchTST/42 achieved a 20.2% reduction in MSE and a 16.4% reduction in MAE [2].

Conclusion

PatchTST represents a promising future for Transformer architectures in time-series forecasting, especially since patching is a simple and effective operation that can easily be adopted by future models. Additionally, PatchTST's accuracy gains suggest that time-series forecasting is indeed a complex task that can benefit from models capable of capturing complex non-linear interactions.

Resources and References

  1. PatchTST Github repository: https://github.com/yuqinie98/PatchTST
  2. An implementation of PatchTST can be found in NeuralForecast: https://nixtla.github.io/neuralforecast/
  3. If you are interested in neural forecasting architectures that aren't transformers, consider reading my previous article on neural basis expansion analysis networks: https://towardsdatascience.com/xai-for-forecasting-basis-expansion-17a16655b6e4

If you are interested in Forecasting, Deep Learning, and Explainable AI, consider supporting my writing by giving me a follow!

References

[1] A. Zeng, M. Chen, L. Zhang, Q. Xu. Are Transformers Effective for Time Series Forecasting? (2022). Thirty-Seventh AAAI Conference on Artificial Intelligence.

[2] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (2023). International Conference on Learning Representations, 2023.

[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention Is All You Need (2017). 31st Conference on Neural Information Processing Systems.
