An Introduction To Deep Learning For Sequential Data

Sequential data like time series and natural language require models that can capture ordering and context. While time series analysis focuses on forecasting based on temporal patterns, natural language processing aims to extract semantic meaning from word sequences.
Though the tasks are distinct, both data types exhibit long-range dependencies, where distant elements influence predictions. As deep learning has advanced, model architectures initially developed for one domain have been adapted to the other.
Sequential data
Time series and natural language both have a sequential structure, where the position of an observation in the sequence matters greatly.


A time series is a set of observations over time that are ordered chronologically and sampled at fixed time intervals. Some examples include:
- Stock prices every day
- Server metrics every hour
- Temperature readings every second
The key attribute of time series data is that the ordering of observations is meaningful. Values nearby in time are usually highly dependent – knowing recent values gives insight into predicting the next value. Time series analysis aims to model these temporal dependencies to understand patterns and make forecasts.
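To make this concrete, here is a minimal sketch using pandas (the prices are invented): a time series is just a one-dimensional array of values attached to a fixed-frequency time index.

```python
import pandas as pd

# A toy daily price series (values invented), sampled at a fixed interval.
prices = pd.Series(
    [101.2, 101.8, 102.5, 101.9, 103.4, 104.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Ordering matters: a naive forecast simply carries the last observed value forward.
naive_forecast = prices.iloc[-1]
print(prices)
print("Naive forecast for the next day:", naive_forecast)
```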
Text data is also sequential – the order of words conveys meaning and context. For example:
- John threw the ball
- The ball threw John
While both sentences contain the same words, their meaning changes entirely based on word order. Language models must capture these ordering relationships, which are the key to natural language tasks like translation and summarization.
Both time series and text exhibit long-range dependencies – values far apart in the sequence still influence each other. Also, local patterns repeat across different locations.
Time series and text representation in neural networks
Text data needs to be converted into embeddings to make it readable by a machine.
Vector representations called embeddings are learned from large datasets to capture semantic meaning and relationships between words or data points. Each element of an embedding vector encodes a different semantic property, representing words or data points in a dense, low-dimensional form that machine learning models can work with. Embeddings can be pre-trained on large corpora and then fine-tuned for specific tasks.
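As a rough sketch (PyTorch shown; the toy vocabulary and sizes are arbitrary), an embedding layer is essentially a learnable lookup table that maps token IDs to dense vectors:

```python
import torch
import torch.nn as nn

# Toy vocabulary: each word gets an integer ID (chosen here for illustration).
vocab = {"john": 0, "threw": 1, "the": 2, "ball": 3}
token_ids = torch.tensor([[0, 1, 2, 3]])  # "john threw the ball", batch of one

# The embedding layer maps each ID to a dense, low-dimensional vector
# whose entries are learned during training.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 8]): (batch, sequence length, embedding dim)
```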

There is more to keep in mind when analyzing time series, such as trends and seasonality. But when it comes to how these data are represented in a neural network, the difference between text and time series ultimately boils down to this: a time series is a sequence of values, while text becomes a sequence of vectors.
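In tensor terms (PyTorch shown; the shapes are illustrative, not prescriptive), the distinction is mostly one of input width:

```python
import torch

batch, seq_len, emb_dim = 32, 50, 128

# A univariate time series enters the network as one value per timestep...
series_batch = torch.randn(batch, seq_len, 1)

# ...while text, after the embedding layer, is one vector per token.
text_batch = torch.randn(batch, seq_len, emb_dim)

print(series_batch.shape, text_batch.shape)
```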
Tasks for sequential data
When examining sequential data, the most intuitive next step would be to predict what comes next in the sequence.
In time series forecasting, you're trying to predict a continuous value (like tomorrow's stock price or next week's temperature) based on past data. The model is trained to minimize the difference between its predictions and the actual values, a common characteristic of regression tasks.
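A minimal sketch of that training objective (PyTorch; the linear model and synthetic data are placeholders for a real forecasting network and dataset):

```python
import torch
import torch.nn as nn

window, horizon = 24, 1
model = nn.Linear(window, horizon)   # stand-in for any forecasting network
loss_fn = nn.MSELoss()               # regression objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

past = torch.randn(64, window)       # 64 windows of past observations (synthetic)
future = torch.randn(64, horizon)    # the values that actually followed

prediction = model(past)
loss = loss_fn(prediction, future)   # difference between predictions and actuals
loss.backward()
optimizer.step()
```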
Text generation – or, more appropriately, next-token prediction – consists of training a model to predict the next token given the previous ones. Autoregressive language modeling can be viewed as a multi-class classification problem, where each possible token is treated as a separate class. The output is a probability distribution over all possible tokens in the vocabulary.
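The classification view can be sketched as follows (PyTorch; the vocabulary size and logits are made up): the model emits one score per token in the vocabulary, a softmax turns the scores into a probability distribution, and training minimizes cross-entropy against the token that actually came next.

```python
import torch
import torch.nn as nn

vocab_size = 10_000
logits = torch.randn(1, vocab_size)               # scores from a language model (synthetic here)

next_token_probs = torch.softmax(logits, dim=-1)  # distribution over all possible next tokens
target = torch.tensor([42])                       # index of the token that actually followed

loss = nn.CrossEntropyLoss()(logits, target)      # exactly a multi-class classification loss
```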

Other tasks include sentence classification – categorizing sentences or documents into predefined classes – and time series classification. An example of the former is sentiment analysis, where each text is categorized as Positive or Negative. Time series can be classified too: for example, heartbeats can be labeled Healthy or Diseased to detect anomalies.

Here, the models require training on datasets of manually annotated examples to learn how to map textual or time series features to categorical labels.
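For instance, a sketch of a sentiment classifier with scikit-learn (the tiny labeled dataset is invented purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually annotated examples (toy data).
texts = [
    "great movie, loved it",
    "terrible plot and acting",
    "what a fantastic film",
    "boring and way too long",
]
labels = ["Positive", "Negative", "Positive", "Negative"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["an absolutely wonderful story"]))
```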
Modeling sequential data
Before today's powerful neural networks were created for time series forecasting and natural language processing, different models were typically used for these tasks.
Statistical methods like autoregressive integrated moving average (ARIMA) and exponential smoothing models were popular for time series forecasting before the 2010s. They rely on mathematical relationships between past values of a time series to predict future values. While effective on some data, they make rigid assumptions that limit performance on complex real-world time series.
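For reference, fitting such a statistical model typically looks like this (statsmodels shown; the synthetic series and the ARIMA order are placeholders):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# A synthetic random-walk series standing in for real data (e.g. daily sales).
series = np.cumsum(np.random.randn(200))

# order=(p, d, q): autoregressive lags, differencing, moving-average lags.
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()

forecast = fitted.forecast(steps=7)  # predict the next 7 values
print(forecast)
```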
In NLP, tasks like language translation and speech recognition were historically addressed using rule-based systems. These encode human-crafted rules and grammar, requiring extensive manual effort and struggling with the nuance and variability of real human language. Alternatively, naive Bayes, logistic regression, and other classical machine learning models were sometimes applied, but they could not effectively capture long-term context and dependencies in textual data.

The introduction of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks enabled contextual learning for time series forecasting and NLP. Rather than relying on rigid statistical assumptions or simple input-output mappings, RNNs can learn long-range dependencies from sequential data. This allowed them to excel on problems like language modeling, sentiment analysis, and non-linear forecasting where classical approaches fall short. Though introduced in the 1980s, these models only became practical in the last decade, as computational power dramatically increased. Google started using LSTMs in Google Voice in 2015. [1]
Recurrent Neural Networks
RNNs contain recurrent connections that allow information to persist across timesteps.
When working on forecasting, the RNN can be trained on past observations from a time series to learn the temporal patterns. The RNN processes the sequence by updating its hidden state at each time step based on the current input and previous hidden state.
For next token prediction, the RNN is trained on textual sequences like sentences where each token is a word. The RNN learns to predict the next word based on the previous words. The hidden state maintains the context of earlier words to inform the next prediction. At each step, the RNN outputs a probability distribution over the next token.
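A compact sketch of that loop (PyTorch; the sizes are arbitrary): the hidden state is updated at every step from the current input and the previous hidden state, and a final linear layer turns the last hidden state into logits over the next token.

```python
import torch
import torch.nn as nn

class NextTokenRNN(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq, emb_dim)
        out, _ = self.rnn(x)           # hidden state updated at each timestep
        return self.head(out[:, -1])   # logits over the next token

model = NextTokenRNN()
logits = model(torch.randint(0, 10_000, (8, 20)))  # batch of 8 sequences, 20 tokens each
probs = torch.softmax(logits, dim=-1)              # distribution over the next token
```

For forecasting, the same structure applies: swap the embedding for raw values and the vocabulary-sized head for a single output unit.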


The ability of RNNs to remember past context has been transformational for sequence tasks in NLP and time series analysis. However, they can struggle with long-term dependencies due to issues like vanishing and exploding gradients. This limitation motivated architectural advances such as LSTMs, which improve gradient flow across many timesteps, and later attention-based models.
Transformers
Attention mechanisms made possible all the amazing LLMs we know today. They were initially introduced to augment RNNs by allowing models to focus on relevant parts of the input sequence when making predictions. Attention functions score the importance of each timestep and use these weights to extract relevant context.
Attention has become an indispensable component for sequence tasks across NLP and time series modeling. It improves model accuracy and interpretability by focusing on relevant inputs.
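In its simplest, scaled dot-product form, this can be sketched as follows (PyTorch; the shapes are illustrative): every timestep is scored against a query, and the resulting weights produce a context vector.

```python
import math
import torch

batch, seq_len, dim = 2, 10, 64
query = torch.randn(batch, 1, dim)         # what the model is currently trying to predict
keys = torch.randn(batch, seq_len, dim)    # one key per timestep
values = torch.randn(batch, seq_len, dim)  # one value per timestep

scores = query @ keys.transpose(1, 2) / math.sqrt(dim)  # importance of each timestep
weights = torch.softmax(scores, dim=-1)                  # attention weights sum to 1
context = weights @ values                               # weighted sum = relevant context
print(weights.shape, context.shape)  # (2, 1, 10) (2, 1, 64)
```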

The Transformer architecture, which relies entirely on self-attention, has led to breakthrough results in NLP and time series modeling. Its self-attention layers can model dependencies between sequence elements regardless of their distance, as long as the sequence fits within the context length.
Transformers have become state-of-the-art for sequential data, with the architecture adapted to NLP as BERT and to time series as the Temporal Fusion Transformer.
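Using the building blocks that ship with PyTorch (layer sizes chosen arbitrarily for the sketch), a small self-attention encoder over a sequence of embeddings looks roughly like this:

```python
import torch
import torch.nn as nn

d_model = 128
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

x = torch.randn(16, 50, d_model)  # (batch, sequence length, embedding dim)
contextualized = encoder(x)       # every position can attend to every other position
print(contextualized.shape)       # torch.Size([16, 50, 128])
```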
Towards Foundation Models for Time Series
A foundation model is a large machine learning model that is trained on vast amounts of data and can then be adapted to a variety of tasks. Foundation models differ from traditional machine learning models, which usually perform a single specific task. They are more general and flexible and can serve as a starting point for developing more specialized applications. Avoiding expensive training from scratch can significantly reduce the time and cost of building new applications.
In NLP, Large Language Models allow in-context learning – they can perform new tasks they weren't explicitly trained for. This revolutionary capability makes ChatGPT and other LLMs so powerful, as they can generalize to a wide variety of tasks.
Most current forecasting approaches must be fit individually to each new dataset. This process is time-consuming and requires domain expertise. To address this problem, the concept of foundation models has recently been applied to time series data.

TimeGPT is a Transformer-based neural network pre-trained on a diverse dataset of over 100 billion time series data points encompassing domains like economics, weather, transport, retail sales, etc. The key innovation is that, like GPT-3, TimeGPT can generalize to make accurate forecasts on new time series data without retraining on each new dataset. This zero-shot ability provides immense time and resource savings compared to traditional forecasting pipelines. A foundation model simplifies forecasting to a single model that can be applied to any time series with just a few lines of code. [2]
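As an illustration of what such a pipeline can look like, here is a sketch loosely based on Nixtla's TimeGPT client – the class and argument names reflect my understanding of that API and may differ from the current version, and the API key and dataframe are placeholders:

```python
import pandas as pd
from nixtla import NixtlaClient  # assumed client; check Nixtla's docs for the current API

client = NixtlaClient(api_key="YOUR_API_KEY")  # placeholder key

# A dataframe with a timestamp column and a value column (toy data).
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=100, freq="D"),
    "y": range(100),
})

# Zero-shot forecast of the next 14 days, with no model fitting on our side.
forecast = client.forecast(df=df, h=14, time_col="ds", target_col="y")
print(forecast.head())
```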
Takeaways
When doing deep learning, think outside the box. Data and models have more in common than they appear – everything is interconnected. Both time series analysis and NLP are rapidly innovating and sharing ideas.
Time series and NLP share many parallels as sequential data types. We model both with architectures such as RNN, LSTM, and Transformers. As deep learning advances, we expect techniques to continue crossing over between these domains.
The 2010s were the decade of neural networks conquering domains once dominated by statistical models. The 2020s look set to be the decade of transformers cementing their dominance, and researchers continue pushing the boundaries of these formidable models.
References
[1] Long short-term memory – Wikipedia