Forever Learning: Why AI Struggles with Adapting to New Challenges


|AI|CONTINUAL LEARNING|DEEP LEARNING LIMITS|

image by the author using AI

"The wise adapt themselves to circumstances, as water moulds itself to the pitcher." – Chinese Proverb

"Adapt or perish, now as ever, is nature's inexorable imperative." – H. G. Wells

Artificial intelligence has made great progress in recent years. All of these systems use artificial neurons in some form, and these algorithms are inspired by their biological counterparts. For example, an artificial neuron aggregates information from the previous neurons and, if the signal exceeds a certain threshold, passes it on to the next neurons. This idea is represented by a matrix of weights and an activation function. Other examples can be found in convolutional networks (inspired by the visual cortex) or genetic algorithms. During training, the connections between neurons (represented by the weights) are strengthened or weakened, similar to the strength of neuronal synapses. This process is the basis of stochastic gradient descent (SGD) and the backpropagation algorithm, which have undergone minimal changes over several decades.
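
To make the analogy concrete, here is a minimal NumPy sketch of one layer of artificial neurons: the weight matrix plays the role of the synaptic strengths and the ReLU activation acts as the firing threshold. This is a toy illustration under those assumptions, not any particular library's implementation.

```python
import numpy as np

def dense_layer(x, W, b):
    """One layer of artificial neurons: weighted sum of the incoming
    signals followed by a ReLU activation (the 'firing threshold')."""
    pre_activation = W @ x + b               # aggregate signals from upstream neurons
    return np.maximum(0.0, pre_activation)   # fire only if the signal is positive

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # signals from 4 upstream neurons
W = rng.normal(size=(3, 4))                  # connection strengths ("synapses")
b = np.zeros(3)
print(dense_layer(x, W, b))                  # activations of 3 downstream neurons
```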

Cognition is Struggling: Natural and Artificial Brains Evolve from Constriction

Describing these similarities between artificial intelligence and its biological counterpart aids understanding, but it is also risky. First, biological systems are much more complex than people think, and the comparison relies on forced simplifications. Second, some of these comparisons are simply inaccurate. This is the case with continual learning.

In this article we will answer these questions:

  • What is continual learning? Why is it important?
  • Do all deep learning architectures suffer from loss of plasticity?
  • What causes loss of plasticity?
  • How can we solve it?

The TL;DR and the references are at the end of the article.


What is continual learning?

The brain is extremely flexible and capable of adapting to new tasks and new types of information. In contrast, neural networks have problems adapting to changes in the data stream.

As an example, large language models (LLMs) are generalist models trained on a huge amount of data during pre-training. Fine-tuning is the standard technique for teaching a model new knowledge or skills. There are two problems with fine-tuning, though:

  • During fine-tuning, most of the weights are kept frozen.
  • It can lead to forgetting previously acquired skills and knowledge, and it increases the risk of hallucinations.

This makes continual learning impractical: we risk compromising the functionality of the model, and it is difficult to balance the effect of new data on the model's pre-acquired knowledge.
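
As an illustration of the first point, here is a minimal PyTorch sketch of partial fine-tuning, assuming a generic torchvision classifier: the pre-trained backbone is frozen and only a new head is trained. The choice of ResNet-18, the 10-class head, and the hyperparameters are placeholders, not a prescription.

```python
import torch
from torchvision import models

# A minimal sketch of partial fine-tuning (hyperparameters are placeholders).
model = models.resnet18(weights="IMAGENET1K_V1")        # a generic pretrained model

for param in model.parameters():
    param.requires_grad = False                          # most weights are kept frozen

model.fc = torch.nn.Linear(model.fc.in_features, 10)    # new head for the new task
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)  # only the head is updated
```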

AI Hallucinations: Can Memory Hold the Answer?

Chat Quijote and the Windmills: Navigating AI Hallucinations on the Path to Accuracy

This stormy relationship between old and new data is currently not fully understood.

The result is that these fine-tuning techniques are far from perfect. Therefore, it is often preferred to train a new model from scratch or use other strategies such as Retrieval Augmented Generation (RAG).

However, training a new model from scratch has a huge cost, yet it is sometimes considered the only alternative. This is far from optimal: many real-world applications require a model to adapt to change (prediction of financial markets, logistics needs, control systems, and so on).

Why are neural networks unable to acquire new information?

There are two main issues:

  • Catastrophic forgetting. The model forgets what it has previously learned.
  • Loss of plasticity. The model is unable to learn new information or skills.

We will focus on loss of plasticity.

Loss of plasticity occurs when we try to keep training a pre-trained model and it is unable to learn new information or new skills. More technically:

Ideally, this new training procedure is initialized from the parameters of yesterday's model, i.e., it is "warm-started" from those parameters rather than given a fresh initialization. However, warm-starting seems to hurt generalization in deep neural networks. – [1]

Models that are trained with a warm start (as in continual learning) [10] perform worse on the test set. Continual learning thus seems to damage the model's ability to generalize, and therefore to adapt to new data.

image source: [1]
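
As a toy illustration of the warm-starting protocol (not a reproduction of the experiments in [1]; on a synthetic problem this simple the generalization gap may not appear), the comparison looks roughly like this:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data standing in for the benchmarks in [1].
X = torch.randn(2000, 20)
y = (X[:, :5].sum(dim=1) > 0).long()
X_old, y_old = X[:500], y[:500]          # "yesterday's" data
X_all, y_all = X[:1500], y[:1500]        # yesterday's data plus the new data
X_test, y_test = X[1500:], y[1500:]

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, X, y, epochs=200, lr=0.05):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

@torch.no_grad()
def accuracy(model, X, y):
    return (model(X).argmax(dim=1) == y).float().mean().item()

warm = train(train(make_model(), X_old, y_old), X_all, y_all)  # warm-started from yesterday's parameters
fresh = train(make_model(), X_all, y_all)                      # fresh initialization, same data

print("warm-started:", accuracy(warm, X_test, y_test))
print("fresh init:  ", accuracy(fresh, X_test, y_test))
```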

Some studies [2] suggest that this stems from the existence of a critical phase for learning: during the early epochs the network memorizes information (a memorization phase), which is followed in a later phase by a reduction of that information (reorganization). Altering this initial phase damages both training and generalization.

Grokking: Learning Is Generalization and Not Memorization

Other studies seem to confirm that there are two phases of learning (memorization and refinement) [3]. However, this does not explain why plasticity is lost when new data are presented. Other work suggests a role for gradient descent, the loss surface, and architectural choices (normalization, for example, seems to promote better maintenance of plasticity) [4]. However, the question remains open, and we will discuss it in detail later in this article.


Loss of plasticity is ubiquitous across deep learning models

Catastrophic forgetting is much more studied than loss of plasticity; unfortunately, very few studies have focused on the latter. Therefore, we do not know whether loss of plasticity is a general problem or a special case of particular parameter choices.

To prove it affects all deep learning models, plasticity loss should be shown to be consistent across different architectures, parameters, and training algorithms. However, today's models have billions of parameters, making a systematic investigation complex. In this study [5], the authors tried to remedy this shortcoming by testing on two main datasets: ImageNet and CIFAR-100.

ImageNet consists of 1,000 image classes (about 1M images in total) and is the best-known image classification benchmark. The authors constructed roughly 0.5M binary classification tasks (taking two classes at a time) to test the loss of plasticity in continual learning. In other words, they train a model to separate 'dogs' and 'cats', then train it on a new task (distinguishing 'leaves' and 'flowers'), and so on. Accuracy is measured at the end of the first task and then after each subsequent task. The difficulty of the tasks is the same, so if the model's performance drops, it has lost plasticity.

image source: [5]
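
For intuition, the number of tasks comes straight from the number of class pairs: with 1,000 classes there are roughly half a million possible binary problems, and a run simply walks through a long sequence of them. This is only a sketch of the construction; the exact sampling in [5] may differ.

```python
import itertools
import random

random.seed(0)
classes = range(1000)                                  # the 1,000 ImageNet classes
pairs = list(itertools.combinations(classes, 2))
print(len(pairs))                                      # 499500, i.e. roughly 0.5M binary tasks

random.shuffle(pairs)
task_sequence = pairs[:2000]                           # the sequence of tasks for one run
print(task_sequence[:3])                               # e.g. "class a vs class b", then the next pair, ...
```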

The authors tested different types of deep learning networks and different hyperparameters in this setting. With standard backpropagation, the models perform well on the first few tasks, but then they quickly lose plasticity until they perform no better than a linear model. Thus, a model that is well tuned for one task rapidly loses performance when presented with new tasks, until it falls below the baseline.

image source: [5]

The study confirms that regularization helps maintain plasticity. More generally, regularization approaches aim to keep network weights small. L2 regularization seems to help maintain plasticity, but the reason is not well understood.

image source: [5]
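
In deep learning frameworks, this L2 penalty usually enters through the optimizer's weight decay. A minimal PyTorch sketch is shown below; the coefficient 1e-4 is a placeholder, not a value taken from [5].

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 2))

# Plain SGD, with no constraint on the weight magnitudes.
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD with L2 regularization: weight_decay adds a penalty proportional to the
# squared weights, which keeps them small and helps preserve plasticity.
sgd_l2 = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```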

In a later experiment, the authors used CIFAR-100 (one of the most popular image datasets, consisting of 100 classes). They take an 18-layer ResNet [11] with residual connections (practically one of the most widely used models for computer vision) and begin training it on 5 classes. After that, they keep adding classes until they reach all 100.

After each addition, the model's performance is tested on all available classes. This can be thought of as a model being trained while its dataset is continuously enlarged (like a social network over time). Because the authors focus on plasticity and not forgetting, the old classes are not removed when new classes are added. In parallel, the authors train baseline models from scratch on all classes available up to that point (if the incremental model is first trained on five classes and then on another five in the second iteration, the from-scratch model is trained directly on 10 classes).

image source: [5]
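
A structural sketch of this class-incremental protocol in PyTorch follows. This is not the authors' code: the single pass per increment, the optimizer settings, and the omitted evaluation are placeholders meant only to show the shape of the loop.

```python
import torch
from torch import nn
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms, models

# Class-incremental CIFAR-100: add 5 classes at a time, keep the old ones.
tfm = transforms.ToTensor()
train_set = datasets.CIFAR100("data", train=True, download=True, transform=tfm)
targets = torch.tensor(train_set.targets)

def subset_with_classes(dataset, class_ids):
    idx = torch.isin(targets, torch.tensor(class_ids)).nonzero().squeeze(1)
    return Subset(dataset, idx.tolist())

model = models.resnet18(num_classes=100)            # one head for all 100 classes
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(0, 100, 5):
    seen = list(range(step + 5))                    # old classes are kept, not removed
    loader = DataLoader(subset_with_classes(train_set, seen),
                        batch_size=128, shuffle=True)
    for xb, yb in loader:                           # one pass shown; [5] trains much longer
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    # ...evaluate on all `seen` classes and compare against a model trained
    # from scratch on the same classes (omitted for brevity).
```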

Initially, incremental training seems better than retraining, but as classes are added the model loses plasticity. With more image classes, performance deteriorates more and more. After a few additions, the advantage over the baseline (a model trained from scratch) is lost, and with further classes performance degrades significantly. Again, the deterioration is smaller when regularization techniques are used (Shrink and Perturb [12] is an algorithm that also uses L2 regularization).

image source: [5]
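
For reference, the core of Shrink and Perturb is a single update that can be dropped into any training loop: scale every weight toward zero and add a little Gaussian noise. The sketch below captures that idea; the shrink factor and noise scale are placeholders, not the values used in [12].

```python
import torch

@torch.no_grad()
def shrink_and_perturb(model, shrink=0.8, noise_std=0.01):
    """One Shrink-and-Perturb step: shrink all weights toward zero and add
    small Gaussian noise, restoring some of the variability that random
    initialization provides."""
    for param in model.parameters():
        param.mul_(shrink)
        param.add_(noise_std * torch.randn_like(param))
```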

Continual learning has an important use in reinforcement learning. An agent must be able to explore and learn from the environment, and the environment can change. For example, in a video game, the first levels may be very different from the last levels and require the agent to adapt to new challenges.

In the same study [5], the authors analyze the behavior of an ant-like robot that explores its surroundings and receives rewards. Every few million steps, they change the friction coefficient so that the model has to relearn how to walk (simulating a new task for the robot).

image source: [5]

Again, the model shows a reduction in performance and a lack of plasticity. Interestingly, even in this setting, regularization techniques improve plasticity.

image source: [5]

We know now that loss of plasticity is ubiquitous, but why does it occur?


The causes of plasticity loss

The fact that regularization techniques help maintain plasticity is an indication that plasticity is related to some property of model weights. After all, regularization techniques put constraints on the weights.

In a sense, what really changes in a model over time are its weights. They are randomly initialized and then optimized as the model learns a task. If we later train on another task, these weights should be optimized for that next task (and so on). This does not happen, because the model loses plasticity. Since the weights can initially be optimized for a task, in the first epochs they must have one or more properties that allow them to learn, properties that are later lost during training.

The loss of these properties should explain the loss of plasticity.

We can study what properties of the weights change during training, especially when the model starts to lose plasticity. This should help us understand the causes.

During training, concurrently with the loss of plasticity, there is an increase in the fraction of constant units. When a neuron becomes constant, the gradient flowing through it becomes zero or close to zero; its weights no longer change, so it is no longer adaptive or plastic. In the case of ReLU activations, neurons are defined as dead when they output zero for every input [6-7]. Indeed, the loss of plasticity is accompanied by an increase in dead neurons [5].

image source: [5]
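
A simple way to monitor this is to count the ReLU units that output zero for every example in a batch. The rough diagnostic below assumes a plain Sequential network; it is not the instrumentation used in [5].

```python
import torch
from torch import nn

@torch.no_grad()
def fraction_dead_relu(model, X):
    """Fraction of hidden ReLU units that output zero for every input in X."""
    dead, total = 0, 0
    h = X
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            dead += (h.max(dim=0).values <= 0).sum().item()   # never activates on this batch
            total += h.shape[1]
    return dead / total

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                      nn.Linear(128, 64), nn.ReLU(),
                      nn.Linear(64, 2))
print(fraction_dead_relu(model, torch.randn(1024, 32)))   # near zero at initialization
```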

Once a unit dies, it stays dead. An increase in dead neurons therefore corresponds to a decrease in the network's ability to learn (the fewer the active neurons, the lower the effective capacity of the network).

Another interesting phenomenon is an increase in the average magnitude of the weights while performance degrades. In general, weight magnitude growth is associated with slower learning and a reduced speed of convergence in gradient descent.

image source: [5]
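
This is easy to track over tasks with a simple diagnostic (a sketch, not the exact statistic reported in [5]):

```python
import torch
from torch import nn

@torch.no_grad()
def average_weight_magnitude(model: nn.Module) -> float:
    """Mean absolute value over all parameters: a simple proxy for weight growth."""
    flat = torch.cat([p.abs().flatten() for p in model.parameters()])
    return flat.mean().item()

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 2))
print(average_weight_magnitude(model))   # logged over tasks, this grows as plasticity is lost
```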

A third phenomenon that accompanies the loss of plasticity is a drop in the effective rank of the representation. The effective rank measures how much each dimension contributes to the transformation induced by a matrix; simply put, it is related to the amount of information the matrix carries. The fewer dimensions containing important information, the more the matrix is filled with redundant information. For a hidden layer of a neural network, the effective rank of its representation reflects the number of neurons needed to produce the layer's output: the lower it is, the fewer neurons carry useful information (so most neurons contribute nothing useful). As training proceeds, the effective rank of the network decreases, so fewer neurons produce relevant information and the network's representational ability shrinks. This does not help in learning new information, because the network can rely on only a few useful neurons.

image source: [5]
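
One common way to compute an effective rank is the entropy-based definition of Roy and Vetterli: the exponential of the entropy of the normalized singular values. The sketch below uses that definition; the exact measure used in [5] may differ in its details.

```python
import torch

@torch.no_grad()
def effective_rank(matrix: torch.Tensor) -> float:
    """Effective rank: exp of the entropy of the normalized singular values.
    High when information is spread across many dimensions, low when it is
    concentrated in a few."""
    s = torch.linalg.svdvals(matrix)
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Example: the representation produced by a hidden layer for a batch of inputs.
H = torch.randn(256, 128)          # (batch, hidden units)
print(effective_rank(H))           # relatively high for random features; it drops during continual training
```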

These factors explain why we saw improvements from regularization techniques earlier. L2 regularization reduces the magnitude of the weights, but it does not affect dead units or effective rank. Shrink and Perturb is a combination of L2 regularization and random Gaussian noise injection, so it also reduces the number of dead units. Neither technique solves the third problem, though.

How can we improve neuronal plasticity in a model?

Improving the network plasticity

We need a method that keeps weights small, keeps few neurons dead (reduces dormancy), and maintains variability in the network. Knowing what is needed, we can modify the way neural networks learn so that they maintain plasticity.

At the start of training, the weights are initialized randomly, which provides high variability. This variability is then lost during training (along with plasticity). We could add variability back by reinitializing some of the weights. We must be careful, though, not to destroy what the network has learned: we should reinitialize only a few weights, and only those that are not used by the network. As a general intuition, a neuron's activation tells us how valuable it is; if a neuron's contribution to the others is low, it is not conveying important information and we can reinitialize it.

This method is called continual backpropagation and allows much more plasticity to be maintained.

adapted by the author. image source: [5]

Continual backpropagation thus seems to maintain network plasticity for a long time. Too bad the authors did not test it on LLMs.
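
Below is a heavily simplified sketch of the idea for a single hidden layer. The actual algorithm in [5] maintains a running utility per unit, a maturity threshold, and a small replacement rate; those details (and the class name and utility formula used here) are simplifications for illustration.

```python
import torch
from torch import nn

class SimpleContinualBackprop:
    """Simplified sketch of continual backpropagation: periodically
    reinitialize the hidden units that contribute least."""

    def __init__(self, fc_in: nn.Linear, fc_out: nn.Linear, replace_fraction=0.01):
        self.fc_in, self.fc_out = fc_in, fc_out
        self.replace_fraction = replace_fraction

    @torch.no_grad()
    def step(self, hidden_activations: torch.Tensor):
        # Utility of a hidden unit: mean |activation| times the magnitude of
        # its outgoing weights (low utility -> little contribution downstream).
        utility = hidden_activations.abs().mean(dim=0) * self.fc_out.weight.abs().sum(dim=0)
        n_replace = max(1, int(self.replace_fraction * utility.numel()))
        worst = utility.topk(n_replace, largest=False).indices

        # Reinitialize the incoming weights of the worst units and zero their
        # outgoing weights, so the rest of the network is initially undisturbed.
        new_w = torch.empty(n_replace, self.fc_in.in_features)
        nn.init.kaiming_uniform_(new_w)
        self.fc_in.weight[worst] = new_w
        self.fc_in.bias[worst] = 0.0
        self.fc_out.weight[:, worst] = 0.0
```

In a training loop, one would call step() every so often with the hidden activations of the current batch: the replaced units get fresh random incoming weights (restoring variability), while their zeroed outgoing weights keep the network's current output unchanged until they learn something useful.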


Parting thoughts

In general, most studies of continual learning have focused on maintaining network stability (retaining information learned in previous tasks and thus avoiding catastrophic forgetting). The lack of plasticity, though, affects the network's ability to acquire new information and new skills, and it is equally important in continual learning. Recent studies elucidate why this loss of plasticity occurs, and interestingly, it appears intrinsic to the backpropagation algorithm itself. Indeed, this is demonstrated by the fact that the lack of plasticity is ubiquitous across architectures and tasks [5].

On the one hand, regularization techniques promote plasticity, while other common choices apparently worsen the problem (dropout, the Adam optimizer). Until recently we did not know why; today we know that it is a subtle balance between weight explosion, neuron dormancy, and effective rank [5, 8, 9]. Therefore, we can modify backpropagation to take these factors into account.

image source: [5]

Continual backpropagation [5] is an interesting method for maintaining plasticity. It is simple and not computationally intensive: it reinitializes neurons that contribute little (these are often the same neurons that get pruned by techniques that try to reduce the size of a model).

Continual backpropagation uses a utility measure to find and replace low-utility units, which means it is based on a heuristic and is therefore not necessarily optimal. Also, most of these studies are conducted on small models (even when the benchmarks are extensive) rather than on models like LLMs. It would be interesting to see how approaches like continual backpropagation work on LLMs, and whether they allow these models to learn new knowledge or new skills.

In any case, loss of plasticity is a stark difference between natural and artificial neurons. Continual learning is important for many applications, such as a robot encountering new terrain or adapting LLMs to specialized domains and new tasks. Today we better understand what causes the problem and how to improve the plasticity of models, but open questions remain (and additional research is needed).

What are your thoughts? Have you tried models for continual learning? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects, and you can subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Can AI Replace Human Researchers

Safekeep Science's Future: Can LLMs Transform Peer Review?

Knowledge is Nothing Without Reasoning: Unlocking the Full Potential of RAG through Self-Reasoning

Short and Sweet: Enhancing LLM Performance with Constrained Chain-of-Thought

TL;DR

  • Today's neural networks are not capable of continual learning: they either fail to retain previously learned information (catastrophic forgetting) or fail to learn new information after training (loss of plasticity).
  • Loss of plasticity is a plague of all deep learning models. No matter the architecture, hyperparameters, or loss function, loss of plasticity is ubiquitous. Regularization techniques help the model maintain plasticity, which suggests that the weights are related to loss of plasticity.
  • Increased dead units, exploding weights, and loss of effective network rank are the causes of loss of plasticity. Regularization techniques act on the first two causes, but we need a solution for the third.
  • We can mitigate loss of plasticity with backpropagation modifications that reinitialize neurons no longer used by the network. Continual backpropagation is an example of this.

Reference

Here is the list of the principal references I consulted to write this article (only the first author of each article is cited).

  1. Ash, 2020, On Warm-Starting Neural Network Training, link
  2. Achille, 2019, Critical learning periods in deep networks, link
  3. Berariu, 2023, A study on the plasticity of neural networks, link
  4. Lyle, 2023, Understanding Plasticity in Neural Networks, link
  5. Dohare, 2024, Loss of plasticity in deep continual learning, link, code
  6. Lu, 2019, Dying ReLU and Initialization: Theory and Numerical Examples, link
  7. StackExchange, What is the "dying ReLU" problem in neural networks? link
  8. Lyle, 2024, Disentangling the Causes of Plasticity Loss in Neural Networks, link
  9. Lewandowski, 2023, Directions of Curvature as an Explanation for Loss of Plasticity, link
  10. Wang, 2023, A Comprehensive Survey of Continual Learning: Theory, Method and Application, link
  11. He, 2015, Deep Residual Learning for Image Recognition, link
  12. Chebykin, 2023, Shrink-Perturb Improves Architecture Mixing during Population Based Training for Neural Architecture Search, link
