Continual Learning: A Primer


Training large language models currently costs somewhere between $4.3 million (GPT-3) and $191 million (Gemini) [1]. As soon as new text data becomes available, for example through licensing agreements, re-training with this data can improve model performance. However, at these costs (and not just at these levels; which company has even $1 million to spare for the final training run alone, let alone the preliminary experiments?), frequent re-training from scratch is prohibitively expensive.

Photo by Dan Schiumarini on Unsplash

This is where Continual Learning (CL) comes in. In CL, data arrives incrementally over time and cannot be (fully) stored. The machine learning model is trained solely on the new data; the challenge is catastrophic forgetting: performance on old data drops. The drop occurs because the model adapts its weights to the current data only; there is no incentive to retain the information gained from previous data.

To combat forgetting and retain old knowledge, many methods have been proposed. These methods can be grouped into three central categories*: rehearsal-based, regularization-based, and architecture-based. In the following sections, I detail each category and introduce selected papers to explore further. While I focus on classification problems, most of the covered ideas apply equally to, e.g., regression tasks, although they might require adaptations. At the end, I recommend papers for exploring CL further.

Rehearsal-based methods

Schematic view of the rehearsal-based category. Besides original data from the current task, data from old tasks is replayed from a small memory buffer. Image by the author.

Methods from the rehearsal-based category (also called memory- or replay-based) maintain an additional small memory buffer. This buffer either stores samples from old tasks or holds generative models.

In the first case, the stored samples can be real samples [2], synthetic samples [3], or merely feature representations of old data [4]. As the memory size is commonly limited, the challenge is deciding which samples (or features) to store and how best to exploit the stored data. Strategies range from storing the samples most representative of a data class (say, the most average cat image) [5] to ensuring diversity [6].
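To make the storage side more concrete, here is a minimal sketch of one widely used strategy, reservoir sampling, which keeps every sample seen so far in the buffer with equal probability. The `ReplayBuffer` class and its interface are my own illustration for this primer, not code from any of the cited papers.

```python
import random

class ReplayBuffer:
    """Hypothetical fixed-size memory buffer filled via reservoir sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []       # stored (input, label) pairs
        self.n_seen = 0      # total number of samples observed so far

    def add(self, x, y):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))              # buffer not yet full: always store
        else:
            # every sample seen so far ends up in the buffer with
            # probability capacity / n_seen
            idx = random.randrange(self.n_seen)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        # draw a random mini-batch of stored old-task samples for replay
        return random.sample(self.data, min(batch_size, len(self.data)))
```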

In the second case, the additional memory buffer is used to store one or more generative models. These models are maintained alongside the main neural network and are trained to generate task-specific data. After training, they can be queried on demand for data of tasks that are no longer available. The generative networks are usually GANs (e.g., [7]) or VAEs (e.g., [8]).
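As a rough illustration of how such a generator is used at training time, the sketch below samples pseudo-inputs for past tasks from a frozen generator and labels them with a frozen copy of the previous model, loosely in the spirit of deep generative replay [7]. The function name, the latent dimension, and the assumption of an unconditional generator are all placeholders of mine.

```python
import torch

@torch.no_grad()
def generate_replay_batch(old_generator, old_model, batch_size, latent_dim=64):
    """Hypothetical helper: create pseudo-data for tasks whose real data is gone."""
    z = torch.randn(batch_size, latent_dim)      # latent noise for the generator
    x_fake = old_generator(z)                    # pseudo-inputs resembling old tasks
    y_fake = old_model(x_fake).argmax(dim=1)     # pseudo-labels from the frozen old model
    return x_fake, y_fake
```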

In both cases, the replayed data is usually combined with the current task's data for joint training, though other variants exist (e.g., [9]).
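A joint training step might then look like the following sketch, which simply adds the loss on a replayed mini-batch to the loss on the current batch. It assumes the `ReplayBuffer` from above stores tensor pairs and that the model is a standard PyTorch classifier; actual methods differ in how they weight or constrain the two loss terms.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, optimizer, x_new, y_new, buffer, replay_batch_size=32):
    """Hypothetical training step on current-task data plus replayed data."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_new), y_new)              # loss on the new task
    if len(buffer.data) > 0:
        replayed = buffer.sample(replay_batch_size)          # old-task samples
        x_old = torch.stack([x for x, _ in replayed])
        y_old = torch.stack([y for _, y in replayed])
        loss = loss + F.cross_entropy(model(x_old), y_old)   # loss on replayed data
    loss.backward()
    optimizer.step()
    return loss.item()
```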

Architecture-based methods

Schematic view of the architecture-based category. Each task reserves (and possibly expands) specific parts of the neural network. Image by the author.

Methods from the architecture-based category usually dedicate parts (e.g., layers or individual neurons) of a neural network to specific tasks. Once a part has been claimed by a task, this task-specific region is not modified by subsequently arriving tasks. As task-specific weights are not changed, catastrophic forgetting can be avoided altogether.

A downside is that only a limited number of tasks can "reserve" space within the network. Two directions exist that work around this problem.

The first direction is to expand the network architecture. Methods here include the well-known Progressive Networks [10] and DEN [11]. The former adds a new network branch (i.e., a stack of layers) for each task and reuses old, frozen branches via lateral connections (i.e., the output of layer i in an old branch feeds into layer i+1 of the new branch). The latter dynamically expands the network whenever its capacity is deemed insufficient.
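The sketch below shows a single new column in the spirit of Progressive Networks [10]: its second layer receives, in addition to its own first-layer output, lateral inputs from the (frozen) first layers of the previous columns. Real implementations use adapter modules and deeper columns; the layer names and sizes here are placeholders.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """Hypothetical new column with lateral connections to frozen old columns."""

    def __init__(self, in_dim, hidden_dim, n_prev_columns):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # one lateral adapter per frozen previous column
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_prev_columns)]
        )

    def forward(self, x, prev_h1_list):
        # prev_h1_list: detached layer-1 activations of the frozen old columns
        h1 = torch.relu(self.layer1(x))
        h2 = self.layer2(h1)
        for lateral, prev_h1 in zip(self.laterals, prev_h1_list):
            h2 = h2 + lateral(prev_h1)           # lateral connection from an old column
        return torch.relu(h2)
```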

The second direction uses task-specific and task-shared parts, so that all tasks draw from a (large) pool of shared parameters and additionally have a (small) set of task-specific parameters. Interesting works here range from maintaining a central parameter space [12] to overlaying multiple binary masks onto the same network [13, 14] ([14] was the first CL paper I read!).
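To illustrate the masking idea, here is a hypothetical linear layer that overlays a per-task binary mask on a shared weight matrix, so each task effectively uses its own subnetwork. In actual methods such as [13, 14] the masks are learned or selected per task; fixing them at random, as in this sketch, only serves to keep the example short.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Hypothetical shared layer with one binary mask per task."""

    def __init__(self, in_dim, out_dim, n_tasks):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_dim, in_dim))
        # placeholder: random fixed masks; real methods learn or select these
        self.register_buffer(
            "masks", (torch.rand(n_tasks, out_dim, in_dim) > 0.5).float()
        )

    def forward(self, x, task_id):
        masked_weight = self.weight * self.masks[task_id]   # task-specific subnetwork
        return x @ masked_weight.t()
```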

The challenge with the second direction, using task-shared and task-specific parameters, is not to overwrite the task-shared region. This leads to the idea of parameter regularization, the third and final category in this primer.

Regularization-based methods

Schematic view of the regularization-based category. The forward pass utilizes all weights. During the backward pass, important weights are not updated. Image by the author.

Methods from the regularization-based category first identify the network parameters that are important for old tasks. They then regularize parameter updates according to this importance: important weights are changed less, unimportant weights are changed more during training. This is achieved through one or more additional loss terms that grow when important weights are changed. Since training minimizes the loss, the network is discouraged from updating the important weights.

From my experience, most published research falls into this category (though not necessarily into this category alone, as methods often combine ideas from multiple categories). Among regularization-based methods, elastic weight consolidation (EWC) [15] is one of the most established (and oldest): it penalizes updates to important weights through an additional loss term, and various successors have been proposed (e.g., [16, 17]).
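As a rough sketch of how such a penalty looks in code, the EWC-style term below grows quadratically whenever weights deemed important for old tasks drift away from their values after the previous task. The names `old_params` and `importance` are my own; in EWC [15], the per-parameter importance is estimated from the diagonal of the Fisher information matrix.

```python
import torch

def ewc_penalty(model, old_params, importance, lam=100.0):
    """Hypothetical EWC-style regularizer: lam/2 * sum_i importance_i * (theta_i - theta_old_i)^2."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        # old_params[name]: detached copy of the parameter after the previous task
        # importance[name]: per-weight importance estimate (e.g., diagonal Fisher)
        penalty = penalty + (importance[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# usage: total_loss = task_loss + ewc_penalty(model, old_params, importance)
```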

A very interesting paper is the gradient projection memory (GPM) paper [18]. I read it a couple of months ago and remember two things vividly:

  1. I had to take a nap in between to let the math sink in
  2. I went partying afterward (unrelated to the reading, but still nice)

The idea builds on the observation that gradients are directions in an n-dimensional space. Each task (and thus its gradients) occupies a specific subspace, the so-called core space, which contains the knowledge needed to perform that task. When training on new tasks, GPM regularizes the updates so that they are orthogonal to the core spaces of previous tasks. This pushes new tasks into different subspaces while leaving the reserved ones (mostly) as-is. As a result, tasks cause minimal interference with each other.
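The sketch below conveys the projection step in a simplified form: before the optimizer step, each gradient is stripped of the component that lies inside the stored core space of old tasks, so the update is (approximately) orthogonal to directions that matter for previous tasks. The per-parameter basis matrices and their construction (in GPM, via an SVD of old-task representations) are assumed to be given here, and the flattened, per-parameter treatment is a simplification of the paper's layer-wise formulation.

```python
import torch

@torch.no_grad()
def project_gradients(model, core_bases):
    """Hypothetical GPM-style step: remove gradient components in the old tasks' core space."""
    for name, param in model.named_parameters():
        if param.grad is None or name not in core_bases:
            continue
        g = param.grad.reshape(-1)               # flatten the gradient
        basis = core_bases[name]                 # (numel, k) matrix with orthonormal columns
        g_proj = g - basis @ (basis.t() @ g)     # subtract the component in the core space
        param.grad.copy_(g_proj.view_as(param.grad))
```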

Conclusion and recommended reading

In this short primer, I discussed the three* main directions in continual learning (CL) research: rehearsal-based, architecture-based, and regularization-based methods. Rehearsal-based methods maintain a memory buffer from which old data is replayed, architecture-based methods dedicate parts of the network to specific tasks, and regularization-based methods regularize updates to important parameters.

Within each category, I touched upon several papers that can serve as starting points for your own research in CL. If you are unsure how to get started, I recommend the following papers (titles given), roughly in this order:

  1. Experience Replay
  2. Progressive Neural Networks
  3. Elastic Weight Consolidation
  4. Three Scenarios for Continual Learning
  5. A Comprehensive Survey of Continual Learning: Theory, Method and Application
  6. Forget-free Continual Learning with Winning Subnetworks
  7. Is Forgetting Less a Good Inductive Bias for Forward Transfer?

Are there further papers that you can recommend? Do let me know in the comments!


References

*Some surveys use five categories, but the three presented in this article are the most common ones.

[1] https://www.visualcapitalist.com/training-costs-of-ai-models-over-time/; accessed 13 October 2024

[2] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. 2019. arXiv preprint.

[4] Xialei Liu, Chenshen Wu, Mikel Menta, Luis Herranz, Bogdan Raducanu, Andrew D Bagdanov, Shangling Jui, and Joost van de Weijer. Generative feature replay for class-incremental learning. 2020. In CVPR Workshops, pages 226–227

[5] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. 2017. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.

[6] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow memory: Continual learning with a memory of diverse samples. 2021. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8218–8227.

[7] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. 2017. Advances in Neural Information Processing Systems, 30.

[8] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. 2018. In International Conference on Learning Representations.

[9] Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. 2021. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6993–7001.

[10] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 2016. arXiv preprint arXiv:1606.04671.

[11] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. 2018. In International Conference on Learning Representations.

[12] Jaehong Yoon, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. Scalable and order-robust continual learning with additive parameter decomposition. 2019. In International Conference on Learning Representations.

[13] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. 2018. In International Conference on Machine Learning, pages 4548–4557. PMLR.

[14] Haeyong Kang, Rusty John Lloyd Mina, Sultan Rizky Hikmawan Madjid, Jaehong Yoon, Mark Hasegawa-Johnson, Sung Ju Hwang, and Chang D Yoo. Forget-free continual learning with winning subnetworks. 2022. In International Conference on Machine Learning, pages 10734–10750. PMLR.

[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. 2017. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

[16] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. 2018. In International Conference on Machine Learning, pages 4528–4537. PMLR.

[17] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. 2017. In International Conference on Machine Learning, pages 3987–3995. PMLR.

[18] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. 2021. In International Conference on Learning Representations.

