Parameter-Efficient Fine-Tuning (PEFT) for LLMs: A Comprehensive Introduction

Large Language Models (LLMs) are, as the name suggests, large. These models usually have anywhere from 7 to 70 billion parameters. Loading a 70 billion parameter model in full precision requires roughly 280 GB of GPU memory. Training such a model means updating billions of parameters across millions or billions of documents, and the computation required for those updates is substantial. The self-supervised pre-training of these models is expensive, costing companies up to $100 million.
For the rest of us, there is significant interest in adapting these models to our own data. With comparatively limited datasets and computing power, how do we build models that improve on what the major players offer, at a fraction of the cost?
This is where the research field of Parameter-Efficient Fine-Tuning (PEFT) comes into play. Through various techniques, which we will soon explore in detail, we can augment small sections of these models so they are better suited to the tasks we aim to complete.
After reading this article, you will have a conceptual grasp of each PEFT technique implemented in Hugging Face and be able to distinguish the differences between them. One of the most helpful overviews I found before writing this article was a Reddit comment. There is also an excellent article from lightning.ai (the creators of PyTorch Lightning). Additionally, there is a comprehensive survey by Lialin et al. [2] that much of this piece is based on. In my article, I aim to address the gaps I identified while reviewing this material. At the time of writing, this article serves as a conceptual guide to all of the PEFT methods present in the Hugging Face library. The goal is for readers to be able to approach the research literature on other PEFT techniques with a fundamental understanding of the field.
A Moment for Self-Reflection: Is it time to fine-tune?
I wrote a previous article about considerations around fine-tuning LLMs and how similar performance can often be achieved through In-Context Learning (ICL). Since then, Llama 2 has been released and there have been great improvements in the open-source LLM world. Here are some expanded thoughts that extend beyond that article.
Fine-tuning is inherently risky for your organization. A recent paper showed that LLMs can memorize at least 1% of their training data [1]. If your data contains duplicates, that 1% floor rises even higher. If your fine-tuned LLM will be used by non-internal users, ask yourself whether you are comfortable handing them the data you are going to train on. Users can do malicious things to your model, such as prompt injection attacks. I made a LinkedIn post about these security risks that serves as a quick overview. If you can't give away your data, dynamic observation selection with ICL is one of your best options (see my other article for details).
You must also prioritize the creation of high-quality data labels for your learning task. If the organization's commitment to top-notch data is lacking, particularly in support of your project's fine-tuning, I recommend considering an alternative approach. Models thrive on high-quality labeled inputs. If the commitment from your stakeholders is not there for human labelers, you risk disappointing all parties involved.
Who Even Uses PEFT?
PEFT is used by most providers that offer the ability to fine-tune language models, and if a provider doesn't already make use of these techniques, I guarantee they have plans to. This article covers all of the techniques from Hugging Face PEFT that are available at the time of writing. The survey from Lialin et al. [2] is referenced by Google in their introductory video about tuning foundation models on Vertex AI. While Vertex AI is more of a black box, I have heard mention of adapters, prompt-tuning, and recently LoRA in sales pitches. It's unclear exactly what they use, but at its core, the techniques we discuss here are what's powering things.
OpenAI does offer fine-tuning, but famously has not implemented any PEFT methods yet. This is based on a blog post that OpenAI requested be deleted a few months ago. The article details that OpenAI does not use Adapters or LoRA to make fine-tuning more compute-friendly. There has been no announcement from OpenAI that this has changed, so the safe assumption is that these features are not available to users yet. PEFT is on OpenAI's roadmap, and since fine-tuning is much more lucrative than normal model use, I suspect it will be available in the near future.
Quick Transformer Review
I assume that readers of this article are familiar with the Transformer architecture. You don't need to know every detail of self-attention or its components, but you should have glanced at Vaswani et al. [3] and perhaps worked through The Annotated Transformer (in my opinion, the best resource for learning the Transformer).
I am going to include this pseudo code for the transformer block. If you don't know much about transformers, just know that at its core they do this:
def self_attention(x):
    k = x @ W_k
    q = x @ W_q
    v = x @ W_v
    return softmax(q @ k.T) @ v
def transformer_block(x):
    """ Pseudo code by author based on [2] """
    residual = x
    x = self_attention(x)
    x = layer_norm(x + residual)
    residual = x
    x = FFN(x)
    x = layer_norm(x + residual)
    return x
All functions from that pseudo code are as described in Vaswani et al. The FFN is a Feed Forward Network, which is 2 layers for our purposes. Many PEFT techniques that follow make changes to the transformer block or to self-attention, so I'll reference and change this pseudo code as we move through the guide.
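If you would like to actually run the block above, here is a minimal NumPy version with random weights. The dimensions, the √d_k scaling, and the ReLU feed-forward are my own illustrative choices rather than anything prescribed by [2]:

import numpy as np

d = 64  # model (embedding) dimension, chosen arbitrarily
rng = np.random.default_rng(0)
W_k, W_q, W_v = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))
W_1 = rng.normal(scale=0.02, size=(d, 4 * d))
W_2 = rng.normal(scale=0.02, size=(4 * d, d))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(a, eps=1e-5):
    return (a - a.mean(axis=-1, keepdims=True)) / (a.std(axis=-1, keepdims=True) + eps)

def self_attention(x):
    k, q, v = x @ W_k, x @ W_q, x @ W_v
    return softmax(q @ k.T / np.sqrt(d)) @ v

def FFN(x):
    return np.maximum(x @ W_1, 0) @ W_2  # two layers with a ReLU in between

def transformer_block(x):
    x = layer_norm(self_attention(x) + x)
    return layer_norm(FFN(x) + x)

x = rng.normal(size=(10, d))       # a "sequence" of 10 token embeddings
print(transformer_block(x).shape)  # (10, 64)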
A Tour through PEFT Methods

We'll go through each technique by looking at the broader classes in the diagram above. The classes we will cover are Additive methods, Adapters, Soft-Prompts, Reparameterization-based methods, and one hybrid method that is a combination of Reparameterization and Selective methods (and is not Sparse LoRA).
Additive Methods
Additive methods are probably the easiest to grasp. The goal of additive methods is to add a small set of new parameters or network layers to augment the base model. When fine-tuning, you update only the weights of these newly added parameters. This makes training computationally cheaper and also adapts well to smaller datasets (think in the ballpark of 100–500 examples for starters, with a ceiling near 100,000).
Method: Adapters
Adapters are simultaneously a method and a class of methods. The technique was introduced in Houlsby et al. [4]. The goal of adapters is to add small fully connected networks after the Transformer sub-layers and learn only those parameters. I follow the definitions from [2] and keep a strict definition of adapters: only adding fully connected layers to the network.
Houlsby et al. propose a simple update to the transformer block. They add fully connected layers in two places as shown in this pseudo code.
def transformer_block_adapter(x):
    """Pseudo code from [2] """
    residual = x
    x = self_attention(x)
    x = FFN(x)  # adapter
    x = layer_norm(x + residual)
    residual = x
    x = FFN(x)
    x = FFN(x)  # adapter
    x = layer_norm(x + residual)
    return x
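The adapter FFN itself is a small bottleneck: a down-projection, a nonlinearity, an up-projection, and a residual connection. Below is a minimal PyTorch sketch of that module; the class name, bottleneck size, and GELU activation are my own illustrative choices rather than the exact Houlsby et al. configuration:

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add a residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Near-identity initialization so the adapter barely perturbs the
        # pretrained model at the start of fine-tuning.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Only the adapter parameters are trained; the base model stays frozen.
adapter = BottleneckAdapter(d_model=768)
print(sum(p.numel() for p in adapter.parameters()))  # ~100k parameters per adapter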
Method: (IA)³
Infused Adapter by Inhibiting and Amplifying Inner Activations, or (IA)³, is a very interesting additive method that augments the transformer block with a handful of new parameters. It was proposed by Liu et al. [5] in 2022. Despite the name, this is not an adapter method, since it does not strictly add fully connected layers after the sub-layers of the transformer block.
Let's consider the scaled dot-product attention found in a normal transformer: softmax(Q Kᵀ / √d_k) V.
Since we are working with an additive method, we are seeking to add parameters to this network, and we want the number of added parameters to be quite small. (IA)³ proposes adding new learnable vectors to the attention mechanism, giving softmax(Q (l_k ⊙ K)ᵀ / √d_k) (l_v ⊙ V).
We add column vectors l_k and l_v and take the Hadamard product (⊙) between each vector and its matrix (multiplying the vector element-wise against every column of the matrix).
We also introduce one other learnable column vector l_ff that is applied to the feed-forward layers as follows: (l_ff ⊙ γ(x W_1)) W_2.
In this formula, γ (gamma) is the activation function applied to the product of the input and the first layer's weights. Here is some pseudo code for (IA)³:
def self_attention_ia3(x):
    k = x @ W_k
    q = x @ W_q
    v = x @ W_v
    k = l_k * k  # ia3: element-wise rescaling of the keys
    v = l_v * v  # ia3: element-wise rescaling of the values
    return softmax(q @ k.T) @ v
def transformer_block_ia3(x):
    """Pseudo code from [2]"""
    residual = x
    x = self_attention_ia3(x)
    x = layer_norm(x + residual)
    residual = x
    x = x @ W_1  # normal transformer
    x = l_ff * gelu(x)  # ia3
    x = x @ W_2
    x = layer_norm(x + residual)
    return x
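Because the new parameters are just three vectors per block, (IA)³ adds a vanishingly small number of trainable parameters. Here is a minimal PyTorch sketch of the rescaling; the dimensions are illustrative, and the vectors are initialized to ones so the pretrained model is reproduced exactly before any training steps:

import torch
import torch.nn as nn

class IA3Scalers(nn.Module):
    """The three learned (IA)^3 vectors for one transformer block."""
    def __init__(self, d_k: int, d_v: int, d_ff: int):
        super().__init__()
        # Ones initialization: the rescaling starts as the identity.
        self.l_k = nn.Parameter(torch.ones(d_k))
        self.l_v = nn.Parameter(torch.ones(d_v))
        self.l_ff = nn.Parameter(torch.ones(d_ff))

scalers = IA3Scalers(d_k=64, d_v=64, d_ff=2048)
k = torch.randn(10, 64)       # keys for a 10-token sequence
k_rescaled = scalers.l_k * k  # Hadamard (element-wise) rescaling of the keys
print(sum(p.numel() for p in scalers.parameters()))  # 2176 trainable parameters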
Soft-Prompts
To understand soft-prompts, let's first discuss hard-prompts, a concept that most readers might be familiar with, even if not by name. In hard prompting, we put together a dataset of prompts that represent the task at hand. When someone interacts with the network by posing a question, they could phrase it in different ways. With hard prompting, the process involves curating a dataset that covers the various ways a particular task could be framed for the language model.
Soft-prompting is a technique that tries to avoid this dataset creation. In hard prompting, we create data in a discrete representation (picking words). In soft-prompting, we instead seek a continuous representation of the text we input to the model. This does imply that you need one static prompt for the examples you are training on.
Depending on the technique, there are different methods for how the information is added to the network. The core idea is that the base model does not optimize the text itself but rather the continuous representation (i.e. some type of learnable tensor) of the prompt text. This can be some form of embedding or some transformation applied to that embedding. These techniques will be explored in more detail as we move on.
Method: Prompt-Tuning

Prompt tuning is a technique from Lester et al. [11] that falls into the category of soft-prompts. With soft-prompts our goal is to add information to the base model that is more specific to our current task. With prompt tuning we accomplish this by creating a set of parameters for the prompt tokens and injecting this at the beginning of the network.
To find a representation of the soft prompt, we create a separate set of embeddings for the static prompt used during training. We concatenate these prompt embeddings with the sequence embeddings and pass the combined input into the language model. Structuring the input this way lets us learn a parameterization of the soft prompt without needing to create many prompts for the same task.
def prompt_tuning(seq_tokens, prompt_tokens):
    """ Pseudo code from [2]. """
    x = seq_embedding(seq_tokens)
    soft_prompt = prompt_embedding(prompt_tokens)
    model_input = concat([soft_prompt, x], dim=seq)
    return model(model_input)
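Concretely, the soft prompt is just a trainable matrix of shape (number of virtual tokens × hidden size) prepended to the sequence embeddings. Here is a minimal PyTorch sketch, with tensor names and sizes of my own choosing (this is not the Hugging Face implementation):

import torch
import torch.nn as nn

num_virtual_tokens, hidden = 20, 768
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden) * 0.02)

def prepend_soft_prompt(input_embeds: torch.Tensor) -> torch.Tensor:
    """input_embeds: (batch, seq_len, hidden) from the frozen embedding layer."""
    batch = input_embeds.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, input_embeds], dim=1)  # (batch, 20 + seq_len, hidden)

# Only `soft_prompt` is optimized; the base model's weights remain frozen.
x = torch.randn(4, 50, hidden)
print(prepend_soft_prompt(x).shape)  # torch.Size([4, 70, 768])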
There are a number of rich benefits to fine-tuning through this approach. This new set of parameters can be very small, around 0.01% of the size of the tunable parameters of the base model. This creates opportunities to have an ensemble of task-specific soft prompts that all run off the same frozen base model, which drastically decreases memory requirements. For more on this, check out the post I shared on LinkedIn and also the section on prompt ensembling in [11].
Method: Prefix Tuning
Prefix tuning is another soft-prompting technique that is very similar to prompt tuning. In prompt tuning, we created a separate set of prompt embeddings and concatenated them with the continuous representation of the text input to the model. In prefix tuning, we also learn a continuous representation from a separate set of prompt tokens that is fed into the base model.
The difference is that the representation from prefix tuning is fed to all layers of the transformer, whereas in prompt tuning it was only concatenated with the input embeddings. Additionally, in prefix tuning we learn extra parameters for the soft prompt in the form of a fully connected network. After training, the FFN is discarded and we only use the soft prompt as input.
def transformer_block_prefix_tuning(x, soft_prompt):
    """ Pseudo code from [2] """
    soft_prompt = FFN(soft_prompt)
    model_input = concat([soft_prompt, x], dim=seq)
    return model(model_input)
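Here is a sketch of that reparameterization with made-up sizes: a small prompt embedding is expanded by an MLP into a prefix for every layer, and after training only the expanded prefixes need to be kept. This follows the article's simplified presentation; the original paper actually produces per-layer key/value prefixes rather than a single vector per layer:

import torch
import torch.nn as nn

num_virtual_tokens, hidden, n_layers = 20, 768, 12
prefix_embedding = nn.Parameter(torch.randn(num_virtual_tokens, hidden) * 0.02)
# The MLP (the "FFN" in the pseudo code) maps each prompt position to one prefix per layer.
reparam = nn.Sequential(
    nn.Linear(hidden, hidden),
    nn.Tanh(),
    nn.Linear(hidden, n_layers * hidden),
)

prefixes = reparam(prefix_embedding).view(num_virtual_tokens, n_layers, hidden)
# prefixes[:, i, :] is prepended to the input of transformer layer i.
# After training, `prefixes` can be cached and `reparam` discarded.
print(prefixes.shape)  # torch.Size([20, 12, 768])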
Method: P-Tuning

P-Tuning is another soft-prompting method, introduced by Liu et al. [6], that differs from prompt and prefix tuning. Colloquially, we can think of P-Tuning as prompt tuning where the prompt is encoded with an LSTM.
P-Tuning sets out to solve two problems the authors noticed. The first is the discreteness of the word embeddings passed to the model: if the new embeddings are randomly initialized and then optimized through stochastic gradient descent, the model is likely to fall into a local minimum. The second is the association among the word embeddings: with the parameterization in prompt tuning and prefix tuning, the soft prompt tokens are technically independent of each other. The authors wanted an approach that makes the prompt tokens dependent on one another.
The authors propose that a prompt is a function that takes a context x and a target y and organizes itself into a template T. The authors provide the example sequence "The capital of Britain is [MASK]". Here the prompt is "The capital of … is …", the context is "Britain" and the target is [MASK]. We can use this formulation to create two sequences of tokens, everything before the context and everything after the context before the target. We can learn a representation of this additional information that can be reduced to a continuous output and fed into the language model.
To embed the prompt in this way, we use a small network consisting of an LSTM fed into a two-layer FFN. We pass in the prompt tokens: those before the context, and those between the context and the target.
def p_tuning(seq_tokens, prompt_tokens):
    """Pseudo code for p-tuning created by Author."""
    h = prompt_embedding(prompt_tokens)
    h = LSTM(h, bidirectional=True)
    h = FFN(h)
    x = seq_embedding(seq_tokens)
    model_input = concat([h, x], dim=seq)
    return model(model_input)
Method: LLaMA-Adapter

LLaMA-Adapter is a soft-prompting technique introduced by Zhang et al. [7] that applies a more efficient version of prefix tuning to the Llama model.
LLaMA-Adapter has a few key differences from prefix tuning. The authors introduce adaptation prompts, which are soft prompts concatenated with the input to a transformer layer. These adaptation prompts are only inserted into the L topmost of the N transformer layers.
The authors also introduce zero-initialized attention. With additive methods, we introduce a new set of parameters that have some random initialization. Because of this random noise added to the LM, fine-tuning can be unstable early on, producing large loss values in the first steps. To solve this problem, the authors introduce a gating factor, initialized to zero, that multiplies the attention contribution of the adaptation prompt. The product of the gating factor and this attention is referred to as zero-init attention. The gating value is adaptively tuned over the training steps to create a smoother update of the network parameters.
def transformer_block_llama_adapter(x, soft_prompt, gating_factor):
    """LLaMA-Adapter pseudo code created by Author (simplified)."""
    residual = x
    adaption_prompt = concat([soft_prompt, x], dim=seq)
    adaption_prompt = self_attention(adaption_prompt) * gating_factor  # zero-init attention
    x = self_attention(x)
    x = x + adaption_prompt  # add the gated prompt contribution; with gating_factor = 0 the block starts as the pretrained model
    x = layer_norm(x + residual)
    residual = x
    x = FFN(x)
    x = layer_norm(x + residual)
    return x
Reparameterization-Based Methods
Reparameterization-based methods focus on finding low-dimensional representations of the same weight matrices found in the base model. The first connection between fine-tuning and low-dimensional representations was shown by Aghajanyan et al. [8]. The authors make a connection between the full parameters of the model and a lower-dimensional reparameterization. Depending on the task, they are able to achieve 90% of the performance of the fully fine-tuned model with approximately 0.0002% of the trainable parameters.
Method: LoRa

One of the most popular fine-tuning techniques is a reparameterization-based method called Low-Rank Adaptation (LoRA) [9]. LoRA updates a weight matrix by learning a separate matrix that represents the updates from optimization, and goes a step further by factoring that update into two smaller matrices. By creating smaller weight matrices, we have fewer parameters to learn.
To train LoRA, we use the fundamental ideas of gradient descent, where we make incremental adjustments to a set of parameters that move us closer to our goal (minimizing the loss function). In LoRA, we choose to isolate all of our updates in a separate matrix. This matrix, which we denote ΔW, represents all of the parameter updates learned during the fine-tuning process.
Let's assign W_0 dimensions d×k (d rows and k columns). We want to update its parameters so that the model is aligned with our new goal. We can represent this update by ΔW, which also has dimensions d×k, giving the update rule W = W_0 + ΔW.
Now let's change our update rule so that ΔW is modeled by a matrix product AB. We assign matrix A dimensions d×r and B dimensions r×k. If you're up to speed on your matrix multiplication, you will see that AB has the same dimensions as W_0, so the addition of these matrices is valid. Here's why AB is a better choice than ΔW: A has only d·r parameters and B has only r·k. If we make r a very small number (r=8 is a typical value), then the number of parameters in A and B is drastically smaller than in ΔW. By learning only the parameters of A and B, we learn d·k − (d·r + r·k) fewer parameters. In practice this allows us to learn only 0.1–0.5% of the parameters of the original network.
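As a quick sanity check with made-up but typical numbers: for a square weight matrix with d = k = 4096 and rank r = 8, ΔW alone would have 4096 × 4096 ≈ 16.8 million parameters, while A and B together have 4096 × 8 + 8 × 4096 = 65,536, roughly 0.4% of the full update.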
The walkthrough I just illustrated is the essence of how LoRA works. Instead of optimizing a matrix W through additional training steps, we model the update ΔW with two new matrices A and B that have far fewer parameters. This lets us optimize many fewer parameters, which makes training much more efficient.
Typically we apply this update rule to the key and value matrices of self-attention in the transformer block. We also apply a scaling factor to the low-rank update, set here to 1/r following [2] (the original LoRA paper uses α/r, where α is a constant hyperparameter). See the pseudo code below.
def lora_linear(x, W):
    scale = 1 / r  # r is the rank; the LoRA paper scales by alpha / r
    h = x @ W
    h += scale * (x @ W_a @ W_b)  # W_a, W_b determined based on W; only the low-rank update is scaled
    return h
def self_attention_lora(x):
    """ Pseudo code from Lialin et al. [2]. """
    k = lora_linear(x, W_k)
    q = x @ W_q
    v = lora_linear(x, W_v)
    return softmax(q @ k.T) @ v
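To make this concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The initialization (A Gaussian, B zeros, so training starts exactly at the pretrained weights) follows the LoRA paper; the class and variable names are my own:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(r, k))         # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable parameters vs. ~16.8M in the frozen base layer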
Selective Methods
With selective methods, we choose some of the parameters to update and leave the rest untouched. The problem with these approaches is that the updates form a sparse matrix of parameters. Sparse matrix operations are not well supported on present-day GPUs and pose computational challenges. For more information on why, check out [10].
There are also selective techniques that focus on pruning unimportant parameters or updating only the model's bias terms. These also create additional complexity when training the model. In general, these methods have more challenging implementations since their compute operations are more expensive than dense alternatives.
Method: AdaLoRa
This is a hybrid approach that combines ideas from reparameterization and selective methods. Zhang et al. [12] developed AdaLoRA by studying LoRA and posing the question: "How can we allocate the parameter budget adaptively according to [the] importance of modules to improve the performance of parameter-efficient fine-tuning?" In other words: "How can we give preference to the parameters that lead to better performance rather than treating all parameters equally?"
Instead of using two matrices A and B as in LoRA, AdaLoRA mimics Singular Value Decomposition (SVD) to represent the update with three matrices: P (left singular vectors), Λ (singular values), and Q (right singular vectors). Using these three matrices, we can reconstruct an approximation of the update ΔW as PΛQ. The benefit of the SVD form is that the singular values represent the importance of each direction in this lower-dimensional space. The contribution of the paper is an efficient, SVD-like parameterization that lets the method reason about which weights are important enough to keep optimizing.
In LoRA, we saw that we can approximate ΔW with two matrices A and B. Here we replace A and B with the approximation PΛQ. Since Λ only has values along its diagonal (the singular values), we store it as a column vector. We pick the dimensions of P (d×r), Λ (r×r), and Q (r×k) to match the dimensions of the weight matrix W (d×k).
The other novel result is an importance-scoring technique that determines which elements of the SVD can be pruned. Essentially, the method considers a group of triplets (a singular value together with its left and right singular vectors) and determines how important each is to the lower-dimensional representation. It does this with a score that relates the singular values to their left/right singular vectors. These scores are run through a sensitivity function that combines an exponential moving average of the gradient-weight product (a proxy for importance) with an uncertainty term, likewise exponentially averaged over the current and previous steps.
While pruning elements of the SVD, the rank of the lower dimension (the r term of the matrices) is iteratively reduced as the least important triplets are deleted. This is accomplished through a global budget scheduler that decays the rank r over the training steps. The budget is initialized at 1.5 times the target budget and follows a cubic decay down to the target budget after t warm-up steps.
This is a conceptually dense method to understand. If you're technically inclined, I encourage reading the paper to understand its inner workings. If you remember it as an efficient SVD-style variant of LoRA that prunes the least important singular vectors, that is a safe mental model.
def adalora_linear(x, W, curr_sv):
    scale = alpha / r  # r is the current rank, alpha a constant
    h = x @ W
    # p, lamda, and q are the SVD-style factors associated with the weight matrix W.
    # curr_sv marks which singular-value triplets are currently kept and optimized.
    h += scale * (x @ p[curr_sv] @ lamda[curr_sv] @ q[curr_sv])
    return h
def self_attention_adalora(x, curr_sv):
    """
    AdaLoRA pseudo code created by author.
    This only shows the difference in the self_attention block.
    Does not include code for the pruning technique.
    """
    k = adalora_linear(x, W_k, curr_sv)
    q = x @ W_q
    v = adalora_linear(x, W_v, curr_sv)
    return softmax(q @ k.T) @ v
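Here is a minimal PyTorch sketch of the SVD-style update itself, with names and sizes of my own choosing (the actual method also regularizes P and Q toward orthogonality, which is omitted here):

import torch
import torch.nn as nn

d, k, r = 4096, 4096, 12  # start with a slightly over-provisioned rank
P = nn.Parameter(torch.randn(d, r) * 0.01)  # left singular vectors
lam = nn.Parameter(torch.zeros(r))          # singular values, stored as a vector
Q = nn.Parameter(torch.randn(r, k) * 0.01)  # right singular vectors

def adalora_delta(x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """x: (..., d); keep: boolean mask over the r triplets that survive pruning."""
    return (x @ P[:, keep]) * lam[keep] @ Q[keep, :]

x = torch.randn(2, d)
keep = torch.tensor([True] * 8 + [False] * 4)  # pruning has removed 4 triplets
print(adalora_delta(x, keep).shape)            # torch.Size([2, 4096])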
Comparison of Methods
To compare all of the methods in one place, I created the table below showing the number of trainable parameters (all of which are parameters added on top of the base network), the method type, and an informal summary of each method. The informal summary is how I would describe the method in one sentence to a college student who had never heard of it before.

Is this the only conceptual guide you need?
I argue this is the only conceptual guide you need because, after reading it, you understand the basics of the PEFT techniques. As you may have noticed, each technique expands on ideas from another. After this introduction, you know enough of the basics to explore the research papers yourself. However, if you do end up needing another conceptual guide, please share it in the comments of the article so other readers can find those resources!
Time to get started!
After this conceptual review, you are at a great point to start experimenting with these methods to train your own models. There are plenty of great implementation guides available from Hugging Face. If you want a less hands-on approach, you can work with Google's Vertex AI models or with OpenAI fine-tuning.
Thank you for reading the article! If you have additional questions or something was unclear, leave a comment and I will get back to you. If you want to see more articles like this one, please follow me on Medium and on LinkedIn.
If you found a technical error in this article, please let me know ASAP! I strive to make sure the information I publish is as correct as possible, but no one is perfect.
References:
[1] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, & Chiyuan Zhang. (2023). Quantifying Memorization Across Neural Language Models.
[2] Vladislav Lialin, Vijeta Deshpande, & Anna Rumshisky. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, & Illia Polosukhin (2017). Attention Is All You Need. CoRR, abs/1706.03762.
[4] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, & Sylvain Gelly (2019). Parameter-Efficient Transfer Learning for NLP. CoRR, abs/1902.00751.
[5] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, & Colin Raffel. (2022). Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning.
[6] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, & Jie Tang (2021). GPT Understands, Too. CoRR, abs/2103.10385.
[7] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, & Yu Qiao. (2023). LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
[8] Armen Aghajanyan, Luke Zettlemoyer, & Sonal Gupta (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. CoRR, abs/2012.13255.
[9] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, & Weizhu Chen (2021). LoRA: Low-Rank Adaptation of Large Language Models. CoRR, abs/2106.09685.
[10] Trevor Gale, Matei Zaharia, Cliff Young, & Erich Elsen. (2020). Sparse GPU Kernels for Deep Learning.
[11] Brian Lester, Rami Al-Rfou, & Noah Constant. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning.
[12] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, & Tuo Zhao. (2023). Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning.