A Winding Road to Parameter Efficiency
Good news: Using LoRA for Parameter Efficient Finetuning (PEFT) can be straightforward. With a simple strategy of adapting all linear modules and some light tuning of the learning rate, you can achieve good performance. You could stop reading here!
But what if you want more? What if you're seeking a deeper understanding of which modules to tune, and how to optimize your model for performance, GPU memory utilization or training speed? If you're looking for a more nuanced understanding and control over these aspects, then you're in the right place.
Join me on this journey as we navigate the winding road to parameter efficiency. We'll delve into the deliberate design decisions that can help you to get the most out of LoRA while offering you more control and a better understanding of your model's performance. Let's embark on this exciting exploration together.
You will get the most out of this article if you already have at least a basic understanding of LoRA, such as what we covered in the previous article. Furthermore, we will be optimizing a RoBERTa model [1], which uses the transformer architecture. A general understanding of its basic components helps, but is not strictly necessary to follow the main subject.

In the previous article, we explored how to apply LoRA to train adapters that only require a fraction of the parameters needed for a full finetuning. We also saw what such an implementation might look like in code. However, our focus was primarily on the mechanical aspects. We did not address which modules to adapt, nor how to size the adapters for efficiency and performance.
Today, this is our focus.
We zoom out and recognize that there are many algorithmic design decisions to make, many of which influence each other. These are often expressed as hyperparameters by the original algorithm creators. To handle the sheer number of possible combinations of hyperparameters and their values, we'll use a systematic approach to learn about the relative impact of these design decisions. Our aim is not only to eventually achieve good performance for the model at hand; we also want to run experiments that gather empirical feedback and strengthen our intuitive understanding of the model and its design. This will not only serve us well for today's model, task, and dataset: much of what we learn will be transferable and give us greater confidence moving forward as we work on variations of the model, new tasks, and new datasets in the future.
Execution of Experiments:
I will be using Amazon SageMaker Automatic Model Tuning (AMT) to run the experiments throughout this article. With AMT, I will either deliberately explore and analyze the search space, or automatically find a good combination of hyperparameter values.
As a side note, "tuning" is a term that serves two purposes in this article. On one hand, we use "hyperparameter tuning" to refer to the adjustment of hyperparameter values during model training, a process automated by SageMaker's Automatic Model Tuning. On the other hand, we use "tuning" to describe the process of starting with a pre-trained model and then finetuning its parameters (not the hyperparameters) for our specific downstream task.
To maintain focus, I will keep the implementation details in this article brief. However, you will find all the experiments with all their details in the linked notebooks.
I also encourage you to learn more background about using AMT, the differences between the search strategies Random Search and Bayesian Optimization, the concept of warm starting tuning jobs, and about visualizing and analyzing the results. All of these are discussed in this article:
Explore advanced techniques for hyperparameter optimization with Amazon SageMaker Automatic Model…
Baselines: What to compare to?
We will focus on architectural decisions:
- Which modules should we adapt?
- On what layers? All of them? Some? Just the middle layers?
- How large should the module adapters be? What should r, the rank of the LoRA matrices, be?
However, before we start experimenting, how can we ensure that we are on the right track and that our changes have a positive impact? Let's define some baselines to compare our progress to.
If finding baselines for comparison does not appeal to you, feel free to skip ahead to the next section "What to tune?".
Over time, we hope to observe that our training runs are producing better results. But when are we done and can stop experimenting? Seeing no further improvements after a while could indicate that we have achieved the optimum. However, it could also mean that we have run out of ideas to try, even though further improvement was possible.
Performance Expectations and Reproducibility
In order to interpret the results of our experiments, we need to establish clear performance expectations for our model. This includes an understanding of the ideal performance as an upper bound, as well as the minimum performance we expect to see.
Deep learning is inherently noisy, meaning that no two runs will produce the exact same result. This raises important questions about the results we observe. Is the performance we're seeing reproducible using the hyperparameter values we tested with, or did we just get lucky with this particular run? To answer these questions, we need to validate a set of hyperparameter values that we've found to perform well. In this article I'll do this by running the same hyperparameter values five times to calculate the mean performance and its variance.
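To make this concrete, here is a minimal sketch of the reproducibility check, assuming we have already collected the accuracy of five repeated runs with identical hyperparameter values (the numbers below are placeholders, not results from the article):

```python
import statistics

# Validation accuracies from five repeated runs with the same hyperparameter
# values (placeholder numbers; substitute your own observations).
scores = [0.948, 0.946, 0.949, 0.947, 0.945]

print(f"accuracy: {statistics.mean(scores):.4f} "
      f"+/- {statistics.stdev(scores):.4f}")
```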
Expected performance – Full Finetuning: In our case, reasoning about the expected performance is easy. We are finetuning the RoBERTa base model for a sentiment analysis task on the SST-2 dataset, as was done in the RoBERTa paper [1]. Therefore, we can directly use the numbers reported by the authors as a sanity check. We will align our setup and the hyperparameters used with those in the paper.
We still run the training ourselves, so that we have a verified setup and training procedure before we apply LoRA to it. Consequently, we can perform a sanity check to ensure that the numbers we observe roughly match those from the paper. If we cannot match the numbers, we would need to check our setup.
The RoBERTa paper [1] reported an accuracy of 94.8 in Table 8. This serves as our benchmark for expected performance during full finetuning. After checking that we are in the ballpark of that number, we will use our own setup and results as a baseline for comparing all the following experiments, which are derived from our setup.
Expected performance – LoRA Finetuning: This is easy as well. The promise of LoRA is to almost match the full finetuning performance, but with only a fraction of the parameters of a full finetuning. Hence, we will compare to our results from the full finetuning performance as described in the preceding section.
Expected minimum performance: One possible baseline would be random performance. For our task with two classes, that would be 0.5. But we are not building a model from scratch, and from the papers we already know that the LoRA approach works very well, so random performance would not be an informative baseline.
Instead, let's use a baseline where we only train the classifier and keep the embeddings and transformer layers frozen, in the state they came from pre-training. This should result in much lower performance than a full finetuning, but still much better than random. Importantly, it also serves as a comparison point for non-functional aspects like parameter efficiency, memory usage, and training throughput.
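To sketch what this "Classifier Only" baseline could look like in code – assuming the Hugging Face Transformers RoBERTa checkpoint, not necessarily the exact setup from the linked notebooks – we freeze everything except the classification head:

```python
from transformers import AutoModelForSequenceClassification

# Load the pre-trained RoBERTa base model with a fresh 2-class head.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Freeze embeddings and transformer layers; keep only the classifier trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```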


All scenarios above have been run five times, and the mean performance is shown in the diagram. You can also see that we are in the ballpark of the performance from the RoBERTa paper with the "Full Finetuning" scenario. As we hoped, "LoRA Base" (adapting all linear modules) matches that performance, but uses far fewer parameters. The "Classifier Only" scenario performs much worse, as expected, but is cheaper in terms of parameters and trains faster.
Moving forward, we will now take our numbers as baselines to compare future experiments to.
You can find more details in the accompanying notebook.
Execution of Experiments:
First, for each baseline, we search for an optimal learning rate value. We use Bayesian Optimization to efficiently explore and then exploit the search space.
Second, the best hyperparameter values we found for a scenario may not necessarily reproduce good results. It could be that the values we identified are only the best relative to the other values we explored. Perhaps the values we found were not relevant at all, e.g. the model was not sensitive to this value range. To estimate how well the findings hold up, for each scenario, we run the best combination of hyperparameter values another five times and report the observed standard deviation of the objective metric.
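For illustration, here is a hedged sketch of how such a tuning job might be defined with the SageMaker Python SDK; the estimator, metric regex, value ranges, and data channels are assumptions for this example, not the exact configuration from the linked notebooks:

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# `estimator` is assumed to be an already-defined SageMaker estimator
# (e.g. a HuggingFace estimator wrapping our training script), and the
# regex is assumed to match a line the script logs after evaluation.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "eval_accuracy: ([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-6, 1e-3, scaling_type="Logarithmic"),
        "epochs": IntegerParameter(1, 20),
    },
    strategy="Bayesian",  # explore first, then exploit promising regions
    max_jobs=50,
    max_parallel_jobs=2,
)

# The S3 locations of the prepared dataset are assumed to exist.
tuner.fit({"train": train_s3_uri, "validation": validation_s3_uri})
```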
LoRA Base Scenario – First Result: It's encouraging to see that the LoRA finetuning approach, scenario "LoRA Base", is already performing on par with "Full Finetuning", despite using only ~1% of the parameters. Furthermore, in this approach we adapt all linear modules with the same adapter size (r=8). This is a simple starting point that already produces good performance.
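As an illustration of this starting point, here is a minimal sketch using the Hugging Face peft library – an assumption for readability; the linked notebooks may wire up the adapters differently – adapting all linear modules of the transformer layers with the same rank r=8:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# "LoRA Base": adapt every linear module in the transformer layers with r=8.
# The target names match RoBERTa's attention (query/key/value/output) and
# feed-forward (intermediate/output) linear layers; alpha and dropout are
# assumed values, not taken from the article.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "output.dense", "intermediate.dense"],
    modules_to_save=["classifier"],  # classifier is trained fully, not adapted
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```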
Secondary Hyperparameters: As a point of note, we primarily search for good values for the hyperparameter r and for the modules we want to adapt. To keep things simple, we only tune a few additional hyperparameters. For the baselines these are just the learning rate and the number of epochs. We use Bayesian Optimization as the search strategy with Amazon SageMaker Automatic Model Tuning (AMT). For other hyperparameters, such as weight decay and dropout, we follow the guidance from the referenced papers. We keep those hyperparameters fixed throughout the article, so that we can isolate the impact of the hyperparameters that define the LoRA architecture, making it easier to see how our main hyperparameters influence performance.
Do you, dear reader, plan to repeat the steps from this article? Are you aiming to find the best hyperparameters for your own model, task, and dataset that you intend to use in production? If so, it would make sense to also include the secondary hyperparameters. Ideally, you should do this towards the end of your exploration and tuning effort – when you have already significantly narrowed the search scope – and then aim to further improve performance, even if just slightly.
Hyperparameters: What to tune?
Let's get started with our main activity.
The design decisions left for us in the model architecture are typically expressed as hyperparameters. For LoRA specifically, we can define which modules to adapt and how large r should be for each module's adapter.
In the last article we only suggested selecting these modules based on our understanding of the task and the architecture.
Now, we'll dive deeper. Where should we apply finetuning at all?

In the illustration above, you can see all the potential modules that we could finetune–including the classifier and the embeddings–on the left. On the right, I've made a sample selection for illustration. But how do we arrive at an actual selection? Let's look at our options from a high level:
- Classifier: It is clear that we absolutely need to train the classifier. This is because it has not been trained during pre-training and, hence, is randomly initialized for our finetuning. Furthermore, its central position makes it highly impactful on model performance, as all information must flow through it. It also has the most immediate impact on the loss calculation, which starts at the classifier. Lastly, it has few parameters, so it is efficient to train. In conclusion, we always finetune the classifier, but do not adapt it with LoRA.
- Embeddings: The embeddings reside at the bottom–close to the inputs–and carry the semantic meaning of the tokens. This is important for our downstream task. However, they are not "empty": even without finetuning, we get everything that was learned during pre-training. The question is whether finetuning the embeddings directly would give us additional abilities, and whether our downstream task would benefit from a refined understanding of the token meanings. Let's reflect: if this were the case, could this additional knowledge not also be learned in one of the layers above the embeddings, perhaps even more efficiently? Finally, the embeddings typically have lots of parameters (as the sketch after this list illustrates), so we would have to adapt them with LoRA before finetuning them. Taking both aspects together, we pass on this option and do not make the embeddings trainable (and consequently do not apply LoRA to them).
- Transformer Layers: Finetuning all parameters in the transformer layers would be inefficient. Therefore, we need to at least adapt them with LoRA to remain parameter-efficient. This leads us to consider whether we should train all layers and all components within each layer, or only some layers, some components, or specific combinations of both. There is no general answer here. We'll adapt these layers and their modules and explore the details further in this article.
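To ground the reasoning above in numbers, here is a small sketch – again assuming the Hugging Face RoBERTa base checkpoint – that counts parameters per group, showing why the embeddings and transformer layers are the expensive parts while the classifier is cheap:

```python
from collections import Counter
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Group parameter counts by component: embeddings, transformer layers
# (the encoder), and the classification head.
counts = Counter()
for name, param in model.named_parameters():
    if name.startswith("roberta.embeddings"):
        counts["embeddings"] += param.numel()
    elif name.startswith("roberta.encoder"):
        counts["transformer layers"] += param.numel()
    else:
        counts["classifier"] += param.numel()

for group, n in counts.items():
    print(f"{group:>18}: {n:,}")
```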
In the illustration above, on the right, you can see an exemplary selection of modules to finetune. This is just one combination; many others are possible. Keep in mind as well that the illustration shows only five layers, while your model likely has more. For instance, the RoBERTa base model–used in our example–has 12 layers, a number that is considered small by today's standards. Each layer also has 6 components:
- Attention: Query, Key, Value, Output
- Feed Forward: Up, Down
Even if we disregard that we also want to tune r and – for now – just focus on the binary decision of which modules to include, this leaves us with 64 (2**6) combinations per layer. And since these per-layer choices can be combined freely across the 12 layers, we end up with more than a sextillion combinations:
In [1]: (2**6)**12.
Out[1]: 4.722366482869645e+21
It's easy to see that we can't exhaustively compute all combinations, let alone explore the space manually.
Typically in computer science, we turn to the dice when we want to explore a space that is too large to investigate fully. In this case we could indeed sample from that space, but how would we interpret the results? We would get back arbitrary combinations of layers and components (at least 12*6=72 binary decisions, following the small example from above). How would we generalize from these details to find higher-level rules that align with our natural understanding of the problem space? We need to align these details with our conceptual understanding on a more abstract level.
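To see how uninformative a single random draw is, here is a tiny sketch that samples one arbitrary configuration over 12 layers and 6 components (the component names are shorthand for this example):

```python
import random

layers = 12
components = ["query", "key", "value", "attn.out", "ffn.up", "ffn.down"]

random.seed(0)
# One random point in the search space: for each layer, independently decide
# for each component whether to adapt it with LoRA.
sample = {
    layer: [c for c in components if random.random() < 0.5]
    for layer in range(layers)
}

for layer, chosen in sample.items():
    print(f"layer {layer:2d}: {chosen}")
```

The result is an arbitrary mask per layer; it is hard to read a higher-level rule out of it, which is exactly the interpretability problem described above.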
Hence, we need to consider groups of modules and look for structures or patterns that we can use in our experiments, rather than operating on a collection of individual components or layers. We need to develop an intuition about how things should work, and then formulate and test hypotheses.
Question: Does it help to experiment on defined groups of parameters in isolation? The answer is yes. These isolated groups of parameters can lead the way even though we may need to combine some of them later to achieve the best results. Testing in isolation allows us to see patterns of impact more clearly.
However, there is a risk. When these patterns are used in combination, their impact may change. That's not perfect, but let's not be so negative about it.