Fine-tune Llama 3 with ORPO

Author:Murphy | View: 21749 | Time: 2025-03-22 22:01:06

ORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results demonstrate that ORPO outperforms other alignment methods on various model sizes and benchmarks.

In this article, we will fine-tune the new Llama 3 8B model using ORPO with the TRL library. The code is available on Google Colab and in the LLM Course on GitHub.

⚖️ ORPO

Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: 1/ Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by 2/ preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.
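To make the two stages concrete, here is a minimal sketch of the data each one typically consumes. The records are made up for illustration, and the field names (`prompt`, `completion`, `chosen`, `rejected`) are an assumption that mirrors a common convention rather than a requirement of any particular library.

```python
# Hypothetical records, invented purely for illustration.

# Stage 1 (SFT): one target answer per instruction.
sft_example = {
    "prompt": "Explain what a hash map is.",
    "completion": "A hash map stores key-value pairs and looks keys up via a hash function.",
}

# Stage 2 (preference alignment): a preferred and a rejected answer per instruction.
preference_example = {
    "prompt": "Explain what a hash map is.",
    "chosen": "A hash map stores key-value pairs and looks keys up via a hash function.",
    "rejected": "It is a kind of map.",
}


def to_sft(example: dict) -> dict:
    """Derive an SFT-style record from a preference record by keeping the chosen answer."""
    return {"prompt": example["prompt"], "completion": example["chosen"]}


print(to_sft(preference_example))
```

Note that a preference record already contains everything an SFT record needs: the chosen answer can double as the supervised target.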

However, researchers have identified a limitation in this approach. While SFT effectively adapts the model to the desired domain, it inadvertently increases the probability of generating undesirable answers alongside preferred ones. This is why the preference alignment stage is necessary to widen the gap between the likelihoods of preferred and rejected outputs.

Note how the probability of rejected responses increases during supervised fine-tuning (image from the ORPO paper).

Introduced by Hong, Lee, and Thorne (2024), ORPO offers an elegant solution to this problem by combining instruction tuning and preference alignment into a single, monolithic training process. ORPO modifies the standard language modeling objective, combining the negative log-likelihood loss with an odds ratio (OR) term. This OR loss weakly penalizes rejected responses while strongly rewarding preferred ones, allowing the model to simultaneously learn the target task and align with human preferences.
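The objective described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the TRL implementation. It assumes each response is summarized by its average per-token log-probability (a negative number), defines odds(y|x) = P(y|x) / (1 - P(y|x)), and uses an OR term of the form -log σ(log odds(chosen) - log odds(rejected)). The function names and the weight `lam=0.1` are illustrative choices.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is the length-normalized log-probability of a response
    and must be strictly negative so that p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Sketch of the ORPO objective for a single (chosen, rejected) pair.

    lam weights the odds ratio term; 0.1 is an illustrative value.
    """
    # SFT term: negative log-likelihood of the preferred response.
    nll = -avg_logp_chosen
    # Log odds ratio between the preferred and the rejected response.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # OR term: -log(sigmoid(z)) written as log(1 + exp(-z)).
    or_loss = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * or_loss


# The loss is lower when the model favors the chosen response.
good = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
bad = orpo_loss(avg_logp_chosen=-2.0, avg_logp_rejected=-0.5)
print(good < bad)
```

Because the SFT term and the OR term are summed into one loss, a single pass over preference data updates the model for both goals, and no separate reference model is needed.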

ORPO has been implemented in the major fine-tuning libraries, like TRL, Axolotl, and LLaMA-Factory. In the next section, we will see how to use it with TRL.