Fine-Tune Llama 3.2 for Powerful Performance on Targeted Tasks
In this article, I discuss how to run Llama 3.2 locally and fine-tune it to increase its performance on specific tasks. Working with large language models has become a critical part of any data scientist's or ML engineer's job, and fine-tuning them can lead to powerful improvements in their capabilities. This article will thus show you how to fine-tune Llama 3.2 to improve its performance within a targeted domain.

Motivation
My motivation for this article is that I want to spend more time working with large language models and figure out how to utilize them effectively. There are many options for utilizing large language models effectively, such as prompt tuning, RAG systems, or function calling. However, fine-tuning a model is also a valid option, though it requires more effort than the three options I just mentioned. Fine-tuning a large language model requires a solid GPU, training data (which may take a lot of manual work to create), and a training script. Luckily, the Unsloth library makes fine-tuning a lot simpler, and it is the package I will be using in this article.
The goal of this article is to show you how to set up a training run to fine-tune Llama 3.2, which can help you solve more complex challenges with large language models. In general, I am a strong believer in fine-tuning large language models, as I think a more specialized model (such as a fine-tuned model) should consistently outperform a more generalized model (such as GPT-4o) within the specialized model's domain. This opens up the possibility of creating high-performing, specialized models within specific fields, which helps extract further value from large language models.
To work with a specific example in this article, I will generate a dataset and fine-tune Llama 3.2 to provide concise answers to its prompts. Since this is a simple task, you can fine-tune on a smaller GPU (I am, for example, using a GTX 1660 Super with 6 GB of VRAM, and one training session took around 1.5 hours). This will help you test out fine-tuning of large language models on a smaller scale before you increase the experiment size to create more powerful fine-tuned models.
Table of Contents
· Motivation
· Table of Contents
· A note on running from Windows
∘ Ensuring NVIDIA driver is available
∘ Install C compiler
∘ Install Python-dev environment
· Creating a dataset
· Setting up the training script to fine-tune Llama 3.2
· Testing the model
∘ Performing inference
∘ Comparing results
· Conclusion
A note on running from Windows
If you are not running the code on a Windows operating system, you can skip to the next section. This section will simply help you solve some Windows-specific problems when working with the contents of this article.
I am fine-tuning Llama 3.2 on Windows, but running the Unsloth library requires the Triton package, which is unavailable on Windows. I therefore have to use the Windows Subsystem for Linux (WSL). Luckily, you can easily download WSL from the Microsoft Store by searching for WSL and downloading one of the Ubuntu distributions. You can then run the application, which will open a terminal, allowing you to work in a Linux environment.
Fine-tuning a large language model requires a GPU, as training on a CPU is too time-consuming. Accessing the GPU from WSL, however, is not trivial and is a problem I struggled a lot with while working on this article. Below, I summarize the main steps that allowed me to run the Unsloth library with access to my GPU on a Windows machine.
After downloading WSL, you can enter your C drive with the command:
cd /mnt/c/
Where you can access all your normal files located in your C drive.
Ensuring NVIDIA driver is available
You can verify that you have access to your GPU with the command:
nvidia-smi
Which should print out something similar to the below:

If you can see your GPU like in the image above, you are all good. If you get an error message, however, you need to download an NVIDIA driver for your GPU from the NVIDIA website.
Furthermore, you should verify that the CUDA toolkit is available with:
nvcc --version
Which should print something like in the image below:

Install C compiler
I also had to install a C compiler with:
sudo apt-get update
sudo apt-get install build-essential
Install Python-dev environment
Finally, I also had to install Python-dev with:
sudo apt-get update && sudo apt-get install python3-dev
These were all the steps I had to take to be able to run Unsloth on my Windows machine. Your setup might differ, as I may have had some packages pre-installed before attempting to make the Unsloth library work from WSL. Let me know if you have questions about running Unsloth on your Windows machine. If you are running on macOS or Linux directly, this should be far simpler.
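Once you have installed the Python packages listed later in this article, a quick way to confirm that PyTorch (and therefore Unsloth) can actually see your GPU from inside WSL is:
import torch

# Should print True and the name of your GPU if the driver setup works
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))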
Creating a dataset
Creating a dataset for fine-tuning a large language model is often time-consuming, though it depends heavily on the problem at hand. In this article, I will attempt to keep it as simple as possible by having Llama 3.2 learn to answer math questions with only the number and nothing else (no text explaining the answer, for example). Creating a dataset for this could, for example, be done by prompting GPT to generate one and then inspecting it to ensure its correctness. To quickly generate a large dataset, however, I will generate random numbers, add them together using Python, and thus create a large dataset of math questions and answers.
You can create the dataset with the following code:
import pandas as pd
import random

def create_random_math_question():
    # select 2 random numbers and add them together
    num1 = random.randint(1, 1000)
    num2 = random.randint(1, 1000)
    res = num1 + num2
    return num1, num2, res

dataset = []
for _ in range(10000):
    num1, num2, res = create_random_math_question()
    prompt = f"What is the answer to the following math question: {num1} + {num2}?"
    dataset.append((prompt, str(res)))

df = pd.DataFrame(dataset, columns=["prompt", "target"])
Which results in a dataframe such as:

Now, I need to ensure the data is in the correct format. Looking into the dataset Labonne used to fine-tune his model, each row formats the conversation as:
[{"from": "human", "value": ""}, {"from": "gpt", "value": ""}]
You can do this conversion with the following Python code:
new_dataset = []
for prompt, response in dataset:
new_row = [{"from": "human", "value": prompt}, {"from": "gpt", "value": response}]
new_dataset.append(new_row)
df = pd.DataFrame({'conversations': new_dataset})
df.to_pickle("math_dataset.pkl")
It is important to save to pickle and not CSV: if you save to CSV, the lists in the dataframe will be converted to strings when you read the file back.
The dataframe now looks like this:

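To make the pickle-versus-CSV point concrete, here is a small check you can run (it is not part of the pipeline, just an illustration): reading the dataframe back from CSV gives you strings, while pickle preserves the original Python lists.
df.to_csv("math_dataset.csv", index=False)

# CSV stores the conversations as text, so they come back as strings
df_csv = pd.read_csv("math_dataset.csv")
print(type(df_csv["conversations"].iloc[0]))  # <class 'str'>

# Pickle preserves the Python objects, so the lists of dicts survive
df_pkl = pd.read_pickle("math_dataset.pkl")
print(type(df_pkl["conversations"].iloc[0]))  # <class 'list'>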
Finally, convert it to a dataset object using the code below. First, install the datasets package.
pip install datasets
Then run the code:
from datasets import Dataset
dataset = Dataset.from_pandas(df)
You now have a dataset ready to fine-tune the model.
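As a quick sanity check, you can inspect a single row to confirm the conversation structure survived the conversion:
print(dataset[0])
# {'conversations': [{'from': 'human', 'value': 'What is the answer to ...'}, {'from': 'gpt', 'value': '...'}]}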
Setting up the training script to fine-tune Llama 3.2
This article from Maxime Labonne on fine-tuning Llama 3.1 was a massive inspiration for the present one. It showed me how easily a fine-tuning process can be set up. I also highly recommend his articles, as he consistently produces high-quality content.
Thus, I will be reusing a lot of the code from Labonne's article. The biggest changes are the model I am using and the training data I am fine-tuning it on.
First, you must install the required packages:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
And import the packages:
import pandas as pd
import torch
from trl import SFTTrainer
from datasets import Dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported
Now, you can load the model. I had to make some changes compared to Labonne's article. First, I lowered the max sequence length to save memory when training (a lower sequence length requires less memory). Additionally, I load the Llama 3.2 model from Unsloth, choosing the smallest available version since it requires less memory to train and to run inference on.
max_seq_length = 1024

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
To train the model, you load a parameter-efficient version of it: since we are fine-tuning with LoRA, only a small set of low-rank adapter weights is trained, while the original model weights stay frozen. You can read an excellent article on LoRA here.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
)
Now load the dataset you created:
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

df = pd.read_pickle("math_dataset.pkl")
dataset = Dataset.from_pandas(df)
dataset = dataset.map(apply_template, batched=True)
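It is worth printing one formatted example to verify that the ChatML template was applied correctly before starting a training run:
# Inspect a single formatted training example
print(dataset[0]["text"])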
And finally, you can run the training:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

print("Training")
trainer.train()

now = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
model.save_pretrained_merged(f"model_{now}", tokenizer, save_method="merged_16bit")
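The merged 16-bit model above is convenient for standalone inference, but the trainer also writes checkpoints containing the LoRA adapter weights under output_dir (which is what I load in the inference script below), and you can save the adapter separately yourself as well. A minimal sketch, assuming the model behaves like a standard PEFT model, where save_pretrained writes only the small adapter files:
# Save only the LoRA adapter weights (a small fraction of the full model size)
model.save_pretrained(f"adapter_{now}")
tokenizer.save_pretrained(f"adapter_{now}")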
Testing the model
Performing inference
I will now perform a qualitative test of the fine-tuned model. I will give it math questions and compare responses from the baseline model and the fine-tuned model to see how the fine-tuning has changed the model.
I generate a test set with the following code, similar to how I made the training dataset:
import pandas as pd
import random

def create_random_math_question():
    # select 2 random numbers and add them together
    num1 = random.randint(1, 1000)
    num2 = random.randint(1, 1000)
    res = num1 + num2
    return num1, num2, res

dataset = []
for _ in range(1000):
    num1, num2, res = create_random_math_question()
    prompt = f"What is the answer to the following math question: {num1} + {num2}?"
    dataset.append((prompt, str(res)))

new_dataset = []
for prompt, response in dataset:
    # store the prompt as a chat message and the ground truth answer as a plain string
    new_dataset.append(({"from": "human", "value": prompt}, response))

df = pd.DataFrame(new_dataset, columns=["prompt", "gt"])
df.to_pickle("math_dataset_test.pkl")
I then run the test prompts through my fine-tuned model with:
"""
File to run inference on a test sample of documents for new fine-tuned model
"""
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments, TextStreamer
from peft import PeftModel
import pandas as pd
SAVED_MODEL_FOLDER = "model" # TODO update this to the folder your main model is saved to
SAVED_ADAPTER_FOLDER = "output/checkpoint-24-own-dataset" # TODO update this to the folder your adapter model is saved to
max_seq_length = 1024
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=SAVED_MODEL_FOLDER,
max_seq_length=max_seq_length,
load_in_4bit=True,
dtype=None,
)
model = FastLanguageModel.for_inference(model)
model = PeftModel.from_pretrained(model, SAVED_ADAPTER_FOLDER)
df_test = pd.read_pickle("math_dataset_test.pkl")
messages = df_test["prompt"].tolist()
responses = []
for message in messages:
message = [message] # must wrap in a list
inputs = tokenizer.apply_chat_template(
message,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer)
response = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=64, use_cache=True)
# the response is a list of tokens, decode those into text
response_text = tokenizer.decode(response[0], skip_special_tokens=True)
responses.append(response_text)
# save responses to pickle
df_test["response_finetuned_model"] = responses
now = pd.Timestamp.now()
df_test.to_pickle(f"math_dataset_test_finetuned_{now}.pkl")
To test my non-fine-tuned model, I remove the following line, which is the code that loads the adapter weights:
model = PeftModel.from_pretrained(model, SAVED_ADAPTER_FOLDER)
If you have read about LoRA, you will know how this works, but if you have not, I will briefly explain it here. When you run the fine-tuning, you do not have to store the complete model again; you only have to store the weight change (delta W), which LoRA represents with two small low-rank matrices. Since LoRA, in this case, only trained around 0.5% of the model's parameters, the adapter weights take up roughly 0.5% of the space of the whole model, saving a ton of storage space. The line above adds the adapter weights on top of the original model weights, which results in your fine-tuned model.
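To make the storage argument concrete, here is a minimal NumPy sketch (with made-up dimensions, and ignoring the lora_alpha / r scaling factor) of how a LoRA update is stored: instead of a full delta W, you keep two small low-rank matrices A and B whose product reconstructs the weight change.
import numpy as np

d_out, d_in, r = 2048, 2048, 16      # illustrative layer size and LoRA rank
W = np.random.randn(d_out, d_in)     # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01  # trained LoRA matrix A
B = np.zeros((d_out, r))             # trained LoRA matrix B (starts at zero)

delta_W = B @ A                      # the weight change, never stored at full size
W_finetuned = W + delta_W            # loading the adapter on top of the base model gives you this

full_params = W.size
lora_params = A.size + B.size
print(f"The adapter stores {lora_params / full_params:.1%} of this layer's parameters")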
Comparing results
After running inference on the test set for both the original model and the fine-tuned model, I can now compare the results. I will provide some random outputs from each model here, out of the 200 prompts I tested each model on. You can see the results in the image below:

As you can see from the image above, the fine-tuning has had its desired effect: in contrast to the original model, the fine-tuned model responds with only the answer and does not provide a full explanation.
You should note, however, that the fine-tuned model will perform very well in this case since the test set is very similar to the train set (the only difference is which numbers are being added). Furthermore, you could likely achieve similar behavior with prompt tuning, for example, by prompting the model to respond only with the answer and no explanations. However, this is just an example showcasing how you can fine-tune a large language model to modify its behavior. In practice, you would do this on a larger scale to optimize the model for your desired use case.
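If you want a number instead of eyeballing outputs, a simple exact-match check against the ground truth works well for this task. Below is a rough sketch; it assumes the column names from the pickles saved above and that the model's answer is the last non-empty line of the decoded text, which may need adjusting for your outputs.
import pandas as pd

df = pd.read_pickle("math_dataset_test_finetuned_<timestamp>.pkl")  # fill in your own timestamp

def extract_answer(generated_text: str) -> str:
    # the decoded output still contains the prompt, so take the last non-empty line
    lines = [line.strip() for line in generated_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

correct = sum(
    extract_answer(response) == gt
    for response, gt in zip(df["response_finetuned_model"], df["gt"])
)
print(f"Exact-match accuracy: {correct / len(df):.1%}")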
Conclusion
In this article, I have discussed how you can fine-tune Llama 3.2 to modify its behavior. I first discussed my motivation for working on fine-tuning a large language model: large language models are a quickly developing field and can be used for a large variety of tasks, and being able to tune them to perform better on specific tasks can be immensely powerful and lead to promising results. Furthermore, I discussed the difficulties I faced accessing a GPU through WSL and how you can resolve those issues. I then walked through the training code to run the fine-tuning, and finally, I ran a test showcasing that the fine-tuning had its desired effect, making the model give more concise answers.