Fine-Tune Llama 3.2 for Powerful Performance on Targeted Tasks
In this article, I discuss how to run Llama 3.2 locally and fine-tune it to increase its performance on specific tasks. Working with large language models has become a critical part of any data scientist's or ML engineer's job, and fine-tuning them can lead to powerful improvements in their capabilities. This article will thus show you how to fine-tune Llama 3.2 to improve its performance within a targeted domain.

Motivation
My motivation for this article is that I want to spend more time working with large language models and figure out how to utilize them effectively. There are many options for utilizing large language models effectively, such as prompt tuning, RAG systems, or function calling. However, fine-tuning a model is also a valid option, though it requires more effort than the three options I just mentioned. Fine-tuning a large language model requires a solid GPU, training data (which may take a lot of manual work to create), and a training script. Luckily, the Unsloth library makes fine-tuning a lot simpler, and it is the package I will be using in this article.
The goal of this article is to show you how to set up a training run to fine-tune Llama 3.2, which can help you solve more complex challenges with large language models. In general, I am a strong believer in fine-tuning large language models, as I think a more specialized model (such as a fine-tuned model) should consistently outperform a more generalized model (such as GPT-4o) within the specialized model's domain. This opens up the possibility of creating high-performing, specialized models within specific fields, which helps extract further value from large language models.
To work with a specific example in this article, I will generate a dataset and fine-tune Llama 3.2 to provide concise answers to its prompts. Since this is a simple task, you can fine-tune on a smaller GPU (I am, for example, using a GTX 1660 Super with 6 GB of VRAM, and one training session took around 1.5 hours). This will help you test out fine-tuning of large language models on a smaller scale before you increase the experiment size to create more powerful fine-tuned models.
Table of Contents
· Motivation
· Table of Contents
· A note on running from Windows
∘ Ensuring NVIDIA driver is available
∘ Install C compiler
∘ Install Python-dev environment
· Creating a dataset
· Setting up the training script to fine-tune Llama 3.2
· Testing the model
∘ Performing inference
∘ Comparing results
· Conclusion
A note on running from Windows
If you are not running the code on a Windows operating system, you can skip to the next section. This section will simply help you solve some Windows-specific problems when working with the contents of this article.
I am fine-tuning Llama 3.2 on Windows, but running the Unsloth library requires the Triton package, which is unavailable on Windows. I therefore have to use the Windows Subsystem for Linux (WSL). Luckily, you can easily download WSL from the Microsoft Store by searching for WSL and downloading one of the Ubuntu distributions. You can then run the application, which will open a terminal, allowing you to work in a Linux environment.
Fine-tuning a large language model requires a GPU, as training on a CPU is too time-consuming. Accessing the GPU from WSL, however, is not trivial and is a problem I struggled a lot with while working on this article. Below, I summarize the main steps that allowed me to run the Unsloth library with access to my GPU on a Windows machine.
After downloading WSL, you can enter your C drive with the command:
cd /mnt/c/
Where you can access all your normal files located in your C drive.
Ensuring NVIDIA driver is available
You can verify that you have access to your GPU with the command:
nvidia-smi
Which should print out something similar to the below:

If you can see your GPU like in the image above, you are all good. If you get an error message, however, you need to download an NVIDIA driver for your GPU from the NVIDIA website.
Furthermore, you should verify that the CUDA toolkit is available with:
nvcc --version
Which should print something like in the image below:

Install C compiler
I also had to install a C compiler with:
sudo apt-get update
sudo apt-get install build-essential
Install Python-dev environment
Finally, I also had to install Python-dev with:
sudo apt-get update && sudo apt-get install python3-dev
These were all the steps I had to take to be able to run Unsloth on my Windows machine. Your setup might differ, as I may have had some packages pre-installed before attempting to make the Unsloth library work from WSL. Let me know if you have questions about running Unsloth on your Windows machine. If you are running on macOS or Linux directly, this should be far simpler.
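Once you have installed the Python packages listed later in this article, a quick way to confirm that PyTorch (and therefore Unsloth) can actually see your GPU from inside WSL is:
import torch

# Should print True and the name of your GPU if the driver setup works
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))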
Creating a dataset
Creating a dataset for fine-tuning a large language model is often time-consuming, though it depends heavily on the problem at hand. In this article, I will attempt to keep it as simple as possible by having Llama 3.2 learn to answer math questions with only the number and nothing else (no text explaining the answer, for example). Creating a dataset for this could, for example, be done by prompting GPT to generate one and then inspecting it to ensure its correctness. To quickly generate a large dataset, however, I will generate random numbers, add them together using Python, and thus create a large dataset of math questions and answers.
You can create the dataset with the following code:
import pandas as pd
import random

def create_random_math_question():
    # select 2 random numbers and add them together
    num1 = random.randint(1, 1000)
    num2 = random.randint(1, 1000)
    res = num1 + num2
    return num1, num2, res

dataset = []
for _ in range(10000):
    num1, num2, res = create_random_math_question()
    prompt = f"What is the answer to the following math question: {num1} + {num2}?"
    dataset.append((prompt, str(res)))

df = pd.DataFrame(dataset, columns=["prompt", "target"])
Which results in a dataframe such as:

Now, I need to ensure the data is in the correct format. Looking into the dataset Labonne used to fine-tune his model, each row formats the conversation as:
[{"from": "human", "value": ""}, {"from": "gpt", "value": ""}]
You can do this conversion with the following Python code:
new_dataset = []
for prompt, response in dataset:
new_row = [{"from": "human", "value": prompt}, {"from": "gpt", "value": response}]
new_dataset.append(new_row)
df = pd.DataFrame({'conversations': new_dataset})
df.to_pickle("math_dataset.pkl")
It is important to save to pickle and not CSV: if you save to CSV, the lists in the dataframe will be converted to strings when you read the file back.
The dataframe now looks like this:

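To make the pickle-versus-CSV point concrete, here is a small check you can run (it is not part of the pipeline, just an illustration): reading the dataframe back from CSV gives you strings, while pickle preserves the original Python lists.
df.to_csv("math_dataset.csv", index=False)

# CSV stores the conversations as text, so they come back as strings
df_csv = pd.read_csv("math_dataset.csv")
print(type(df_csv["conversations"].iloc[0]))  # <class 'str'>

# Pickle preserves the Python objects, so the lists of dicts survive
df_pkl = pd.read_pickle("math_dataset.pkl")
print(type(df_pkl["conversations"].iloc[0]))  # <class 'list'>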
Finally, convert it to a dataset object using the code below. First, install the datasets package.
pip install datasets
Then run the code:
from datasets import Dataset
dataset = Dataset.from_pandas(df)
You now have a dataset ready to fine-tune the model.
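As a quick sanity check, you can inspect a single row to confirm the conversation structure survived the conversion:
print(dataset[0])
# {'conversations': [{'from': 'human', 'value': 'What is the answer to ...'}, {'from': 'gpt', 'value': '...'}]}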
Setting up the training script to fine-tune Llama 3.2
This article from Maxime Labonne on fine-tuning Llama 3.1 was a massive inspiration for the present one. It showed me how easily a fine-tuning process can be set up. I also highly recommend his articles, as he consistently produces high-quality content.
Thus, I will be reusing a lot of the code from Labonne's article. The biggest changes are the model I am using and the training data I am fine-tuning it on.
First, you must install the required packages:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
And import the packages:
import pandas as pd
import torch
from trl import SFTTrainer
from datasets import Dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported
Now, you can load the model. I had to make some changes compared to Labonne's article. First, I lowered the max sequence length to save memory when training (a lower sequence length requires less memory). Additionally, I load the Llama 3.2 model from Unsloth, choosing the smallest available version since it requires less memory to train and to run inference on.
max_seq_length = 1024

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
To train the model, you load a parameter-efficient version of it: since we are fine-tuning with LoRA, only a small set of low-rank adapter weights is trained, while the original model weights stay frozen. You can read an excellent article on LoRA here.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
)
Now load the dataset you created:
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

df = pd.read_pickle("math_dataset.pkl")
dataset = Dataset.from_pandas(df)
dataset = dataset.map(apply_template, batched=True)
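It is worth printing one formatted example to verify that the ChatML template was applied correctly before starting a training run:
# Inspect a single formatted training example
print(dataset[0]["text"])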
And finally, you can run the training:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

print("Training")
trainer.train()

now = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
model.save_pretrained_merged(f"model_{now}", tokenizer, save_method="merged_16bit")
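The merged 16-bit model above is convenient for standalone inference, but the trainer also writes checkpoints containing the LoRA adapter weights under output_dir (which is what I load in the inference script below), and you can save the adapter separately yourself as well. A minimal sketch, assuming the model behaves like a standard PEFT model, where save_pretrained writes only the small adapter files:
# Save only the LoRA adapter weights (a small fraction of the full model size)
model.save_pretrained(f"adapter_{now}")
tokenizer.save_pretrained(f"adapter_{now}")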
Testing the model
Performing inference
I will now perform a qualitative test of the fine-tuned model. I will give it math questions and compare responses from the baseline model and the fine-tuned model to see how the fine-tuning has changed the model.
I generate a test set with the following code, similar to how I made the training dataset:
import pandas as pd
import random

def create_random_math_question():
    # select 2 random numbers and add them together
    num1 = random.randint(1, 1000)
    num2 = random.randint(1, 1000)
    res = num1 + num2
    return num1, num2, res

dataset = []
for _ in range(1000):
    num1, num2, res = create_random_math_question()
    prompt = f"What is the answer to the following math question: {num1} + {num2}?"
    dataset.append((prompt, str(res)))

new_dataset = []
for prompt, response in dataset:
    # store the prompt as a chat message and the ground truth answer as a plain string
    new_dataset.append(({"from": "human", "value": prompt}, response))

df = pd.DataFrame(new_dataset, columns=["prompt", "gt"])
df.to_pickle("math_dataset_test.pkl")
I then run the test prompts through my fine-tuned model with:
"""
File to run inference on a test sample of documents for new fine-tuned model
"""
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments, TextStreamer
from peft import PeftModel
import pandas as pd
SAVED_MODEL_FOLDER = "model" # TODO update this to the folder your main model is saved to
SAVED_ADAPTER_FOLDER = "output/checkpoint-24-own-dataset" # TODO update this to the folder your adapter model is saved to
max_seq_length = 1024
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=SAVED_MODEL_FOLDER,
max_seq_length=max_seq_length,
load_in_4bit=True,
dtype=None,
)
model = FastLanguageModel.for_inference(model)
model = PeftModel.from_pretrained(model, SAVED_ADAPTER_FOLDER)
df_test = pd.read_pickle("math_dataset_test.pkl")
messages = df_test["prompt"].tolist()
responses = []
for message in messages:
message = [message] # must wrap in a list
inputs = tokenizer.apply_chat_template(
message,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer)
response = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=64, use_cache=True)
# the response is a list of tokens, decode those into text
response_text = tokenizer.decode(response[0], skip_special_tokens=True)
responses.append(response_text)
# save responses to pickle
df_test["response_finetuned_model"] = responses
now = pd.Timestamp.now()
df_test.to_pickle(f"math_dataset_test_finetuned_{now}.pkl")
To test my non-fine-tuned model, I remove the following line, which is the code that loads the adapter weights:
model = PeftModel.from_pretrained(model, SAVED_ADAPTER_FOLDER)
If you have read about LoRA, you will know how this works, but if you have not, I will briefly explain it here. When you run the fine-tuning, you do not have to store the complete model again; you only have to store the weight change (delta W), which LoRA represents with two small low-rank matrices. Since LoRA, in this case, only trained around 0.5% of the model's parameters, the adapter weights take up roughly 0.5% of the space of the whole model, saving a ton of storage space. The line above adds the adapter weights on top of the original model weights, which results in your fine-tuned model.
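To make the storage argument concrete, here is a minimal NumPy sketch (with made-up dimensions, and ignoring the lora_alpha / r scaling factor) of how a LoRA update is stored: instead of a full delta W, you keep two small low-rank matrices A and B whose product reconstructs the weight change.
import numpy as np

d_out, d_in, r = 2048, 2048, 16      # illustrative layer size and LoRA rank
W = np.random.randn(d_out, d_in)     # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01  # trained LoRA matrix A
B = np.zeros((d_out, r))             # trained LoRA matrix B (starts at zero)

delta_W = B @ A                      # the weight change, never stored at full size
W_finetuned = W + delta_W            # loading the adapter on top of the base model gives you this

full_params = W.size
lora_params = A.size + B.size
print(f"The adapter stores {lora_params / full_params:.1%} of this layer's parameters")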
Comparing results
After running inference on the test set for both the original model and the fine-tuned model, I can now compare the results. I will provide some random outputs from each model here, out of the 200 prompts I tested each model on. You can see the results in the image below:

As you can see from the image above, the fine-tuning has had its desired effect: in contrast to the original model, the fine-tuned model responds with only the answer and does not provide a full explanation.
You should note, however, that the fine-tuned model will perform very well in this case since the test set is very similar to the train set (the only difference is which numbers are being added). Furthermore, you could likely achieve similar behavior with prompt tuning, for example, by prompting the model to respond only with the answer and no explanations. However, this is just an example showcasing how you can fine-tune a large language model to modify its behavior. In practice, you would do this on a larger scale to optimize the model for your desired use case.
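If you want a number instead of eyeballing outputs, a simple exact-match check against the ground truth works well for this task. Below is a rough sketch; it assumes the column names from the pickles saved above and that the model's answer is the last non-empty line of the decoded text, which may need adjusting for your outputs.
import pandas as pd

df = pd.read_pickle("math_dataset_test_finetuned_<timestamp>.pkl")  # fill in your own timestamp

def extract_answer(generated_text: str) -> str:
    # the decoded output still contains the prompt, so take the last non-empty line
    lines = [line.strip() for line in generated_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

correct = sum(
    extract_answer(response) == gt
    for response, gt in zip(df["response_finetuned_model"], df["gt"])
)
print(f"Exact-match accuracy: {correct / len(df):.1%}")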
Conclusion
In this article, I have discussed how you can fine-tune Llama 3.2 to modify its behavior. I first discussed my motivation for working on fine-tuning a large language model: large language models are a quickly developing field and can be used for a large variety of tasks, and being able to tune them to perform better on specific tasks can be immensely powerful and lead to promising results. Furthermore, I discussed the difficulties I faced accessing a GPU through WSL and how you can resolve those issues. I then walked through the training code to run the fine-tuning, and finally, I ran a test showcasing that the fine-tuning had its desired effect, making the model give more concise answers.