I Fine-Tuned the Tiny Llama 3.2 1B to Replace GPT-4o


A young pediatrician or a renowned general physician: who would treat a baby's cough better?

Both are doctors and can treat a child's cough, but the pediatrician is the specialist and will likely make the better diagnosis for the baby.

That's what fine-tuning does for smaller models. It makes tiny, weaker models solve specific problems better than the giants that claim to do everything under the sun.

I was recently in a situation where I had to pick one over the other.

I was building a query-routing bot. It routes the user query to the correct department, where a human agent would continue the conversation. Under the hood, it's a simple text classification task.

GPT-4o (and GPT-4o mini) does this incredibly well, but it's rigid and expensive. It's a closed model, so you can't fine-tune it on your own infrastructure. OpenAI does offer fine-tuning on its platform, but that's too costly for me.

Fine-tuning GPT-4o costs $25/1M training tokens, and my training data quickly ran to a few million tokens. Plus, serving a fine-tuned model costs about 50% more than the regular one.

At this cost, my tiny app can't go live. I'm on a budget crunch!

The fallback option is to train an open-source model. Open-source models perform amazingly well in classification tasks. However, training requires GPUs for a significant amount of time.

I decided to bet on tiny models.

Smaller LLMs are the only way to fine-tune at a lower cost and still get a decent outcome—exactly what I want, if it works.

Smaller models can run on cheap hardware and be fine-tuned with relatively cheap GPUs. Besides, their training and inference times are much faster than those of bigger LLMs.

There are several candidate models—Phi-3.5, DistilBERT, and GPT-Neo—but I decided to try Meta Llama 3.2's 1B model. This was a random pick. Perhaps I was influenced by the recent hype surrounding this model.


In this post, I'll compare the results of a fine-tuned Llama 3.2 1B Instruct model against GPT-4o with few-shot prompting.

Here's how I trained and what I found.

Fine-tuning Llama 3.2 1B (for free)

Yes, I spent nothing on training.

Fine-tuning can be costly unless you choose the right strategy. You can either retrain all the parameters, perform transfer learning, or perform parameter-efficient fine-tuning.

The first one, full-parameter training, is perhaps the costliest and the most dangerous. As you'd have guessed, it retrains all 1B parameters in the model, which eats a hell of a lot of time and budget. Besides, full-parameter tuning suffers from something called "catastrophic forgetting": when you fine-tune a model this way, it may lose some of the knowledge it learned during pre-training.

The second, transfer learning, is a good candidate but a bit complex.

Transfer Learning: The Highest Leverage Deep Learning Skill You Can Learn.

The last, parameter-efficient fine-tuning (PEFT), is inexpensive and effective. The idea is to train only a small fraction of the parameters.

More specifically, Low-Rank Adaptation (LoRA) is the sweet spot among fine-tuning strategies. In LoRA, we freeze the original weights and train small low-rank adapter matrices attached to a few selected layers, which makes fine-tuning fast and effective.

Another critical decision that dramatically reduces training infrastructure needs is quantization. A model's parameters can be stored as float16 values or in smaller formats, such as 4-bit integers. Smaller formats take up much less memory and are faster to compute with, but at the cost of some accuracy.
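To make this concrete, here's a minimal sketch of what loading a 4-bit quantized base model and attaching LoRA adapters looks like with Unsloth. The specific values (hub path, sequence length, rank, target modules) are illustrative assumptions on my part, not the notebook's exact settings.

from unsloth import FastLanguageModel

# Load the base model with 4-bit quantized weights to cut memory use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # assumed Unsloth hub name
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: only these small low-rank matrices get trained,
# while the quantized base weights stay frozen
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

Only the adapters are trained; the frozen 4-bit base weights just sit in memory, which is why a 1B model like this fits comfortably on a free Colab GPU.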

We've now decided upon LoRA as the fine-tuning strategy. But where do we get free infrastructure for training?

Colab or Kaggle Notebooks.

These platforms provide GPUs for free and are often sufficient to fine-tune this small model.
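If you start from a blank Colab notebook rather than Unsloth's, the setup is essentially a GPU runtime plus one install cell. The quick pip install below is my own shortcut; the official Unsloth notebook ships a pinned install cell you can copy instead.

# Confirm the free GPU runtime is active
!nvidia-smi

# Install Unsloth (quick install; the official notebook pins exact versions)
!pip install unsloth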

Fine-tuned Llama 3.2 vs. OpenAI GPT-4o with few-shot prompting

LoRA fine-tuning is very popular. I don't want to bore you with another tutorial.

Check out the Colab notebook made available by Unsloth, which has a fantastic step-by-step guide. In fact, I've been using this notebook for my fine-tuning tasks, too.

Let me tell you what to change.

Firstly, the notebook fine-tunes the 3B-parameter Llama 3.2. You can leave it as it is if that's what you want, or you can choose one of the many other available models. I've changed it to Llama-3.2-1B-Instruct because I want to test whether the smallest model is sufficient for my task.
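Concretely, that's just the model_name argument passed to the notebook's FastLanguageModel.from_pretrained cell. The checkpoint names below are the Unsloth hub mirrors I'd expect; use whichever checkpoints you prefer.

# Before (notebook default)
model_name = "unsloth/Llama-3.2-3B-Instruct"

# After
model_name = "unsloth/Llama-3.2-1B-Instruct"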

Then, there's a cell that transforms the dataset into the required format. I've changed it to use my own fine-tuning dataset.

# Before
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

# After
from datasets import Dataset
dataset = Dataset.from_json("/content/insurance_training_data.json")
dataset = dataset.map(formatting_prompts_func, batched = True,)

The easiest way is to keep your dataset in the format the notebook already expects, like the record below.

{
    "conversations": [
        {"role": "user", "content": "<customer query>"},
        {"role": "assistant", "content": "<target department>"}
    ]
}
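If your raw data is just (query, department) pairs, a small script along these lines can produce that structure. The example pairs, department labels, and file name below are made up for illustration; Dataset.from_json reads the resulting JSON Lines file.

import json

# Hypothetical raw examples: (customer query, target department) pairs
raw_examples = [
    ("I need to file a claim after a car accident.", "Claims"),
    ("Can I add my newborn to my health policy?", "Policy Services"),
]

# Convert each pair into the chat-style record the notebook expects
with open("insurance_training_data.json", "w") as f:
    for query, department in raw_examples:
        record = {
            "conversations": [
                {"role": "user", "content": query},
                {"role": "assistant", "content": department},
            ]
        }
        f.write(json.dumps(record) + "\n")  # one JSON record per line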

That's it. These two changes are sufficient to fine-tune a model based on your own data.

Evaluating the fine-tuned model

Now comes the exciting part.

LLM evaluation is a broad and interesting topic. I'd say it's also the most valuable LLM dev skill. My previous post discusses evaluating LLM apps.

The Most Valuable LLM Dev Skill is Easy to Learn, But Costly to Practice.

However, to keep things tidy, I'll use a classic confusion-matrix-style evaluation.

Adding the following code at the end of the notebook would do.

from langchain.prompts import FewShotPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel

# 1. A function to generate a response with the fine-tuned Llama model
def generate_response(user_query):
    # Enable faster inference for the language model
    FastLanguageModel.for_inference(model)

    # Define the message template
    messages = [
        {"role": "system", "content": "You are a helpful assistant who can route the following query to the relevant department."},
        {"role": "user", "content": user_query},
    ]

    # Apply the chat template to tokenize the input and prepare for generation
    tokenized_input = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Required for text generation
        return_tensors="pt"
    ).to("cuda")  # Send input to the GPU

    # Generate a response using the model
    generated_output = model.generate(
        input_ids=tokenized_input,
        max_new_tokens=64,
        use_cache=True,  # Enable cache for faster generation
        temperature=1.5,
        min_p=0.1
    )

    # Decode the generated tokens into human-readable text
    decoded_response = tokenizer.batch_decode(generated_output, skip_special_tokens=True)[0]

    # Extract the assistant's response (after system/user text)
    assistant_response = decoded_response.split("\n\n")[-1]

    return assistant_response

# 2. Generate responses with OpenAI GPT-4o

# Define the prompt template for the example
example_prompt_template = PromptTemplate.from_template(
    "User Query: {user_query}\n{department}"
)

# Initialize OpenAI LLM (ensure the OPENAI_API_KEY environment variable is set)
llm = ChatOpenAI(temperature=0, model="gpt-4o")

# Define few-shot examples
examples = [
    {"user_query": "I recently had an accident and need to file a claim for my vehicle. Can you guide me through the process?", "department": "Claims"},
    ...
]

# Create a few-shot prompt template
few_shot_prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt_template,
    prefix="You are an intelligent assistant for an insurance company. Your task is to route customer queries to the appropriate department.",
    suffix="User Query: {user_query}",
    input_variables=["user_query"]
)

# Define the department model to structure the output
class Department(BaseModel):
    department: str

# Function to predict the appropriate department based on user query
def predict_department(user_query):
    # Wrap LLM with structured output
    structured_llm = llm.with_structured_output(Department)

    # Create the chain for generating predictions
    prediction_chain = few_shot_prompt_template | structured_llm

    # Invoke the chain with the user query to get the department
    result = prediction_chain.invoke(user_query)

    return result.department

# 3. Read your evaluation dataset and predict departments
import json

with open("/content/insurance_bot_evaluation_data (1).json", "r") as f:
    eval_data = json.load(f)

for ix, item in enumerate(eval_data):
    print(f"{ix+1} of {len(eval_data)}")
    item['llama_response'] = generate_response(item['user_query'])      # fine-tuned Llama prediction
    item['open_ai_response'] = predict_department(item['user_query'])   # GPT-4o few-shot prediction

# 4. Compute the precision, recall, accuracy, and F1 scores for the predictions. 

# 4.1 Using OpenAI GPT-4o
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

true_labels = [item['department'] for item in eval_data]
predicted_labels_openai = [item['open_ai_response'] for item in eval_data]

# Calculate the scores for open_ai_response
precision_openai = precision_score(true_labels, predicted_labels_openai, average='weighted')
recall_openai = recall_score(true_labels, predicted_labels_openai, average='weighted')
accuracy_openai = accuracy_score(true_labels, predicted_labels_openai)
f1_openai = f1_score(true_labels, predicted_labels_openai, average='weighted')

print("OpenAI Response Scores:")
print("Precision:", precision_openai)
print("Recall:", recall_openai)
print("Accuracy:", accuracy_openai)
print("F1 Score:", f1_openai)

# 4.2 Using Fine-tuned Llama 3.2 1B Instruct
true_labels = [item['department'] for item in eval_data]
predicted_labels_llama = [item['llama_response'] for item in eval_data]

# Calculate the scores for llama_response
precision_llama = precision_score(true_labels, predicted_labels_llama, average='weighted', zero_division=0)
recall_llama = recall_score(true_labels, predicted_labels_llama, average='weighted', zero_division=0)
accuracy_llama = accuracy_score(true_labels, predicted_labels_llama)
f1_llama = f1_score(true_labels, predicted_labels_llama, average='weighted', zero_division=0)

print("Llama Response Scores:")
print("Precision:", precision_llama)
print("Recall:", recall_llama)
print("Accuracy:", accuracy_llama)
print("F1 Score:", f1_llama)

The above code is pretty self-explanatory. We create a function that uses our fine-tuned model to predict departments, and another that does the same with OpenAI's GPT-4o via few-shot prompting.

Then, we use these functions to generate predictions for our evaluation dataset.

The evaluation dataset holds the expected class for each query, and we now have the predicted classes, which is all we need to compute the metrics. We do that in parts 4.1 and 4.2 of the code.

Here are the results:

OpenAI Response Scores:
Precision: 0.9
Recall: 0.75
Accuracy: 0.75
F1 Score: 0.818

Llama Response Scores:
Precision: 0.88
Recall: 0.73
Accuracy: 0.79
F1 Score: 0.798

The results show that the fine-tuned model comes close to GPT-4o. That's pretty impressive for a tiny 1B-parameter model.

Of course, GPT-4o does it better, but only by a small margin.

Also, we could have given GPT-4o more examples in the few-shot prompt, which would likely have produced better outcomes. But some of my examples are paragraphs long, and since OpenAI charges for input tokens, the cost would skyrocket.

Final Thoughts

I've become a fan of small LLMs. They are fast, cheap, and sufficient for most use cases – if you fine-tune them.

In this post, I've discussed how I fine-tuned Llama 3.2 1B, a model that can run on relatively cheap hardware and is free to fine-tune. The task I had on hand was text classification.

Yet, this is not to say that smaller models will consistently outperform giants like GPT-4o—or even Meta Llama's own 8B, 11B, and 90B parameter models. Larger models have superpowers like multilingual understanding, vision capabilities, and broad world knowledge.

My point is, if those superpowers aren't your concern, why not a tiny LLM?


Thanks for reading, friend! Besides Medium, I'm on LinkedIn and X, too!

Tags: Data Science Large Language Models Machine Learning Open Source OpenAI
