LoRA Fine-Tuning On Your Apple Silicon MacBook


As models become smaller, more and more consumer computers are capable of running LLMs locally. This dramatically lowers the barrier for people training their own models and opens the door to trying more training techniques.

One consumer computer that can run LLMs locally quite well is an Apple Mac. Apple took advantage of its custom silicon to create an array-processing library called MLX, which lets Macs run LLMs more efficiently than many other consumer machines.

In this blog post, I'll explain at a high level how MLX works, then show you how to fine-tune your own LLM locally using MLX. Finally, we'll speed up our fine-tuned model using quantization.

Let's dive in!

MLX Background

What is MLX (and who can use it?)

MLX is an open-source library from Apple that lets Mac users run programs that operate on large tensors more efficiently. Naturally, when we want to train or fine-tune a model, this library comes in handy.

MLX works by being very efficient with memory transfers between your Central Processing Unit (CPU), Graphics Processing Unit (GPU), and the Memory Management Unit (MMU). On most architectures, some of the most time-intensive operations are the ones that move data between different pools of memory. Nvidia reduces these transfers by putting large amounts of fast SRAM directly on its GPUs. Apple instead designed its silicon so that the CPU and GPU access the same unified memory through the MMU, so the GPU doesn't have to copy data into its own memory before operating on it. This design is part of Apple's System on a Chip (SoC) approach, which typically requires building the chip in-house rather than combining pre-built parts from other manufacturers.

Image by Author – Memory Access Patterns for SOC vs Standard

Because Apple now designs its own silicon, it can write low-level software that makes highly efficient use of it. This does mean, however, that anyone using a Mac with an Intel processor will not be able to use this library.

Installing MLX

Once you have an Apple Silicon Mac, there are a few ways to install MLX. I'll show you how to use a Python virtual environment, but note that you can also install it via a separate environment manager like conda.

In our terminal, we'll start by creating a virtual environment named venv and then step into it.

python -m venv venv;
source ./venv/bin/activate

Now that our environment is set, we use pip to install MLX along with the mlx-lm package, which provides the language-model tooling (generation, LoRA training, conversion) used throughout this post:

pip install mlx mlx-lm
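
As a quick check that everything installed correctly, and to see the unified memory model from the previous section in action, we can run a short snippet. This is a minimal sketch: the shapes and operations are arbitrary, but it shows the same arrays being consumed by both the CPU and the GPU with no explicit copies.

import mlx.core as mx

# Two arrays allocated once in unified memory
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# The same arrays can be used by operations running on the GPU
# or the CPU without copying them between separate memory pools.
c_gpu = mx.matmul(a, b, stream=mx.gpu)
c_cpu = mx.matmul(a, b, stream=mx.cpu)

# MLX is lazy, so eval() forces both computations to actually run.
mx.eval(c_gpu, c_cpu)
print(mx.abs(c_gpu - c_cpu).max())  # differences are only floating-point rounding noise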

Running a Simple Inference

With our library set up locally, let's pick a model to run. I like to use the Phi family of models: at roughly 4B parameters they are quite small compared to the common 7B+ models, yet still perform quite well.

We can download the model and run inference on it with a single terminal command:

python -m mlx_lm.generate \
    --model microsoft/Phi-3.5-mini-instruct \
    --prompt "Who was the first president?" \
    --max-tokens 4096

To explain our command here: the built-in mlx_lm.generate entry point tells the library we'll be running inference with a language model. We pass in the model by the name it appears under on HuggingFace (microsoft/Phi-3.5-mini-instruct), the prompt we want answered, and finally the maximum number of tokens we'll allow in the response.

Image by Author – Base Model for "Who was the first president"
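
If you would rather drive this from Python than from the terminal, mlx-lm also exposes a small Python API. The sketch below is roughly equivalent to the command above; the exact signature of generate has shifted a little between mlx-lm releases, so treat it as a starting point rather than a definitive recipe.

from mlx_lm import load, generate

# Downloads the model from HuggingFace on first use, then caches it locally.
model, tokenizer = load("microsoft/Phi-3.5-mini-instruct")

response = generate(
    model,
    tokenizer,
    prompt="Who was the first president?",
    max_tokens=4096,
    verbose=True,  # prints the generation along with tokens-per-second and memory stats
)
print(response)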

Fine-Tuning

Generating Fine-Tuning Dataset

To keep our example simple but useful, we are going to fine-tune the model so that it always responds in JSON with the following schema:

{
 "context": "...", 
 "question": "...", 
 "answer": "..."
}

To use MLX for fine-tuning, we need our dataset to be in a schema that it understands. There are four formats: chat, tools, completions, and text. We're going to focus on completions so that when we prompt the model it will return its answer in JSON format. The completions format requires our training data to follow this pattern:

{
 "prompt": "...",
 "completion": "..."
}

Now that we have an idea of how to pass our data to MLX, we need a good fine-tuning dataset. I wrote the Python script below to process the squad_v2 dataset into the schema MLX expects.

from datasets import load_dataset
import json
import random

print("Loading dataset and tokenizer...")
qa_dataset = load_dataset("squad_v2")

def create_completion(context, question, answer):
    if len(answer["text"]) < 1:
        answer_text = "I Don't Know"
    else:
        answer_text = answer["text"][0]

    completion_template = {
        "context": context,
        "question": question,
        "answer": answer_text
    }

    return json.dumps(completion_template)

def process_dataset(dataset):
    processed_data = []
    for sample in dataset:
        completion = create_completion(sample['context'], sample['question'], sample['answers'])
        prompt = sample['question']
        processed_data.append({"prompt": prompt, "completion": completion})
    return processed_data

print("Processing training data...")
train_data = process_dataset(qa_dataset['train'])
print("Processing validation data...")
valid_data = process_dataset(qa_dataset['validation'])  # SQuAD v2 uses 'validation' as test set

# Combine all data for redistribution
all_data = train_data + valid_data
random.shuffle(all_data)

# Calculate new split sizes
total_size = len(all_data)
train_size = int(0.8 * total_size)
test_size = int(0.1 * total_size)
valid_size = total_size - train_size - test_size

# Split the data
new_train_data = all_data[:train_size]
new_test_data = all_data[train_size:train_size+test_size]
new_valid_data = all_data[train_size+test_size:]

# Write to JSONL files
def write_jsonl(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

print("Writing train.jsonl...")
folder_prefix = "./data/"
write_jsonl(new_train_data, folder_prefix+'train.jsonl')
print("Writing test.jsonl...")
write_jsonl(new_test_data, folder_prefix+'test.jsonl')
print("Writing valid.jsonl...")
write_jsonl(new_valid_data, folder_prefix+'valid.jsonl')

print(f"Dataset split and saved: train ({len(new_train_data)}), test ({len(new_test_data)}), valid ({len(new_valid_data)})")

# Verify file contents
def count_lines(filename):
    with open(folder_prefix+filename, 'r') as f:
        return sum(1 for _ in f)

print("nVerifying file contents:")
print(f"train.jsonl: {count_lines('train.jsonl')} lines")
print(f"test.jsonl: {count_lines('test.jsonl')} lines")
print(f"valid.jsonl: {count_lines('valid.jsonl')} lines")

Importantly, the squad_v2 dataset contains examples where the answer is unknown, and for those we specifically set the completion to "I Don't Know". This helps reduce hallucination by showing the model what to do if it doesn't know the answer given the context.
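
Before training, it's also worth reading back one of the records we just wrote to confirm the format is what we expect. A quick sketch:

import json

# Read the first training record and confirm the completion is itself valid JSON.
with open("./data/train.jsonl") as f:
    record = json.loads(next(f))

print(record["prompt"])
print(json.loads(record["completion"]))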

At the end of this step, we now have a dataset like below split into files for training, testing, and validating:

{"prompt": "...", 
 "completion": "{"context": "...", 
                 "question": "...", 
                 "answer": "..."
               }"
}

LoRA Fine-Tuning

To fine-tune, we are going to use the built-in mlx_lm.lora module within MLX. To learn more about the mathematics and theory behind LoRA, check out my blog post here.

python -m mlx_lm.lora \
    --model microsoft/Phi-3.5-mini-instruct \
    --train \
    --data ./data \
    --iters 100

Running this with the default settings, we achieve a final validation loss of 1.530, which isn't bad given that we're only updating 0.082% of the model's weights.

Image by Author – Output of Fine-Tuning

You'll notice at the end that we've saved our new LoRA weights as adapters. Adapters hold the updates to the weights that we learned during fine-tuning. We keep them in separate adapter files, rather than updating the model immediately, because we may have a bad training run or want to keep multiple fine-tunes for different tasks. To give ourselves more options, we typically store the base weights separately from the updates until we want to make the updates a permanent addition via fusing.
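
To make the adapter idea concrete, here is a small sketch of what LoRA does to a single weight matrix. The dimensions, rank, and scale below are made up for illustration: the frozen base weight W is combined with a low-rank update B @ A, and only the tiny A and B matrices live in the adapter files.

import mlx.core as mx

d, r = 3072, 8                          # hidden size and LoRA rank (illustrative values)
W = mx.random.normal((d, d))            # frozen base weight
A = mx.random.normal((r, d)) * 0.01     # trained adapter matrices
B = mx.zeros((d, r))
scale = 2.0                             # alpha / rank scaling factor

# Fusing applies the low-rank update to the base weight permanently.
W_fused = W + scale * (B @ A)

# The adapter is a tiny fraction of the layer's parameters.
print(f"adapter holds {(A.size + B.size) / W.size:.3%} of this layer's weights")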

Inferencing Our Fine-Tuned Model

Now that we have the adapters generated, let's see how to use them during inference to get better outputs. We want to verify that the outputs come out as expected: given a prompt, the model should return its answer in the JSON schema we defined earlier.

We again use the mlx_lm.generate command, only this time we pass in the additional --adapter-path parameter. This tells MLX where to find the additional weights and makes sure we use them when running inference.

python -m mlx_lm.generate \
    --model microsoft/Phi-3.5-mini-instruct \
    --adapter-path ./adapters \
    --prompt "Who was the first president?" \
    --max-tokens 4096

When we run the above command, we see that we get back a response in JSON with the keys we fine-tuned it to include.

Image by Author – Fine-Tuned Model Output for "Who was the first president"
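
You can run the same check from Python by pointing load at the adapter directory and then verifying the response actually parses as JSON. As before, this is a sketch: the adapter_path argument is how recent mlx-lm versions attach LoRA weights at load time, but double-check it against the version you have installed.

import json
from mlx_lm import load, generate

# adapter_path attaches the LoRA weights we just trained to the base model.
model, tokenizer = load(
    "microsoft/Phi-3.5-mini-instruct",
    adapter_path="./adapters",
)

response = generate(model, tokenizer, prompt="Who was the first president?", max_tokens=512)

# The fine-tuned model should now answer in our JSON schema.
parsed = json.loads(response)
print(parsed["answer"])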

Passing More Parameters to LoRA

We were fortunate that our first run got the model to follow our formatting pretty well. If we had run into more issues, we would have wanted to specify more parameters for LoRA to take into account. To do so, you create a lora_config.yaml file and pass it into the LoRA command as shown below. See an example yaml config file here.

python -m mlx_lm.lora --config lora_config.yaml
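
As a rough sketch of what that file can contain (the exact key names vary a little between mlx-lm releases, so cross-check against the example config linked above before relying on them):

# lora_config.yaml -- illustrative values only
model: microsoft/Phi-3.5-mini-instruct
train: true
data: ./data
iters: 100
batch_size: 4
learning_rate: 1e-5
lora_parameters:
  rank: 8
  scale: 20.0
  dropout: 0.0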

Quantizing

What is Quantization?

From the run above, we can see that the model was using substantial resources: it took ~17 seconds to generate each token and used about 7 gigabytes of memory at peak. While it may make sense to run a big model in some cases, here we are looking to get the most bang for our buck running an LLM locally. Consequently, we'd like the model to use less memory and run faster. Without changing the model's architecture, we can optimize by quantizing.

To understand quantizing, let me first explain how we store the model's parameters. Each parameter is a number, and in scientific computing we typically use a floating-point representation to keep our calculations as accurate as possible (to learn more about the exact layout, check out my blog post here). As you can see below, though, this requires a significant number of bits to represent each number.

Image by Author – IEEE Representation of Floating Point 32 (FP32)

Since models have billions of parameters, the size of each parameter has a significant impact on the model's total memory footprint. Additionally, floating-point calculations typically require more compute than integer calculations. These two pressures led people to experiment with new data types for storing parameters. When we quantize a model, we go from storing floats to storing integers.

Image by Author – 8 bit Integer Representation

The trade-off here is that we can do calculations faster and use less memory, but performance tends to degrade with less precise parameter values. The art is maintaining as much of the base model's performance as possible while speeding the model up and shrinking its memory footprint.
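
As a toy illustration of that trade-off (plain affine quantization on a handful of made-up values, not MLX's exact group-wise scheme): we map the floats onto a small integer grid with a scale and zero point, and the round trip back to floats loses a little precision. The memory win is the point, though: a model with roughly 4 billion parameters takes about 16 GB of weights in 32-bit floats but only around 2 GB in 4-bit integers.

import numpy as np

# A tiny block of FP32 weights (made-up values).
w = np.array([-0.41, -0.12, 0.03, 0.27, 0.55], dtype=np.float32)

bits = 4
levels = 2**bits - 1                     # 15 steps are representable with 4 bits

# Affine quantization: map [min, max] onto the integer grid.
scale = (w.max() - w.min()) / levels
zero_point = w.min()
q = np.round((w - zero_point) / scale).astype(np.uint8)

# Dequantize to see how much precision we gave up.
w_restored = q * scale + zero_point
print("quantized ints :", q)
print("restored floats:", w_restored)
print("max error      :", np.abs(w - w_restored).max())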

Quantizing Our Model

To quantize our model, we run the following command:

python -m mlx_lm.convert \
    --hf-path microsoft/Phi-3.5-mini-instruct \
    -q \
    --q-bits 4

We tell MLX to quantize the model by passing the -q flag, and then specify the number of bits per weight with the --q-bits flag.

Once this is complete, it will create a folder locally called mlx_model that stores our new quantized model. It converts all of the weights downloaded from HuggingFace into integers represented with 4 bits (one of the most aggressive reductions).
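
If you'd rather script the conversion, mlx-lm also exposes a convert helper in Python. A sketch, assuming the helper keeps the signature current releases use:

from mlx_lm import convert

# Downloads the weights, quantizes them to 4 bits, and writes the result to ./mlx_model.
convert(
    "microsoft/Phi-3.5-mini-instruct",
    mlx_path="./mlx_model",
    quantize=True,
    q_bits=4,
)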

QLoRA

Now that we have our quantized model, we can run QLoRA on it using the same training data and command we used to run LoRA. MLX is smart enough to see that if the weights are quantized it should switch over to using QLoRA.

Our terminal command looks nearly the same as before, but this time we point it at the quantized model we have locally rather than the one on HuggingFace.

python -m mlx_lm.lora \
    --model ./mlx_model \
    --train \
    --data ./data \
    --iters 100

Image by Author – Fine-Tuning output from QLoRA Run

Now we can inference our QLoRA fine-tuned model and compare:

python -m mlx_lm.generate \
    --model ./mlx_model \
    --adapter-path ./adapters \
    --prompt "Who was the first president?" \
    --max-tokens 4096

Image by Author – Fine-Tuned Model Output for "Who was the first president"

Comparing this with the original fine-tune, we can see that memory usage was significantly lower and the tokens generated per second were significantly higher. When we send this out to users, they will definitely notice the faster speed. To determine quality, we need to compare the loss between the two runs.

For the LoRA model, our validation loss at the end was 1.530 while the QLoRA model had a loss of 1.544. While it is expected that the LoRA model would have a smaller loss, the fact that the QLoRA model isn't that far away means we did a pretty good job!
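
If you want a like-for-like number on the held-out test split rather than the validation loss printed during training, mlx_lm.lora also has a test mode. A sketch (run it once per model/adapter pair; flag behavior may differ slightly between mlx-lm versions):

python -m mlx_lm.lora \
    --model ./mlx_model \
    --data ./data \
    --test \
    --adapter-path ./adapters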

Closing

In closing, this blog showed you how to fine-tune your own LLM locally using your Mac and MLX. As we see more and more computing power brought into consumer hardware, we can expect more and more training techniques to become possible. This can open the door to far more use cases for ML and help us solve more problems.

To see the full-code used for this blog, check out the GitHub repo below:

GitHub – matthewjgunton/mlx_json_lora

It's an exciting time to be building!


