Small Language Models: Using 3.8B Phi-3 and 8B Llama-3 Models on a PC and Raspberry Pi

Author: Murphy
Image by Jelleke Vanooteghem, Unsplash

Nowadays, we can observe an interesting twist in developing new AI models. For a long time, it has been known that bigger models are "smarter" and capable of doing more complex things. But they are also more computationally expensive. Big device manufacturers like Microsoft, Google, and Samsung have already started to promote new AI features to their clients, but it is clear that if millions of users massively use AI on their phones or laptops, the computational cloud costs could be enormous. What is the solution? The obvious way is to run a model on-device, which has advantages in latency (no network connection is required, and the model can be accessed immediately), privacy (no need to process user responses in the cloud), and, naturally, computation costs. Using local AI models is important not only for laptops and smartphones but also for autonomous robots, smart home assistants, and other edge devices.

At the time of writing this article, at least two models specially designed for on-device use had been announced:

  • Google's Gemini Nano. The model was announced in December 2023; it has two versions with 1.8B and 3.25B parameters. According to the developer.android.com webpage, the model will be a part of the Android OS and will be available via the AI Edge SDK. However, this model is not open and probably will not be accessible on platforms like HuggingFace.
  • Microsoft's Phi-3. The model was released in April 2024. It is a 3.8B model that is available in two context-length variants, with 4K and 128K tokens (according to Microsoft, 7B and 14B models will also be available soon). The model was optimized for NVIDIA and ONNX runtime, and it can also run on a CPU. Last but not least, the Phi-3 model is open and can be downloaded.

At the time of writing this text, Google's Gemini Nano is in the "early access preview" state and is not available for public testing. Microsoft's Phi-3 is available on HuggingFace, and we can easily use it. As a baseline, I will use an 8B Llama-3 model, which is the newest model from Meta, also released in 2024.

Methodology

I will test 3.8B and 8B language models using different prompts with increasing complexity, from "easy" to "hard":

  • Simple prompt: answering a user's simple question.
  • Text processing: summarization and drafting an answer to an incoming message.
  • Tools and agents: answering questions that require external tools.

To test the models, I will use an open-source LlamaCpp library and an open-source GenAI ONNX library from Microsoft. I will test both models on my desktop PC and Raspberry Pi, and we will be able to compare their performance and system requirements.

Let's get started!

1. Install

1.1 Raspberry Pi

The goal of this article is to test the model's performance on edge devices, and I will use a Raspberry Pi for that:

Raspberry Pi 5, Image Source Wikipedia

The Raspberry Pi is a cheap (about $100) credit card-size single-board ARM-based computer running 64-bit Linux. It has no moving parts, requires only 5V DC power, and has plenty of hardware interfaces (GPIO, Serial, I2C, SPI, HDMI), which makes the Raspberry Pi interesting for robots or smart home devices. But how well can it work with small language models? Let's figure it out.

Raspberry Pi has its own Debian-based OS made by the Raspberry Pi Foundation, which can be good for basic scenarios and home use, but I've found that the newest libraries and software packages are tricky to install. I tried to install the ONNX GenAI runtime on the Raspberry Pi OS, but the installation failed. The ONNX GenAI is a new project, and it has a lot of dependencies that do not work "out of the box." In theory, it is possible to find a way to build the newest CMake and GCC with C++20 support from the source, but in my case, it just was not worth the time. So, I decided to use the latest Ubuntu OS, which has better software support and fewer compatibility problems. Ubuntu has official support for the Raspberry Pi as well, so the installation works smoothly:

Raspberry Pi OS Installer, Image by author

The code presented in this article is cross-platform, and readers who don't have a Raspberry Pi can also test the Phi-3 and Llama-3 models on Windows, OSX, or other Linux environments.
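
Before installing anything, readers can quickly confirm the OS and CPU architecture of their environment; this is just a minimal check with the Python standard library, not specific to this project. On a Raspberry Pi 5 running 64-bit Ubuntu, the machine type should be reported as "aarch64":

import platform
import sys

# Print CPU architecture, OS, and Python version
print("Machine:", platform.machine())   # expected: aarch64 on a 64-bit Raspberry Pi
print("System: ", platform.system(), platform.release())
print("Python: ", sys.version.split()[0])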

1.2 LlamaCpp

We can use the Phi-3 and Llama-3 models with the open-source LlamaCpp-Python library. LlamaCpp is written in pure C/C++ without any dependencies, and it works on all modern architectures, including CPU, CUDA, and Apple Silicon. We can easily build it for the Raspberry Pi:

CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip3 install llama-cpp-python

When the installation is done, we also need to download both models:

pip3 install huggingface-hub
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
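
Both GGUF files are a few gigabytes in size, so before going further it is worth checking that they were downloaded completely; a small sketch, assuming the files were saved to the current directory:

from pathlib import Path

# File names from the huggingface-cli commands above
for name in ("Phi-3-mini-4k-instruct-q4.gguf",
             "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"):
    size_gb = Path(name).stat().st_size / 1024 ** 3
    print(f"{name}: {size_gb:.2f} GB")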

1.3 ONNX Generative AI

Another way to use the Phi-3 model is via Microsoft's open-source GenAI ONNX library. ONNX (Open Neural Network Exchange) is an open format designed to represent machine learning models. Microsoft has a well-written tutorial about using Phi-3 with ONNX. Alas, on a Raspberry Pi, it does not work: pip cannot find a proper installer for the ARM64 onnxruntime-genai package, and we need to build it from source. Before compiling onnxruntime-genai, we need to install the onnxruntime package and copy its library files to the source folder:

pip3 install onnxruntime numpy

wget https://github.com/microsoft/onnxruntime/releases/download/v1.17.3/onnxruntime-linux-aarch64-1.17.3.tgz
tar -xvzf onnxruntime-linux-aarch64-1.17.3.tgz

git clone https://github.com/microsoft/onnxruntime-genai.git --branch v0.2.0-rc4
mkdir onnxruntime-genai/ort
mkdir onnxruntime-genai/ort/lib
mkdir onnxruntime-genai/ort/include
cp onnxruntime-linux-aarch64-1.17.3/lib/* onnxruntime-genai/ort/lib
cp onnxruntime-linux-aarch64-1.17.3/include/* onnxruntime-genai/ort/include

cd onnxruntime-genai
python3 build.py

When the compilation is done, we can install a new wheel using pip:

pip3 install build/wheel/onnxruntime_genai-0.2.0rc4-cp312-cp312-linux_aarch64.whl
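
As before, a quick import check confirms that the wheel was built and installed correctly:

# The import fails if the wheel cannot find the onnxruntime libraries it was built against
import onnxruntime_genai as og
print("onnxruntime-genai is ready")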

As a last step, we need to download the Phi-3 ONNX model:

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
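
The command places the model files into the cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 folder; a quick check that they are in place:

from pathlib import Path

# Folder name from the huggingface-cli command above
model_dir = Path("cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
print(f"{len(list(model_dir.iterdir()))} files in {model_dir}")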

Now, all components are installed, and we are ready for testing.

2. Inference

As was written before, I will be using two libraries for running our models, LlamaCpp and ONNX; let's create Python methods for them.

Let's start with LlamaCpp:

from llama_cpp import Llama

def load_llama_model(path: str) -> Llama:
    """ Load LlamaCpp model from file """
    return Llama(
        model_path=path,
        n_gpu_layers=0,
        n_ctx=4096,
        use_mmap=False,
        echo=False
    )
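
For example, we can load the Phi-3 GGUF file downloaded in the installation step (the path here is simply the file name used above; adjust it if the file was saved elsewhere):

model = load_llama_model("Phi-3-mini-4k-instruct-q4.gguf")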

When the model is loaded, we can run the generation stream:

def llama_inference(model: Llama, prompt: str) -> str:
    """ Call a model with a prompt """
    stream = model(prompt, stream=True, max_tokens=4096, temperature=0.2)
    result = ""
    for output in stream:
        print(output['choices'][0]['text'], end="")
        result += output['choices'][0]['text']
    print()
    return result
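
With the model loaded, a single call runs the generation and streams the output to the console; here is a minimal example using the Phi-3 chat format described in the next section:

answer = llama_inference(model, "<|user|>\nWhat is the distance to the Moon?<|end|>\n<|assistant|>")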

As for ONNX, the process is generally the same, though the code is slightly longer:

import onnxruntime_genai as og

def load_onnx_model(path: str):
    """ Load the ONNX model """
    return og.Model(path)

def onnx_inference(model: og.Model, prompt: str) -> str:
    """ Run the ONNX model with a prompt """
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.try_use_cuda_graph_with_max_batch_size(1)
    search_options = {"temperature": 0.2, "max_length": 4096}
    params.set_search_options(**search_options)
    params.input_ids = tokenizer.encode(prompt)
    generator = og.Generator(model, params)

    result = ""
    tokenizer_stream = tokenizer.create_stream()
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

        new_token = generator.get_next_tokens()[0]
        new_char = tokenizer_stream.decode(new_token)
        print(new_char, end='', flush=True)
        result += new_char
    print()

    del generator
    return result
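
Usage is analogous to the LlamaCpp case; a short sketch, assuming the Phi-3 ONNX model was downloaded to the folder shown in the installation step:

onnx_model = load_onnx_model("cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
answer = onnx_inference(onnx_model, "<|user|>\nWhat is the distance to the Moon?<|end|>\n<|assistant|>")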

3. Testing The Prompts

Now, let's see how the Phi-3 model works. As a baseline, I will compare it with the Llama-3 8B model, the newest small-size model from Meta at the time of writing this article.

Before using the models, let's create helper functions for making a prompt in the required formats:

def make_phi3_prompt(question: str) -> str:
    """ Make prompt in Phi-3 format """
    return f"<|user|>\n{question} <|end|>\n<|assistant|>"

def make_llama_prompt(system: str, question: str) -> str:
    """ Make prompt in Llama-3 format """
    # String concatenation avoids the extra indentation that a triple-quoted
    # string would insert into the prompt
    return ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
            f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
            f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>")
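
To make the formats easier to see, we can print one of the generated prompts; Phi-3 uses only user/assistant markers, while Llama-3 also wraps a system message in special header tokens:

print(make_phi3_prompt("What is the distance to the Moon?"))
#> <|user|>
#> What is the distance to the Moon? <|end|>
#> <|assistant|>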

Now, let's test different prompts with the Phi and Llama models.

3.1 Basic Prompt

As a warm-up, let's ask a simple question:

#> Phi-3:
prompt = make_phi3_prompt("You are a helpful AI assistant.\n"
                          "What is the distance to the Moon?")
onnx_inference(model, prompt)

#> The average distance from Earth to the Moon is about 238,900 miles 
#> (384,400 kilometers). However, this distance varies due to the 
#> elliptical shape of the orbit.

#> Llama-3
prompt = make_llama_prompt("You are a helpful AI assistant.",
                           "What is the distance to the Moon?")
llama_inference(model, prompt)

#> The average distance from the Earth to the Moon is about 384,400 
#> kilometers (238,900 miles). This is called the "lunar distance" or "lunar
#> mean distance." However, the orbit is not a perfect circle and its 
#> distance from Earth varies slightly due to the elliptical shape of its 
#> orbit. At its closest point (called perigee), the Moon is about 356,400
#> kilometers (221,500 miles) away, and at its farthest point (apogee), 
#> it is about 405,500 kilometers (252,000 miles) away.

Both models gave correct answers, which was expected; there was nothing complex about this question.

3.2 Answering The Message

As we can see, basic prompts work well. However, mobile users will likely not ask about the distance to the Moon too often.

Tags: Data Science Editors Pick Hands On Tutorials Programming Small Language Model
