Democratizing LLMs: 4-bit Quantization for Optimal LLM Inference

Image generated by the author with DALL-E 3

Quantizing a model is a technique that involves converting the precision of the numbers used in the model from a higher precision (like 32-bit floating point) to a lower precision (like 4-bit integers). Quantization is a balance between efficiency and accuracy, as it can come at the cost of a slight decrease in the model's accuracy, as the reduction in numerical precision can affect the model's ability to represent subtle differences in data.

That has been my assumption, picked up from learning about LLMs from various sources.
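
To make that precision trade-off concrete, here is a toy sketch of the basic idea: plain symmetric round-to-nearest quantization, not the K-quant schemes llama.cpp actually uses. It maps a small float32 tensor to 4-bit integers and back, and measures the round-trip error.

import numpy as np

# A tiny "weight" tensor in float32
w = np.array([0.12, -0.53, 0.88, -0.07, 0.31, -0.99], dtype=np.float32)

# Symmetric 4-bit quantization: map values to integers in [-8, 7]
# (real schemes such as llama.cpp's K-quants use per-block scales, not one scale per tensor)
scale = np.abs(w).max() / 7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)

# Dequantize to approximate the original weights
w_hat = q.astype(np.float32) * scale

print("quantized ints:", q)
print("reconstructed: ", w_hat)
print("max round-trip error:", np.abs(w - w_hat).max())

The round-trip error is exactly the "slight decrease in accuracy" described above; block-wise scales and mixed-precision tricks in formats like Q4_K_M keep that error small.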

In this article, we will explore the detailed steps to quantize Mistral-7B-Instruct-v0.2 into a 5-bit and a 4-bit model. We will then upload the quantized models to the Hugging Face hub. Lastly, we will load the quantized models and evaluate them and the base model to find out the performance impact quantization brings to a RAG pipeline.

Does it conform to my original assumption? Read on.

Why do we quantize a model?

The benefits of quantizing a model include the following:

  • Reduced Memory Usage: Lower precision numbers require less memory, which can be crucial for deploying models on devices with limited memory resources (see the back-of-the-envelope sketch after this list).
  • Faster Computation: Lower precision calculations are generally faster. This is particularly important for real-time applications.
  • Energy Efficiency: Reduced computational and memory requirements typically lead to lower energy consumption.
  • Network Efficiency: When models are used in a cloud-based setting, smaller models with lower precision weights can be transmitted over the network more efficiently, reducing bandwidth usage.
  • Hardware Compatibility: Many specialized hardware accelerators, particularly for mobile and edge devices, are designed to handle integer computations efficiently. Quantizing models to lower precision allows them to fully utilize these hardware capabilities for optimal performance.
  • Model Privacy: Quantization can introduce noise and make model extraction more difficult, potentially enhancing model security and privacy in certain scenarios.
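
As a back-of-the-envelope illustration of the memory point above, the sketch below estimates the footprint of a roughly 7.24-billion-parameter model (Mistral-7B's approximate size) at different precisions. It ignores the extra bytes quantization formats spend on block scales and metadata, as well as the tensors K-quants keep at higher precision, so real GGUF files come out somewhat larger.

# Rough memory footprint of a ~7.24B-parameter model at different precisions
params = 7.24e9  # approximate parameter count of Mistral-7B

for name, bits in [("FP32", 32), ("FP16", 16), ("Q5_K_M (~5-bit)", 5), ("Q4_K_M (~4-bit)", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>16}: ~{gigabytes:.1f} GB")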

How do we quantize a model?

There are multiple techniques to quantize a model, such as NF4, GPTQ, and AWQ. We are going to explore quantizing Mistral-7B-Instruct-v0.2 with GGUF and llama.cpp.

GGUF

GGUF, introduced by the llama.cpp team in August 2023 as the successor to the GGML file format, is a binary file format specifically designed for storing quantized large language models. It was developed by Georgi Gerganov, the creator of llama.cpp, a C++ library for running inference with Llama and other models.

GGUF offers a compact, efficient, and user-friendly way to store quantized LLM weights. It is designed for single-file model deployment and fast inference, and it supports various LLM architectures and quantization schemes. GGUF lets users run an LLM on the CPU while offloading some of its layers to the GPU for a speed-up. It democratizes LLMs by reducing the cost of running models, simplifying model loading and saving, and making models more accessible and efficient.

llama.cpp

llama.cpp provides a lightweight and efficient C++ library for running inference with LLMs stored in GGUF format. llama.cpp's main features include cross-platform support, fast inference, easy integration, and Hugging Face compatibility.

LlamaIndex offers the LlamaCPP class for integration with the llama-cpp-python library.
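
Below is a minimal sketch of what that integration looks like, not a full RAG pipeline. The import path and exact arguments depend on your LlamaIndex and llama-cpp-python versions, and the model_path assumes the Q4_K_M file we produce later in this article.

# pip install llama-index llama-cpp-python  (import path differs in older LlamaIndex versions)
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    # point at a local GGUF file (or pass model_url to download one instead)
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    # offload as many layers to the GPU as its memory allows
    model_kwargs={"n_gpu_layers": 35},
    verbose=False,
)

# Mistral's instruct models expect the [INST] ... [/INST] template
response = llm.complete("[INST] What is GGUF? [/INST]")
print(response.text)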

Together, GGUF and llama.cpp offer a compelling combination of efficiency, performance, and user-friendliness that can significantly enhance your LLM applications.

Quantizing Mistral-7B with GGUF and llama.cpp

Inspired by Maxime Labonne's Quantize Llama models with GGUF and llama.cpp, let's explore how to use GGUF and llama.cpp to quantize Mistral-7B-Instruct-v0.2. Check out my Colab notebook for the detailed steps.

Step 1: Install llama.cpp

Let's first install llama.cpp by running the following commands:

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Step 2: Download and quantize Mistral-7B-Instruct-v0.2

First, I store my Hugging Face token in Colab's Secrets tab (the key icon in Colab's left sidebar). The benefit of storing the token there is that I don't expose it in my notebook, and I can reuse the same secret across all my Colab notebooks.

See the code snippet below to log into the Hugging Face hub:

# first, log into hugging face hub
from google.colab import userdata
from huggingface_hub import HfApi

HF_TOKEN = userdata.get("HF_TOKEN")
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

Now, let's download our base model, mistralai/Mistral-7B-Instruct-v0.2. We will quantize it with two of the dozen or so methods listed on the TheBloke/Mistral-7B-Instruct-v0.2-GGUF model card:

  • Q5_K_M: 5-bit, recommended, low quality loss.
  • Q4_K_M: 4-bit, recommended, offers balanced quality.

After downloading the base model, we convert it to FP16 (16-bit floating point), the intermediate single-file format that llama.cpp's quantize tool expects as input; FP16 keeps the model compact and inference fast while maintaining reasonable model accuracy.

Lastly, we loop through the two quantization methods to quantize our base model, calling llama.cpp/quantize for each. See the code snippet below:

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

MODEL_NAME = MODEL_ID.split('/')[-1]

# Download model
!git lfs install
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{MODEL_ID}

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model 
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

We can see from the logs that the quantized models have shrunk drastically compared to the base model, and that each quantization run took around 4 minutes:

quantization log for the 5-bit model
quantization log for the 4-bit model
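
If you want to confirm the size reduction without scrolling through the logs, a quick sketch like the one below lists the on-disk sizes of the FP16 conversion and the quantized GGUF files, assuming they live in the MODEL_NAME folder created above.

import os

# Compare on-disk sizes of the FP16 conversion and the quantized GGUF files
for file in sorted(os.listdir(MODEL_NAME)):
    if file.endswith(".bin") or file.endswith(".gguf"):
        size_gb = os.path.getsize(os.path.join(MODEL_NAME, file)) / 1e9
        print(f"{file}: {size_gb:.2f} GB")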

Step 3: Run inference to test the quantized model

Now that we have two quantized models, let's run an inference test by calling llama.cpp/main. See the code snippet below:

import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]
print("Available models: " + ", ".join(model_list))
prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    # chosen_method is already the GGUF filename selected from model_list
    qtype = f"{MODEL_NAME}/{chosen_method}"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Here is some interesting output from the inference test we ran against the 4-bit quantized model:

Step 4: Push the quantized models to the Hugging Face hub

Now, we are ready to push our quantized models to the Hugging Face hub to share with the community (and myself). This step assumes that you have already created an account on Hugging Face.

First, we log into the Hugging Face hub, then create an empty repo by invoking the create_repo function on HfApi. Finally, we upload our new GGUF files by calling api.upload_folder. See the code snippet below.

!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi
from google.colab import userdata

username = "wenqiglantz" #change to your own username

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create empty repo
api.create_repo(
    repo_id = f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)

Logging into the Hugging Face hub, we can verify that the two GGUF quantized models have been uploaded successfully under my account.

The next step is to properly populate the model card by adding a README.md file. We can mimic the README.md from the base model mistralai/Mistral-7B-Instruct-v0.2's Hugging Face repo, especially the language, tags, and license sections. See my sample README.md below; you can customize it as you prefer.

---
license: apache-2.0
pipeline_tag: text-generation
tags:
- finetuned
inference: false
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_creator: Mistral AI_
model_name: Mistral 7B Instruct v0.2
model_type: mistral
prompt_template: '[INST] {prompt} [/INST]
  '
quantized_by: wenqiglantz
---
# Mistral 7B Instruct v0.2 - GGUF

This is a quantized model for `mistralai/Mistral-7B-Instruct-v0.2`. Two quantization methods were used:
- Q5_K_M: 5-bit, preserves most of the model's performance
- Q4_K_M: 4-bit, smaller footprints, and saves more memory


## Description

This repo contains GGUF format model files for [Mistral AI_'s Mistral 7B Instruct v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).

This model was quantized in Google Colab.

Now, let's check our model on the hub:

We have successfully uploaded our quantized models for Mistral-7B-Instruct-v0.2 to the Hugging Face hub. Nice!
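
With the files on the hub, anyone can pull a quantized model down and run it locally. Here is a minimal sketch using huggingface_hub and the llama-cpp-python bindings; the repo_id and filename follow the naming pattern used above, so adjust them to your own account.

# pip install -q huggingface_hub llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the 4-bit GGUF file from the Hugging Face hub
model_path = hf_hub_download(
    repo_id="wenqiglantz/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)

# Load the model; n_gpu_layers=0 keeps everything on the CPU
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=0)

# Mistral's instruct models expect the [INST] ... [/INST] template
output = llm("[INST] Explain GGUF in one sentence. [/INST]", max_tokens=128)
print(output["choices"][0]["text"])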
