ExLlamaV2: The Fastest Library to Run LLMs

Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.

ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it's optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.

In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.

⚡ Quantize EXL2 models

To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:

git clone https://github.com/turboderp/exllamav2
pip install ./exllamav2
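
If the install went through, the conversion script we will use shortly should be available in the cloned repository. A quick sanity check (assuming the script exposes the standard argparse help) is to print its usage:

python exllamav2/convert.py -h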

Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let's use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned with Direct Preference Optimization (DPO). It claims to outperform Llama 2 70B Chat on MT-Bench, which is an impressive result for a model ten times smaller. You can try out the base Zephyr model using this space.

We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):

git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
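
If you prefer to stay in Python, you can produce a roughly equivalent Parquet file with the Hugging Face datasets library (a minimal sketch; the wget command above works just as well):

# Alternative to the wget command: export the wikitext-103 test split to Parquet
from datasets import load_dataset

test_split = load_dataset("wikitext", "wikitext-103-v1", split="test")
test_split.to_parquet("wikitext-test.parquet")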

Once it's done, we can leverage the [convert.py](https://github.com/turboderp/exllamav2/blob/master/convert.py) script provided by the ExLlamaV2 library. We're mostly concerned with four arguments:

  • -i: Path of the base model to convert in HF format (FP16).
  • -o: Path of the working directory with temporary files and final output.
  • -c: Path of the calibration dataset (in Parquet format).
  • -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store the weights in 4-bit precision.

The complete list of arguments is available on this page. Let's start the quantization process using the convert.py script with the following arguments:

mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0

Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.

Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.
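
To make "lowering the precision of the weights" more concrete, here is a minimal, illustrative sketch of group-wise round-to-nearest quantization in PyTorch. It is not ExLlamaV2's implementation: GPTQ additionally compensates each rounding error using second-order information derived from the calibration data, and the real kernels store packed integers rather than dequantized floats.

import torch

def quantize_groupwise(weights: torch.Tensor, bits: int = 4, group_size: int = 32):
    # Illustrative only: symmetric round-to-nearest with one scale per group of columns
    qmax = 2 ** (bits - 1) - 1
    out = torch.empty_like(weights)
    for start in range(0, weights.shape[1], group_size):
        group = weights[:, start:start + group_size]
        scale = group.abs().amax(dim=1, keepdim=True) / qmax  # per-row scale for this group
        q = torch.clamp(torch.round(group / scale), -qmax - 1, qmax)
        out[:, start:start + group_size] = q * scale  # dequantized approximation
    return out

w = torch.randn(8, 128)
w_q = quantize_groupwise(w, bits=4, group_size=32)
print((w - w_q).abs().mean())  # average error introduced by quantization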

So why are we using the "EXL2" format instead of the regular GPTQ format? EXL2 comes with a few new features:

  • It supports different levels of quantization: it's not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
  • It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.

ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example.

The benchmark of the different parameters it tries is saved in the measurement.json file. The following JSON shows the measurement for one layer:

"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": [
    {
        "desc": "0.05:3b/0.95:2b 32g s4",
        "bpw": 2.1878662109375,
        "total_bits": 36706304.0,
        "err": 0.011161142960190773,
        "qparams": {
            "group_size": 32,
            "bits": [
                3,
                2
            ],
            "bits_prop": [
                0.05,
                0.95
            ],
            "scale_bits": 4
        }
    },

In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
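
As a rough sanity check (this is not how ExLlamaV2 computes it internally), you can approximate the reported bpw from the qparams above: the weight bits come from the 3-bit/2-bit mix, and the per-group 4-bit scales add a small overhead; the remaining gap is extra quantization metadata.

# Back-of-the-envelope estimate of the bpw reported in the measurement above
bits, bits_prop = [3, 2], [0.05, 0.95]
group_size, scale_bits = 32, 4

weight_bpw = sum(p * b for p, b in zip(bits_prop, bits))  # 0.05*3 + 0.95*2 = 2.05
scale_bpw = scale_bits / group_size                       # 4 / 32 = 0.125
print(weight_bpw + scale_bpw)                             # ~2.175, close to the 2.188 reported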
