Quantisation and co. Reducing inference times on LLMs by 80%


Quantisation is a technique used across a range of algorithms, but it has gained prevalence with the fairly recent influx of Large Language Models (LLMs). In this article I aim to provide information on the quantisation of LLMs and the impact this technique can have on running these models locally. I'll also cover a second strategy, beyond quantisation, that can further reduce the computational requirements of running these models. I'll go on to explain why these techniques may be of interest to you and show you, with code examples, benchmarks of how effective they are. I also briefly cover hardware requirements/recommendations and the modern tools available for achieving your LLM goals on your machine. In a later article I plan to provide step-by-step instructions and code for fine-tuning your own LLM, so keep an eye out for that.

TL;DR – by quantising our LLM and changing the tensor dtype, we are able to run inference on an LLM with 2x the parameters whilst also reducing Wall time by 80%.

As always, if you wish to discuss anything I cover here please reach out.

All opinions in this article are my own. This article is not sponsored.


What is quantisation (of LLMs)?

Quantisation allows us to reduce the size of our neural networks by converting the network's weights and biases from their original floating-point format (e.g. 32-bit) to a lower precision format (e.g. 8-bit). The original floating-point format can vary depending on several factors such as the model's architecture and training process. The ultimate purpose of quantisation is to reduce the size of our model, thereby reducing the memory and computational requirements for running inference and training. Quantisation can very quickly become fiddly if you attempt to quantise models yourself, largely because of a lack of hardware support from particular vendors. Thankfully this can be bypassed through the use of specific 3rd party services and software.
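To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantisation of a single weight tensor in plain PyTorch. This is purely illustrative (it is not the scheme used by any particular library): we pick a scale so the largest absolute weight maps to 127, round to 8-bit integers, and keep the scale so the original values can be approximately recovered.

import torch

def quantise_8bit(weights: torch.Tensor):
    # Choose a scale so the largest absolute weight maps to 127
    scale = weights.abs().max() / 127
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor):
    # Approximately reconstruct the original float weights
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)            # stand-in for a layer's weight matrix
q, scale = quantise_8bit(w)
w_hat = dequantise(q, scale)
print((w - w_hat).abs().max())   # small quantisation error

The int8 tensor takes a quarter of the memory of the float32 original; the price is the small reconstruction error printed at the end.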

Personally I have had to jump through a fair few hoops to quantise LLMs such as Meta's Llama-2 on my Mac. This largely comes down to the lack of support for standard libraries (or anything relying on custom CUDA kernels). Third-party tools such as Optimum and ONNX do, however, exist to make our lives a little easier.

The quick and easy option is to download any of the pre-quantised models that are available on HuggingFace (HF). I specifically wish to shout out TheBloke for providing quantised versions of a whole host of popular LLMs, including the Llama-2 models I'll be demonstrating in this article. Instructions on how to run inference on each model can be found on the respective model cards.


If you want to run quantised models yourself and don't have access to your own GPU, I'd suggest renting NVIDIA hardware on either of the following sites:

· Runpod.io

· Lambdalabs.com

· Vast.ai – DISCLAIMER – use at your own discretion. Here you are essentially renting a random person's GPU. I'd suggest not sharing any sensitive information when using this service. It is however very cheap.

If you wish to buy NVIDIA hardware and want the best bang for buck, I would suggest buying two used RTX 3090s. Whilst the newer RTX 4090 has better benchmark performance, LLM inference depends more on memory read/write speed (bandwidth) than on raw processor speed. There is not a huge difference in memory bandwidth between the 3090 and 4090 so, in my opinion, the older model provides better value.

If you have cash to spend, the sky is your limit.

As free options I'd suggest:

· Google Colab – offers free GPUs at runtime with certain restrictions (RAM is also restricted in the free tier, though you can pay for more)

· Kaggle also offers GPUs within its notebooks.

If you insist on using Mac hardware, my suggestion would be the M2 Ultra with as much RAM as you can afford (ideally 64GB+). This will still be slower than the above NVIDIA options but is definitely viable if you wish to just run inference on LLMs rather than training your own. If you are having trouble quantising your own models on Mac hardware, I can only recommend Georgi Gerganov's llama.cpp. Using this repo you can download Meta's Llama 2 models, compile the C++ tooling, quantise the weights to 4-bit precision and then run inference on the result. The README of the repo gives clear instructions on how to do this.


So why on earth do we want to run/host our own LLMs locally?

The short answer is, as always, it depends. As of writing this article, OpenAI's GPT-4 (available via ChatGPT) is widely considered the best-performing LLM available. The pricing, I would argue, is also very reasonable, and the model itself is no doubt easier to interact with than the strategies I alluded to above. The only dependencies you need to install are your account info and credit card number ;).

I do however believe a strong case can be made for running your own LLM locally:

Asking questions about proprietary documents/data. You have the ability to fine-tune your own LLM using your own contexts and data. By doing this yourself you are not sharing any of this information with 3rd parties, which is a huge plus.

Asking questions about topics after the September 2021 knowledge cut-off (GPT-4). I have seen some cases of GPT-4 providing detail on topics after this period; however, the model frequently states that its knowledge cut-off applies.

Fine-tune a model to solve problems that are specific to your scenario. Again, this links to the first point: you can tune your own LLM to suit your needs.

You get to see how these LLMs work under the hood. You can inspect the model architecture and further develop your understanding of the technology.

It's free (provided you have your own hardware already and don't count the electricity to run it).

Quantisation will ultimately aid you in running your own LLM locally by using less computational resources than if you were to run inference on the un-quantised model.


Benchmark comparison: Llama-2

I will now demonstrate the effect of quantisation on Meta's Llama-2 7B and 13B models. I ran these experiments on a rented GPU as described above, but have also tested them in a Google Colab notebook to confirm the results are reproducible. The only edit we have to make is to run an 8-bit quantised version of the 7B parameter model as our baseline in the Colab notebook, otherwise it exceeds memory limits when running inference (which in my eyes already makes a perfect case for using quantisation when running LLMs!). Feel free to follow along though – the code examples are pulled directly from the free version of my Colab notebook.

If you are using the Colab notebook – when installing dependencies such as accelerate and bitsandbytes, use regular pip installs within the notebook. Once installed, restart the runtime, otherwise the packages will not be recognised. Also don't forget to change your runtime to GPU by selecting Runtime > Change runtime type > T4 GPU.
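For reference, a typical install cell looks something like the one below (the exact package list is an assumption on my part; install whatever the error messages ask for), followed by a runtime restart:

# Run in a Colab cell, then Runtime > Restart runtime
!pip install -q transformers accelerate bitsandbytes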

I should add, a pre-requisite for this is to have been granted access to the models by Meta and HF. In order to do so you must first submit a request form to Meta via this link:

https://ai.meta.com/resources/models-and-libraries/llama-downloads/

It can take anywhere from 2 minutes to 2 days to receive confirmation of access. Please note the email addresses you use for the Meta form and your HF account must match in order to use the models via the HF API.

Once confirmation has been received, you can log in to Hugging Face and start working with the models.
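One convenient way to authenticate from a notebook (an assumption on my part; you can equally pass your token directly to the API calls, as I do later) is the huggingface_hub helper:

from huggingface_hub import notebook_login

# Opens a widget where you paste the access token from your HF account settings
notebook_login()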

Let's begin.


Inference on Llama2–7B base with 8-bit quantisation.

Let's first handle our imports – at this stage if you get any error messages just run the pip installs as needed – don't forget to restart your runtime once installed (as above).

Python">from transformers import AutoModelForCausalLM,AutoTokenizer
import torch
from accelerate import Accelerator

Next we copy the model name from Hugging Face so the API can download the relevant one. We also need to enter our HF access token. This can be found by selecting your profile in the top right of the HF site > Settings > Access Tokens > either generate a token or copy your existing one.

model_name = "meta-llama/Llama-2-7b-hf"
hf_key = "insertyourkeyhere"

Now let's download the model:

model = AutoModelForCausalLM.from_pretrained(model_name, device_map=0, load_in_8bit=True, token=hf_key)

Here we use the device_map argument to select the hardware we wish to use; in this case it selects the first GPU available. It is possible to pass custom device_maps here but that falls outside the scope of this article. Note also the load_in_8bit argument. This is the quantisation step we are taking to reduce the memory requirements of running inference. If you are looking to build bigger projects/products using LLMs, this simple technique can be useful for model deployment on devices with limited resources (edge devices, mobile phones etc.).
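Depending on your version of transformers, the same 8-bit loading can also be expressed through a quantisation config object, which I find slightly more explicit. This is just a sketch of the alternative spelling, not the exact code used for the benchmarks below:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=0,                   # first available GPU
    quantization_config=bnb_config, # replaces load_in_8bit=True
    token=hf_key,
)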

Next we set-up our tokeniser:

tokeniser = AutoTokenizer.from_pretrained(model_name, token=hf_key)
prompt = "A great hobby to have is "
toks = tokeniser(prompt, return_tensors="pt")

Enter whichever prompt you wish. The base model we are using is trained for text completion.

Now let's run inference on our tokenised prompt. Feel free to review the HF documentation if any of the syntax is new to you. Essentially we are unpacking the contents of our toks object and passing it to our GPU. The output is restricted to a maximum of 15 tokens (you can edit this parameter if you wish). The model.generate() method is used to generate our output using our Llama2 model. Once done, we transfer the output to CPU memory again so we can view our output.

%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res
# OUTPUT
# CPU times: user 7.47 s, sys: 1.17 s, total: 8.64 s
# Wall time: 16.4 s

Let's break down these timing metrics to better understand what we are seeing. The CPU time is broken down into 3 main components:

  1. user – this represents the time spent in user-mode code, or in other words, the time it takes the CPU to execute our Python code. In this case it took 7.47 seconds. This metric is often also referred to as user time.
  2. sys – this represents the amount of CPU time spent in system calls or kernel-mode code, i.e. the time the CPU spends executing operating system-related tasks on behalf of our Python code. In our case it's 1.17 seconds.
  3. total – the sum of our user and sys times.

Next is the Wall time. This refers to the amount of 'real-world' time it took to run our block of code.

The discrepancy between CPU times and Wall time (7.76 seconds) is due to the other memory intensive operations involved with running inference on our model. These include but are not limited to GPU memory transfers, scheduling, I/O operations etc.
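If the distinction between CPU time and wall time is new to you, the standard library makes it easy to see: process_time only counts time the CPU spends in our process, while perf_counter measures elapsed real-world time. A toy example, unrelated to the model above:

import time

cpu_start, wall_start = time.process_time(), time.perf_counter()
time.sleep(2)                      # the CPU is idle while we sleep
cpu_end, wall_end = time.process_time(), time.perf_counter()

print(f"CPU time:  {cpu_end - cpu_start:.3f} s")   # close to 0
print(f"Wall time: {wall_end - wall_start:.3f} s") # close to 2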

Let's decode our result to see the output of the model:

tokeniser.batch_decode(res)
# OUTPUT
# [' A great hobby to have is 3D printing.\nYou can make whatever you want in 3D']

Awesome. We've successfully run inference on a base quantised LLM.


A further technique we can use to speed up inference fairly drastically is to assign a different dtype to the tensors used within our Llama2 model during computation. Where we previously quantised the model's parameters using the load_in_8bit=True argument, we will now use the torch_dtype=torch.bfloat16 argument to reduce the memory usage of our model during inference. This 2nd method is not considered a quantisation technique, as it only changes the data type used by our model's tensors, whereas the first reduces the precision of the model's parameters to 8 bits during loading.

Both are considered effective techniques for reducing the computational requirements of running our LLMs. Let's see just how effective the 2nd technique is.

Let's update our model with the new parameters:

model = AutoModelForCausalLM.from_pretrained(model_name, device_map=0, torch_dtype=torch.bfloat16, token=hf_key)
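If you want to confirm the dtype actually changed, and get a rough idea of what it buys you in memory, the following quick checks should work (get_memory_footprint availability may depend on your transformers version):

print(next(model.parameters()).dtype)        # expect torch.bfloat16
print(model.get_memory_footprint() / 1e9)    # approximate size in GB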

At this stage colab may complain saying you have run out of memory. Simply restart the runtime by selecting Runtime > Restart runtime and re-run all relevant cells within the notebook.

Now we run inference on our model with updated tensor dtypes:

%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res
# OUTPUT
# CPU times: user 1.65 s, sys: 440 ms, total: 2.09 s
# Wall time: 4.7 s

Wow. So by adjusting the tensor dtypes, we reduced our total CPU time by 6.55 seconds (from 8.64 s down to 2.09 s). Our Wall time was reduced by ~71%. Let's decode our output to see if we notice any impact of the changed dtype:

tokeniser.batch_decode(res)
# OUTPUT
# [" A great hobby to have is 3D printing. It's a fun way to create new things,"]

There are a range of metrics and tests we can use to evaluate and compare the outputs of our model. In this article I will simply employ human evaluation. Both outputs are passable, coherent and relevant. Considering the 71% reduction in wall time in our 2nd example, I'd say our techniques so far were a success.
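If you wanted something more quantitative than human evaluation, perplexity on a held-out piece of text is a common starting point. A minimal sketch using the model and tokeniser already loaded above (the sample text is, of course, made up):

import torch

text = "3D printing is a popular hobby among makers."
enc = tokeniser(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the LM loss
    out = model(**enc, labels=enc["input_ids"])

print(torch.exp(out.loss).item())  # lower perplexity = better fit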

Let's see how quickly we can run inference on a pre-quantised Llama2–7B model.


Inference on pre-quantised Llama2–7B with updated tensor dtypes.

Courtesy of TheBloke we are able to access pre-quantised versions of Meta's Llama-2 models. Details on the quantisation process can be found on the model card.

We will use the same tensor dtype technique that provided us with the impressive reduction in wall time. This time with the pre-quantised model.

Let's update the model:

model_name = 'TheBloke/Llama-2-7b-Chat-GPTQ'

The GPTQ suffix at the end of the name indicates the quantisation already performed on the model.
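Note that loading GPTQ checkpoints through transformers typically requires additional packages. At the time of writing that meant something like the cell below, though the exact requirements may have changed since:

!pip install -q optimum auto-gptq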

Now we download the model with the updated tensor dtypes:

model = AutoModelForCausalLM.from_pretrained(model_name, device_map=0, torch_dtype=torch.float16)

Update the tokeniser:

tokeniser = AutoTokenizer.from_pretrained(model_name, token=hf_key)
prompt = "A great hobby to have is "
toks = tokeniser(prompt, return_tensors="pt")

Run inference:

%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res
# OUTPUT
# CPU times: user 1.44 s, sys: 351 ms, total: 1.79 s
# Wall time: 4.33 s

We've made even further improvements. As you can see, the total CPU time is reduced by ~14% and the Wall time by ~8% compared with the previous run.

Let's check the output:

tokeniser.batch_decode(res)
# OUTPUT
# [' A great hobby to have is 3D printing.\n 3D printing is a fascinating hob']

It is fairly clear the final word has been trimmed because our token limit is set to 15. I re-ran with a higher token limit and the final word completed as 'hobby'. In terms of human validation I still give this a pass.
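If you want to check that for yourself, re-running generation with a slightly higher limit is all it takes (the value 30 is arbitrary):

res = model.generate(**toks.to("cuda"), max_new_tokens=30).to("cpu")
print(tokeniser.batch_decode(res))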

Now let's combine everything we've learned and run inference on the larger Llama-2–13B model. This model has almost 2x the number of parameters of the models we've tested so far. We'll benchmark the outcome against the first model we ran (the base Llama-2–7B with 8-bit quantisation) and see how the two compare.


Inference on pre-quantised Llama2–13B with updated tensor dtypes.

We'll use all the same syntax but update the model name of course.

model_name = 'TheBloke/Llama-2-13B-GPTQ'

Download model with updated tensor dtypes:

model = AutoModelForCausalLM.from_pretrained(model_name, device_map=0, torch_dtype=torch.float16)

Update tokenizer:

tokeniser = AutoTokenizer.from_pretrained(model_name, token=hf_key)
prompt = "A great hobby to have is "
toks = tokeniser(prompt, return_tensors="pt")

Run inference:

%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res
# OUTPUT
# CPU times: user 1.45 s, sys: 167 ms, total: 1.61 s
# Wall time: 3.22 s

Let's put this into context:

Inference times: Meta Llama-2-7B (8-bit quantisation) vs. pre-quantised Llama-2-13B with float16 tensors

· Llama-2-7B, 8-bit quantisation: total CPU time 8.64 s, Wall time 16.4 s

· Llama-2-13B, GPTQ + float16: total CPU time 1.61 s, Wall time 3.22 s

We've almost doubled the number of parameters (from 7B to 13B). We've reduced the total CPU time by 81% and the Wall time by 80%. I won't lie, I'm pretty happy with this outcome.
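For completeness, those percentages come straight from the timings recorded above; it is just a couple of lines of arithmetic:

cpu_7b, wall_7b = 8.64, 16.4     # Llama-2-7B, 8-bit quantisation
cpu_13b, wall_13b = 1.61, 3.22   # Llama-2-13B, GPTQ + float16

print(f"CPU time reduction:  {(1 - cpu_13b / cpu_7b):.0%}")   # ~81%
print(f"Wall time reduction: {(1 - wall_13b / wall_7b):.0%}") # ~80%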

Let's get the output:

tokeniser.batch_decode(res)
# OUTPUT
# [' A great hobby to have is 3D printing. It is a great way to create things that you want']

Not only have we considerably reduced inference times by reducing computational requirements, I would argue the output of the 13B model is also more coherent than that of the first 7B model on which we ran inference.


I hope this article has shown you how effective these techniques are in drastically reducing inference times on these LLMs. In our first example, it wasn't even possible to load the model in our notebook without first applying our own quantisation. Essentially, by using these techniques we are able to deploy a much larger LLM (in terms of parameters), decrease inference times by circa 80% and improve the output. If this isn't a positive outcome I don't know what is!

I'm happy to discuss and exchange ideas on any of the topics covered here.


All images belong to the author unless otherwise stated.
