SW/HW Co-optimization Strategy for Large Language Models (LLMs)


Leading Large Language Models (LLMs) such as ChatGPT and Llama are revolutionizing the tech industry and impacting everyone's lives. However, their cost poses a significant hurdle: applications built on the OpenAI APIs incur substantial expenses for continuous operation ($0.03 per 1,000 prompt tokens and $0.06 per 1,000 sampled tokens).

To cut costs, companies tend to host their own LLMs, with expenses varying widely based on model size (larger LLMs with 100–200B parameters can cost roughly 10 times more than smaller ones with 7–15B parameters). This trend has spurred the AI chip race, as major tech companies aim to develop their own AI chips to reduce reliance on expensive hardware.

Trend of model size. Source: AWS re:Invent

How can we squeeze every bit of computing power out of the hardware to run LLMs? In this article, I am going to do a thorough analysis of LLM optimization strategies across models, software, and hardware. It follows the AI SW/HW co-design methodology I described in a previous article, with a much more in-depth discussion of LLM-specific cost and performance optimization.

How to co-design software/hardware architecture for AI/ML in a new era?

Source: made by author and other colleagues

The compute and memory demands of running LLMs are growing exponentially, while hardware compute and memory capabilities are improving on a much slower trajectory, as depicted in the image above. To bridge this performance gap, it's crucial to explore enhancements in three key areas:

  1. Algorithmic Improvement and Model Compression: How can we augment models with features to reduce compute and memory demands without compromising quality? What are the latest advancements in LLM quantization technology that reduce model size while maintaining quality?
  2. Efficient SW Stack and Acceleration Libraries: What considerations are vital in constructing a software stack that seamlessly connects AI models and hardware? How can we expose hardware features to optimize LLM acceleration? What are the prevailing software challenges and potential enhancements?
  3. Powerful AI HW Acceleration and Advanced Memory Hierarchy: What are the contemporary hardware accelerators tailored for LLMs? How can we alleviate the high memory demands through potential advancements in memory hierarchy?

I am going to write one article for each of the above topics. Let's dive into the first one (Algorithmic Improvement and Model Compression) in this post!

LLMs are based on the transformer architecture. There are decoder-only models such as Llama and ChatGPT, and encoder-decoder models such as Whisper and T5, with new models emerging every day. In this post, we focus on the four features below to accelerate transformer performance:

1. Quantization

Converting FP32 models to INT8 ideally shrinks memory size by approximately 4x, while INT4 quantization achieves around an 8x model size reduction. Moreover, computation costs decrease significantly, as integer matrix multiplication is faster than floating-point computation. There are two quantization categories: post-training quantization (PTQ) and quantization-aware training (QAT). For inference, PTQ is recommended. Hugging Face hosts a multitude of quantized LLM models produced with diverse quantization methods such as GPTQ, GGUF, and AWQ.

Model size reduction through quantization. Source: https://huggingface.co/TheBloke
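As a concrete illustration, here is a minimal sketch of loading a model with 4-bit weight quantization through Hugging Face Transformers and bitsandbytes (a PTQ-style approach distinct from GPTQ/AWQ). The model name and configuration options are illustrative assumptions, not a prescription.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical choice; any causal LM works

# 4-bit weight quantization (weights stored in 4 bits, matmuls computed in fp16),
# roughly an 8x reduction versus FP32 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```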

2. Attention Mechanism

The scaled dot-product attention is notably compute-intensive, involving multiple matrix multiplications of keys, queries, and values. In multi-head attention, numerous attention layers (referred to as heads) are present, each generating outputs that are concatenated together.

An illustration of the scaled dot-product attention (left) and multi-head attention (right), which is simply multiple SDPA heads in parallel. Source: Attention Is All You Need [Ref 1]
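To make the computation concrete, below is a minimal sketch of scaled dot-product attention and head concatenation in PyTorch; the tensor shapes are illustrative assumptions, and real implementations fuse these steps.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [B, H, S, S]
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                         # [B, H, S, head_dim]

# Multi-head attention = several SDPA heads in parallel, outputs concatenated.
B, H, S, D = 2, 8, 16, 64                       # illustrative sizes
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)
out = scaled_dot_product_attention(q, k, v)     # [2, 8, 16, 64]
out = out.transpose(1, 2).reshape(B, S, H * D)  # concatenate heads -> [2, 16, 512]
```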

For optimized attention inference, multi-query attention was introduced (Ref [2], Fast Transformer Decoding). In this approach, a single set of keys and values is shared across all attention heads, so new key-value pairs do not need to be fetched for every head, minimizing memory transactions.

Additionally, an intermediate mechanism called grouped-query attention sits between multi-head and multi-query attention: query heads are divided into groups, and each group shares one key-value projection, rather than the single key-value projection used across all heads in multi-query attention. This effectively reduces memory requirements while maintaining model quality.

A comparison of different attention mechanisms. Source: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [Ref 3]
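The sketch below shows the key idea under simplifying assumptions: in grouped-query attention several query heads share one key/value head, and multi-query attention is the special case of a single KV head. The shapes and the `repeat_interleave` broadcast trick are illustrative, not any specific library's implementation.

```python
import torch

B, S, D = 2, 16, 64           # batch, sequence length, head dim (illustrative)
n_q_heads, n_kv_heads = 8, 2  # grouped-query attention: 4 query heads per KV head
                              # (n_kv_heads = 1 would be multi-query attention)

q = torch.randn(B, n_q_heads, S, D)
k = torch.randn(B, n_kv_heads, S, D)   # far fewer KV heads -> much smaller KV cache
v = torch.randn(B, n_kv_heads, S, D)

# Broadcast each KV head across its group of query heads before attention.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)  # [B, 8, S, D]
v = v.repeat_interleave(group, dim=1)

out = torch.nn.functional.scaled_dot_product_attention(q, k, v)  # [B, 8, S, D]
```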

Flash Attention (Ref [4]). Unlike the conventional approach of computing the attention operations one at a time and writing intermediate results back to memory, Flash Attention employs tiling to fuse these operations and carry each tile through to the final result in a single kernel. The tile size is chosen with the system memory hierarchy in mind, minimizing IO between fast on-chip SRAM and slower HBM. The figure below demonstrates the concept and latency improvements of Flash Attention compared to PyTorch's native implementation.

The tiled Flash Attention computation pattern and the memory hierarchy on a 40 GB GPU. Source: Flash Attention: Fast and Memory-Efficient Exact Attention with IO-Awareness
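PyTorch 2.x exposes fused attention kernels through `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention-style implementation on supported GPUs. The sketch below is an assumption-laden example of requesting that backend (the backend-selection API varies across PyTorch versions); it is not FlashAttention's own API.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 16, 4096, 64  # long sequence to stress attention memory
device = "cuda"
q = torch.randn(B, H, S, D, device=device, dtype=torch.float16)
k = torch.randn(B, H, S, D, device=device, dtype=torch.float16)
v = torch.randn(B, H, S, D, device=device, dtype=torch.float16)

# Ask PyTorch to prefer the fused FlashAttention-style kernel, which tiles the
# computation so the full S x S attention matrix is never materialized in HBM.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```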

3. Paged KV Cache

The key-value cache can become substantial when there are many input and output tokens, and its dynamic length leads to inefficient memory use through fragmentation and redundant duplication. Drawing inspiration from the virtual memory mechanism in operating systems, Paged Attention aims to minimize redundancy in KV cache memory and facilitate flexible sharing of the KV cache within and across requests.

Left: Parameters (gray) persist in memory, while the KV cache (red) is allocated per serving request. Right: vLLM slows the growth of KV cache memory usage, boosting system throughput. Source: Efficient Memory Management for Large Language Model Serving with PagedAttention [Ref 5]
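As a usage-level illustration, vLLM (which implements PagedAttention) can be driven roughly as sketched below; the model name and sampling parameters are placeholders, and the exact API may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks under the hood,
# so many concurrent requests can share GPU memory without fragmentation.
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model choice
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain paged KV caching in one sentence.",
    "Why does KV cache fragmentation hurt throughput?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```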

4. Speculative Sampling [Ref 6]

In autoregressive generation, producing each token requires a full forward pass of the model, which means repeatedly loading all the weights and is therefore time-consuming. Speculative sampling pairs a small draft model with the large target model: the draft model cheaply proposes several candidate tokens, and the large model verifies them in a single forward pass, accepting those consistent with its own distribution. The result is output quality akin to the large model at speeds closer to the small one.

Significant speed up of speculative decoding with AWQ engine. Source: In the Fast Lane! Speculative Decoding – 10x Larger Model, No Extra Cost
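Hugging Face Transformers offers assisted generation, a form of speculative decoding in which a small draft model proposes tokens that the large model verifies in parallel. The sketch below assumes a hypothetical pairing of a large target model with a smaller draft model from the same family (they must share a tokenizer).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "facebook/opt-6.7b"  # large target model (illustrative choice)
draft_id = "facebook/opt-125m"   # small draft model sharing the same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)

# The draft model proposes several tokens per step; the target model scores them
# in a single forward pass and keeps the longest accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```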

Beyond the aforementioned four major inference acceleration techniques from an algorithm and model perspective, numerous other features exist to expedite LLM model inference. These include model/tensor parallelism, model sparsity, knowledge distillation, and more, with new research emerging regularly. Leveraging these techniques is crucial to accelerate LLM solutions.

It's essential to note that optimizing AI workloads always involves a synergy of model, software, and hardware considerations. In upcoming posts, we'll dive into the software stack/libraries and hardware architecture aspects of LLM acceleration. Please stay tuned!

Reference

[1] Ashish Vaswani et al., Attention Is All You Need, NIPS 2017, Long Beach, CA

[2] Noam Shazeer, Fast Transformer Decoding: One Write-Head is All You Need, arXiv, 2019

[3] Joshua Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, arXiv, 2023

[4] Tri Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, arXiv, 2022

[5] Woosuk Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, arXiv, 2023

[6] Charlie Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling, arXiv, 2023
