Optimizing LLMs with C, and Running GPT, Llama, and Whisper on Your Laptop

Table of contents
- Implementing a simple mathematical function
  1.1 The definition of a context
  1.2 Initialising tensors
  1.3 Forward computation and computation graph
  1.4 Compilation and run
- Final remarks on this first part
- Support my writing
Large language models (LLMs) are hyped everywhere. Newspapers spend tons of words describing a new incoming world, assuring us that "AI has finally arrived". Although LLMs are having a tangible impact on our lives, we should stay calm and analyse the whole situation critically. The LLM hype reminds me of the hype around "data scientist" jobs a few years ago. Back in 2014, when I started my PhD, I saw a steady increase in data scientist job openings, which peaked around 2018. At that time the news was similarly breathless: "Data scientist: the $1M profession" or "The sexiest job of the 21st century" – don't these titles sound familiar today?
On one side, LLMs are a GREAT technology and a step towards a more general AI framework. These models are the starting point for a deeper journey into AI, and I am sure that one day most apps and technologies will rely on them. However, I often see, on Medium too, a lack of clarity about these models. Regardless of their power and fantastic results, these models are too heavy to be easily trained or run. Companies therefore need to understand LLMs very well before making any strategic business move. One of the most poignant points is the huge memory cost of these models, the large infrastructure they need for training, and the costly infrastructure they need for inference.
If we look at the basic LLM structure, namely the transformer, we can recognise the classical encoder-decoder structure. At inference time, the decoder needs an in-memory mechanism to establish how much attention score to give to a specific input token. This score depends on the token's position in the sentence, as well as on its agreement with the rest of the context. Such a mechanism is known as the KV cache. Given the size of this matrix, it's very easy to eat up 3 TB of memory for simple models with a context length of 2048. To further speed up this calculation we need to run things on GPUs. Finally, the entire decoder structure is hard to parallelise.
Given this prelude, is it possible to find a compromise, a tradeoff solution, that allows us to run these calculations on simpler infrastructure? This article shows you how Georgi Gerganov implemented a new optimised C-based tensor library called ggml. The commit I am referring to throughout the text is 0f4e99b, which dates back to September 2022, at the beginning of the ggml adventure. The rationale is to start from this early code base, to give you a strong understanding of the entire package.
Implementing a simple mathematical function
Before jumping to LLMs, which will come in a second article soon, let's decompose the key elements of the library by computing values for a very simple function: f = ax².
The definition of a context
Everything in ggml starts within a context. The context defines the memory requirements needed to fit all the tensors of a given model, and it is created starting from a global state:
// global state
struct ggml_state {
    struct ggml_context_container contexts[GGML_MAX_CONTEXTS];
};

struct ggml_state g_state;
The global state is built from a ggml_context_container, which is:
struct ggml_context_container {
    bool used;

    struct ggml_context context;
};
In the container we can notice the presence of the central element of this first version of ggml, which is the ggml_context:
struct ggml_context {
    size_t mem_size;
    void * mem_buffer;
    bool   mem_buffer_owned;

    int n_objects;

    struct ggml_object * objects_begin;
    struct ggml_object * objects_end;
};
ggml_context contains all the information about how much memory we can use, as well as a memory buffer, so that there is sufficient room even when we don't know in advance how many bytes a tensor will occupy.
The context is then used to initialise the entire process. ggml_init
starts up the initialisation process and returns:
*ctx = (struct ggml_context) {
    .mem_size         = params.mem_size,
    .mem_buffer       = params.mem_buffer ? params.mem_buffer : malloc(params.mem_size),
    .mem_buffer_owned = params.mem_buffer ? false : true,
    .n_objects        = 0,
    .objects_begin    = NULL,
    .objects_end      = NULL,
};
*ctx is a new context pointer. We can inspect its fields by adding GGML_PRINT statements within the source code, for example:
GGML_PRINT("%s: context %p with %zu bytes of memoryn", __func__, (void *) ctx, ctx->mem_size);
GGML_PRINT("%s: context %p with %d objectsn", __func__, (void *) ctx, ctx->n_objects);
GGML_PRINT("%s: context %p with %p object starting positionn", __func__, (void *) ctx, (void *) ctx->objects_begin);
GGML_PRINT("%s: context %p with %p object ending positionn", __func__, (void *) ctx, (void *) ctx->objects_end);
On my Apple MacBook M2 Pro, the context is initialised with 16 MB of memory, 0 objects, and a fresh memory layout, with address 0x0 for both objects_begin and objects_end. These two fields are indeed the next step: they are the memory addresses used when creating tensors within the ggml_context.
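To make this concrete, here is a minimal sketch of my own for creating and releasing a context, assuming only ggml_init, ggml_free and the two ggml_init_params fields shown above:

#include "ggml/ggml.h"

int main(void) {
    // reserve a 16 MB pool; passing NULL lets ggml allocate and own the buffer
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };
    struct ggml_context * ctx = ggml_init(params);
    // ... create tensors and build graphs here ...
    ggml_free(ctx);   // releases the buffer, because mem_buffer_owned is true
    return 0;
}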
Initialising tensors
For every function within ggml you will always find a protocol implementation, for example:
function_with_attribute
→ function_general
→ function_implementation
function_with_attribute is a function with a specific task, for example ggml_new_tensor_1d or ggml_new_tensor_2d, which generate a 1D or 2D tensor respectively. This specific function calls a function_general, which is the general layout of the implementation: for example, ggml_new_tensor_Xd will call ggml_new_tensor. Finally, function_general calls the implementation, function_implementation. In this way, every time we need to modify the code we just act on the implementation, rather than modifying all the specific functions.
To create a 1D tensor we can use ggml_new_tensor_1d. Following the implementation protocol, ggml_new_tensor_1d is coded as follows:
struct ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum ggml_type type,
        int n_dims,
        const int* ne,
        void* data) {
    // always insert objects at the end of the context's memory pool
    struct ggml_object * obj_cur = ctx->objects_end;

    const size_t cur_offset = obj_cur == NULL ? 0 : obj_cur->offset;
    const size_t cur_size   = obj_cur == NULL ? 0 : obj_cur->size;
    const size_t cur_end    = cur_offset + cur_size;

    size_t size_needed = 0;

    if (data == NULL) {
        size_needed += GGML_TYPE_SIZE[type];
        GGML_PRINT("Size needed %zu ", size_needed);
        for (int i = 0; i < n_dims; i++) {
            size_needed *= ne[i];
        }
        // align to GGML_MEM_ALIGN
        size_needed = ((size_needed + GGML_MEM_ALIGN - 1)/GGML_MEM_ALIGN)*GGML_MEM_ALIGN;
    }
    size_needed += sizeof(struct ggml_tensor);

    if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
        GGML_PRINT("\n%s: not enough space in the context's memory pool\n", __func__);
        assert(false);
        return NULL;
    }

    char * const mem_buffer = ctx->mem_buffer;

    struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);

    *obj_new = (struct ggml_object) {
        .offset = cur_end + GGML_OBJECT_SIZE,
        .size   = size_needed,
        .next   = NULL,
    };

    if (obj_cur != NULL) {
        obj_cur->next = obj_new;
    } else {
        // this is the first object in this context
        ctx->objects_begin = obj_new;
    }

    ctx->objects_end = obj_new;

    struct ggml_tensor * const result = (struct ggml_tensor *)(mem_buffer + obj_new->offset);

    ggml_assert_aligned(result);

    *result = (struct ggml_tensor) {
        /*.type         =*/ type,
        /*.n_dims       =*/ n_dims,
        /*.ne           =*/ { 1, 1, 1, 1 },
        /*.nb           =*/ { 0, 0, 0, 0 },
        /*.op           =*/ GGML_OP_NONE,
        /*.is_param     =*/ false,
        /*.grad         =*/ NULL,
        /*.src0         =*/ NULL,
        /*.src1         =*/ NULL,
        /*.n_tasks      =*/ 0,
        /*.perf_runs    =*/ 0,
        /*.perf_cycles  =*/ 0,
        /*.perf_time_us =*/ 0,
        /*.data         =*/ data == NULL ? (void *)(result + 1) : data,
        /*.pad          =*/ { 0 },
    };

    ggml_assert_aligned(result->data);

    for (int i = 0; i < n_dims; i++) {
        result->ne[i] = ne[i];
    }

    result->nb[0] = GGML_TYPE_SIZE[type];
    for (int i = 1; i < GGML_MAX_DIMS; i++) {
        result->nb[i] = result->nb[i - 1]*result->ne[i - 1];
    }

    ctx->n_objects++;

    return result;
}

struct ggml_tensor * ggml_new_tensor(
        struct ggml_context * ctx,
        enum ggml_type type,
        int n_dims,
        const int* ne) {
    return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL);
}

struct ggml_tensor * ggml_new_tensor_1d(
        struct ggml_context * ctx,
        enum ggml_type type,
        int ne0) {
    return ggml_new_tensor(ctx, type, 1, &ne0);
}
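Before walking through the implementation, a quick usage sketch of my own (assuming ggml_new_tensor_2d from the same header, which takes the two dimensions as extra arguments):

struct ggml_tensor * v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);     // 8 floats
struct ggml_tensor * m = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 4);  // 4 rows of 8 floats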
Going back to the listing: ggml_new_tensor_1d calls ggml_new_tensor, which in turn calls the implementation ggml_new_tensor_impl. The creation of a new tensor resembles appending to a linked list: as stated by Georgi, every new tensor object is placed at the end of the context's current memory pool, exactly where objects_end points. The ggml_object that tracks it is defined as:
struct ggml_object {
    size_t offset;
    size_t size;

    struct ggml_object * next;

    char padding[8];
};
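As a rough sketch of the resulting memory layout (my own illustration, not a diagram from the repository), the pool ends up holding an alternating sequence of object headers, tensor headers and tensor data:

// ctx->mem_buffer
// |ggml_object #1|ggml_tensor #1|data #1 ...|ggml_object #2|ggml_tensor #2|data #2 ...|
//                ^ obj1->offset points here                ^ obj2->offset points here
// obj1->next == obj2, ctx->objects_begin == obj1, ctx->objects_end == last object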
At first, all tensors are initialised with data == NULL. The key ingredient is the data type, which in ggml can be sizeof(int8_t), sizeof(int16_t), sizeof(int32_t) or sizeof(float). These sizes determine the amount of memory needed within the context, so each tensor is allocated exactly within its memory segment.
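For example, here is a back-of-the-envelope sketch of size_needed for a 1D float tensor with 10 elements, assuming GGML_TYPE_SIZE[GGML_TYPE_F32] == sizeof(float) == 4 and GGML_MEM_ALIGN == 16 (the values used in this commit):

size_t size_needed = 4;                          // GGML_TYPE_SIZE[GGML_TYPE_F32]
size_needed *= 10;                               // ne[0] -> 40 bytes of raw data
size_needed = ((size_needed + 16 - 1)/16)*16;    // align to GGML_MEM_ALIGN -> 48 bytes
size_needed += sizeof(struct ggml_tensor);       // plus the tensor header itself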
Finally, the object is created with all the retrieved information:
*obj_new = (struct ggml_object) {
    .offset = cur_end + GGML_OBJECT_SIZE,
    .size   = size_needed,
    .next   = NULL,
};
Once the location of the new tensor is computed, with struct ggml_tensor * const result = (struct ggml_tensor *)(mem_buffer + obj_new->offset);, the allocated tensor is initialised and returned:
*result = (struct ggml_tensor) {
    /*.type         =*/ type,
    /*.n_dims       =*/ n_dims,
    /*.ne           =*/ { 1, 1, 1, 1 },
    /*.nb           =*/ { 0, 0, 0, 0 },
    /*.op           =*/ GGML_OP_NONE,
    /*.is_param     =*/ false,
    /*.grad         =*/ NULL,
    /*.src0         =*/ NULL,
    /*.src1         =*/ NULL,
    /*.n_tasks      =*/ 0,
    /*.perf_runs    =*/ 0,
    /*.perf_cycles  =*/ 0,
    /*.perf_time_us =*/ 0,
    /*.data         =*/ data == NULL ? (void *)(result + 1) : data,
    /*.pad          =*/ { 0 },
};
Let's see a simple example for playing with 1D tensors, by defining a mathematical function f = ax² :
#include "ggml/ggml.h"
#include "utils.h"
int main(int argc, char ** argv) {
// define the memory parameters e.g. 16GB memory
struct ggml_init_params params = {.mem_size=16*1024*1024,
.mem_buffer=NULL,
};
// create a computational context
struct ggml_context * ctx = ggml_init(params);
// define the input tensors
struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// x is a variable parameters in our context
ggml_set_param(ctx, x);
// define a constant a
struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// return x^2
struct ggml_tensor *x2 = ggml_mul(ctx, x, x);
// compute f = ax^2
struct ggml_tensor *f = ggml_mul(ctx, a, x2);
return 0;
}
Before defining an input tensor we need to specify the memory parameters. In this case we are reserving 16 MB of memory, and this becomes part of the context ggml_context * ctx. Then we can start defining a first tensor, x, which will be the reference variable (e.g. we want to compute a gradient with respect to x). To make ggml aware that x is our main variable, we add it as a parameter of the context with ggml_set_param(ctx, x);
At this moment we are not performing any calculation. We are just instructing ggml about our function (or model) and how the tensors interact with each other. The take-home message is that every tensor carries a specific .op, an operation. All new tensors are initialised with GGML_OP_NONE, and this gets modified as soon as we call an operation on the tensor. Everything goes into a computational graph, so that a user can decide whether to compute the function's value or its gradient with respect to an input variable (in our case, for example, we could ask for the gradient with respect to x).
For example, ggml_mul modifies the operation of the resulting tensor: tensor->op is transformed from GGML_OP_NONE to GGML_OP_MUL:
struct ggml_tensor * ggml_mul_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    assert(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    if (inplace) {
        assert(is_node == false);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    // Here we are transforming the operation
    result->op   = GGML_OP_MUL;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_mul(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_mul_impl(ctx, a, b, false);
}
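As a quick sanity check (my own sketch, based only on the tensor fields listed earlier), after the calls in the example above the tensors describe the computation as follows:

// x->op  == GGML_OP_NONE                                    // a parameter leaf
// x2->op == GGML_OP_MUL,  x2->src0 == x,  x2->src1 == x     // x * x
// f->op  == GGML_OP_MUL,  f->src0  == a,  f->src1  == x2    // a * x^2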
These operations are encapsulated in a computation graph structure, which feeds the forward calculation at inference time for whatever model we're dealing with.
Forward computation and computation graph
So far we have only described the function f = ax². To actually perform the operations we need to create the computation graph. This is done as:
struct ggml_cgraph gf = ggml_build_forward(f);
// set initial params
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
// compute
ggml_graph_compute(ctx, &gf);
printf("k=%fn", ggml_get_f32_1d(f,0));
ggml_build_forward builds up the forward computation graph: it walks through all the nodes and returns a ggml_cgraph structure:
struct ggml_cgraph result = {
    /*.n_nodes      =*/ 0,
    /*.n_leafs      =*/ 0,
    /*.n_threads    =*/ 0,
    /*.work_size    =*/ 0,
    /*.work         =*/ NULL,
    /*.nodes        =*/ { NULL },
    /*.grads        =*/ { NULL },
    /*.leafs        =*/ { NULL },
    /*.perf_runs    =*/ 0,
    /*.perf_cycles  =*/ 0,
    /*.perf_time_us =*/ 0,
};
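To check these counters directly, a minimal sketch (using only ggml_build_forward and the n_nodes / n_leafs fields shown above) could be:

struct ggml_cgraph gf = ggml_build_forward(f);
printf("n_nodes = %d, n_leafs = %d\n", gf.n_nodes, gf.n_leafs);   // expected: 3 nodes, 1 leaf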
For the example above the code returns a graph with 3 nodes, given by x, x^2 and a*x^2, and 1 leaf. It's possible to get a visual representation of the graph through the ggml_graph_dump_dot function:
// if `gf` has not been defined above, build it first
struct ggml_cgraph gf = ggml_build_forward(f);

ggml_graph_dump_dot(&gf, &gf, "name_of_your_graph");
where &gf is the reference to the graph structure, and "name_of_your_graph" refers to the name of the dot file that ggml produces. If you want to convert this to an image you just need to run:
dot -Tpng name_of_your_graph -o name_of_your_graph.png && open name_of_your_graph.png
For our example the graph is:

As we'll see later, we can associate values with our variables (e.g. in this case a = 3.0), and the graph contains the following:
- An initial yellow node, with a GGML_OP_NONE operation, defining x
- A GGML_OP_MUL operation, which is x*x
- A leaf, in pink, that refers to the value of another variable (a)
- The final node, in green, which is another GGML_OP_MUL, giving a*x^2
Once all the tensors have been allocated, we have a final graph with all the operations to perform, starting from the parameter variables, x in our case.
Compute operations in the graph
ggml_compute_forward is where all the calculations are run. The input parameters of this function are struct ggml_compute_params * params and struct ggml_tensor * tensor. The params describe the current compute task, while the tensor carries the operation to run within the graph. Through a switch...case statement the appropriate forward function is called:
switch (tensor->op) {
    case GGML_OP_DUP:
        {
            ggml_compute_forward_dup(params, tensor->src0, tensor);
        } break;
    case GGML_OP_ADD:
        {
            ggml_compute_forward_add(params, tensor->src0, tensor->src1, tensor);
        } break;
    case GGML_OP_SUB:
        {
            ggml_compute_forward_sub(params, tensor->src0, tensor->src1, tensor);
        } break;
    case GGML_OP_MUL:
        {
            ggml_compute_forward_mul(params, tensor->src0, tensor->src1, tensor);
        } break;
    ...
Each single operation is then specialised based on the input type of the tensor:
void ggml_compute_forward_mul(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_mul_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                assert(false);
            } break;
    }
}
For the 0f4e99b commit only the GGML_TYPE_F32 case was implemented. This calls the main multiplication implementation:
void ggml_compute_forward_mul_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));
    assert(src1->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        // row pointers, spelled out for clarity: destination row, first factor, second factor
        float * x = (float *) ((char *) dst->data  + i*( dst->nb[1]));
        float * y = (float *) ((char *) src0->data + i*(src0->nb[1]));
        float * z = (float *) ((char *) src1->data + i*(src1->nb[1]));

        ggml_vec_mul_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])),
                (float *) ((char *) src1->data + i*(src1->nb[1])));
    }
}
The core of the operation is in the for loop. In this loop we are dealing with the result row x of dst, the multiplicand from src0, and the multiplier from src1. In particular:
- (char *) dst->data converts the data pointer of dst to a char*. This is done because pointer arithmetic should be performed in bytes, and char* is the most flexible type for this purpose.
- i * (dst->nb[1]) calculates the offset in bytes for the current row. Since i increments in each iteration, this effectively moves to the next row in memory, taking into account the striding information.
- Finally, (float *) is used to cast the result back to a float*, to ensure that these pointers are interpreted as pointers to floating-point values.
In the context of numerical computing and tensor operations, striding refers to the step size, typically measured in bytes, between consecutive elements along a particular dimension of a tensor. Understanding and correctly handling striding is crucial for efficient tensor operations and memory management.
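As a worked example (my own numbers, assuming a 2D F32 tensor created with ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3), i.e. 3 rows of 4 floats), the strides computed in ggml_new_tensor_impl would be:

// ne = {4, 3, 1, 1}
// nb[0] = sizeof(float)  = 4    bytes between consecutive elements of a row
// nb[1] = nb[0] * ne[0]  = 16   bytes between consecutive rows
// element (row i, column j) lives at: (char *) data + i*nb[1] + j*nb[0]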
The operation ggml_vec_mul_f32
performs the final multiplication as:
inline static void ggml_vec_mul_f32(const int n, float * z, const float * x, const float * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = x[i]*y[i];
    }
}
Inline functions are a mechanism provided by C (via the inline
keyword) to suggest to the compiler that a particular function should be expanded "in place" at the call site rather than being called as a separate function. When you call a regular function, there is some overhead associated with it. This includes pushing parameters onto the stack, setting up a new stack frame, and performing the return operation. For very small and frequently used functions, this overhead can be relatively expensive compared to the actual computation. Inlining eliminates this overhead because the code is inserted directly at the call site. Inlining allows the compiler to perform optimizations that wouldn't be possible if the function were called normally. For example, when a function is inlined, the compiler can see its code in the context of the caller and optimize accordingly. This might include constant folding, dead code elimination, and other optimizations that can lead to faster code.
A final simple example
We're now ready to implement a full program in ggml, computing the function f = ax² for some values. Under the folder examples we can create a new folder called simple_example. There we'll have the main file main.cpp:
#include "ggml/ggml.h"
#include "utils.h"
int main(int argc, char ** argv) {
struct ggml_init_params params = {.mem_size=16*1024*1024,
.mem_buffer=NULL,
};
// params set up
struct ggml_context * ctx = ggml_init(params);
// tensors
struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// x as a parameter
ggml_set_param(ctx, x);
struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor *x2 = ggml_mul(ctx, x, x);
struct ggml_tensor *f = ggml_mul(ctx, a, x2);
// build the graph for the operations
struct ggml_cgraph gf = ggml_build_forward(f);
// set initial params
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
// compute
ggml_graph_compute(ctx, &gf);
printf("k=%fn", ggml_get_f32_1d(f,0));
// print the graph
ggml_graph_print(&gf);
// save the graph
ggml_graph_dump_dot(&gf, &gf, "final_graph");
return 0;
}
Within the same folder we'll need a CMakeLists.txt file, so we can compile the code with the ggml library:
#
# simple_example
set(TEST_TARGET simple_example)
add_executable(${TEST_TARGET} main.cpp)
target_link_libraries(${TEST_TARGET} PRIVATE ggml ggml_utils)
Finally add the following line at the end of the file examples/CMakeLists.txt
: add_subdirectory(simple_example)
Now, everything is all interconnected and can be correctly compiled and run.
Compilation and run
Back in the ggml folder, as explained in the README.md file, create a folder called build and run the following commands:
mkdir build
cd build
cmake ../
make simple_example
This will compile the ggml library and create a binary for the simple_example code. We're then ready to run it by simply typing ./bin/simple_example. The code will execute the calculation and print the graph information, with all the nodes and leaves and their associated operations; for each operation an estimate of the computational time is given. Remember, if you want to plot the final graph you need to run dot -Tpng final_graph -o final_graph.png && open final_graph.png
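With x = 2.0 and a = 3.0 the program should report a value of 12, since f = 3·2² = 12; the printf in main.cpp labels it k, so the printed line looks roughly like this (the surrounding graph printout and timings will vary from run to run):

k=12.000000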
Final remarks on this first part
In this first article, we started to understand in depth how ggml works and what its underlying philosophy is. In particular, we had a deep dive into:
- ggml_context and how memory is initialised and used within the ggml library
- How to initialise a new 1D tensor and the protocol implementations within ggml
- How the graph computation works, how to retrieve the computation graph and how to plot it
- A simple example, initialising a mathematical function and getting back its computational graph
In the next post we'll be dealing with LLMs, in particular GPT. We'll see how to implement and use them within ggml and, finally, run GPT models on our laptop.
Support my writing
If you enjoyed my article, please support my writing by joining Medium's membership through the link below