Optimizing LLMs with C, and Running GPT, Llama, and Whisper on Your Laptop

Table of contents
- Implementing a simple mathematical function
  1.1 The definition of a context
  1.2 Initialising tensors
  1.3 Forward computation and computation graph
  1.4 Compilation and run
- Final remarks on this first part
- Support my writing
Large language models (LLMs) are hyped everywhere. Newspapers spend tons of words describing a new incoming world, assuring us that "AI has finally arrived". Although LLMs are having a tangible impact on our lives, we should stay calm and analyse the whole situation critically. The LLM hype reminds me of the hype around "data scientist" jobs a few years ago. Back in 2014, when I started my PhD, I saw a steady increase in data scientist job openings, which peaked around 2018. At that time the news was similarly breathless: "Data scientist: the $1M profession" or "The sexiest job of the 21st century" – don't these titles sound familiar today?
On one side, LLMs are a GREAT technology and a step towards a more general AI framework. These models are the starting point for a deeper journey into AI, and I am sure that one day most apps and technologies will rely on them. However, I often see, on Medium too, a lack of clarity about these models. Regardless of their power and fantastic results, these models are too heavy to be easily trained or run. Companies therefore need to understand LLMs very well before making any strategic business move. One of the most poignant points is the huge memory cost of these models, the large infrastructure they need for training, and the costly infrastructure they need for inference.
If we look at the basic LLM structure, namely the transformer, we can recognise the classical encoder-decoder structure. At inference time, the decoder needs an in-memory mechanism to establish how much attention score to give to a specific input token. This score depends on the token's position in the sentence, as well as on its agreement with the rest of the context. Such a mechanism is known as the KV cache. Given the size of this matrix, it's very easy to eat up 3 TB of memory for simple models with a context length of 2048. To further speed up this calculation we need to run things on GPUs. Finally, the entire decoder structure is hard to parallelise.
Given this prelude, is it possible to find a compromise, a tradeoff solution, that allows us to run these calculations on simpler infrastructure? This article shows you how Georgi Gerganov implemented a new optimised C-based tensor library called ggml. The commit I am referring to throughout the text is 0f4e99b, which dates back to September 2022, at the beginning of the ggml adventure. The rationale is to start from this early code base, to give you a strong understanding of the entire package.
Implementing a simple mathematical function
Before jumping to LLMs, which will come in a second article soon, let's decompose the key elements of the library by computing values for a very simple function: f = ax².
The definition of a context
Everything in ggml starts within a context. The context defines the memory requirements needed to fit all the tensors of a given model, and it is created starting from a global state:
// global state
struct ggml_state {
    struct ggml_context_container contexts[GGML_MAX_CONTEXTS];
};

struct ggml_state g_state;
The global state is built from a ggml_context_container, which is:
struct ggml_context_container {
    bool used;

    struct ggml_context context;
};
In the container we can notice the presence of the central element of this first version of ggml, which is the ggml_context:
struct ggml_context {
    size_t mem_size;
    void * mem_buffer;
    bool   mem_buffer_owned;

    int n_objects;

    struct ggml_object * objects_begin;
    struct ggml_object * objects_end;
};
ggml_context contains all the information about how much memory we can use, as well as a memory buffer, so that there is sufficient room even when we don't know in advance how many bytes a tensor will occupy.
The context is then used to initialise the entire process. ggml_init
starts up the initialisation process and returns:
*ctx = (struct ggml_context) {
    .mem_size         = params.mem_size,
    .mem_buffer       = params.mem_buffer ? params.mem_buffer : malloc(params.mem_size),
    .mem_buffer_owned = params.mem_buffer ? false : true,
    .n_objects        = 0,
    .objects_begin    = NULL,
    .objects_end      = NULL,
};
*ctx is a new context pointer. We can inspect its fields by adding GGML_PRINT statements within the source code, for example:
GGML_PRINT("%s: context %p with %zu bytes of memoryn", __func__, (void *) ctx, ctx->mem_size);
GGML_PRINT("%s: context %p with %d objectsn", __func__, (void *) ctx, ctx->n_objects);
GGML_PRINT("%s: context %p with %p object starting positionn", __func__, (void *) ctx, (void *) ctx->objects_begin);
GGML_PRINT("%s: context %p with %p object ending positionn", __func__, (void *) ctx, (void *) ctx->objects_end);
On my Apple MacBook M2 Pro, the context is initialised with 16 MB of memory, 0 objects, and a fresh memory layout, with address 0x0 for both objects_begin and objects_end. These two fields are indeed the next step: they are the memory addresses used when creating tensors within the ggml_context.
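To make this concrete, here is a minimal sketch of my own for creating and releasing a context, assuming only ggml_init, ggml_free and the two ggml_init_params fields shown above:

#include "ggml/ggml.h"

int main(void) {
    // reserve a 16 MB pool; passing NULL lets ggml allocate and own the buffer
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };
    struct ggml_context * ctx = ggml_init(params);
    // ... create tensors and build graphs here ...
    ggml_free(ctx);   // releases the buffer, because mem_buffer_owned is true
    return 0;
}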
Initialising tensors
For every function within ggml you will always find a protocol implementation, for example:
function_with_attribute
→ function_general
→ function_implementation
function_with_attribute is a function with a specific task, for example ggml_new_tensor_1d or ggml_new_tensor_2d, which generate a 1D or 2D tensor respectively. This specific function calls a function_general, which is the general layout of the implementation: for example, ggml_new_tensor_Xd will call ggml_new_tensor. Finally, function_general calls the implementation, function_implementation. In this way, every time we need to modify the code we just act on the implementation, rather than modifying all the specific functions.
To create a 1D tensor we can use ggml_new_tensor_1d. Following the implementation protocol, ggml_new_tensor_1d is coded as follows:
struct ggml_tensor * ggml_new_tensor_impl(
        struct ggml_context * ctx,
        enum ggml_type type,
        int n_dims,
        const int* ne,
        void* data) {
    // always insert objects at the end of the context's memory pool
    struct ggml_object * obj_cur = ctx->objects_end;

    const size_t cur_offset = obj_cur == NULL ? 0 : obj_cur->offset;
    const size_t cur_size   = obj_cur == NULL ? 0 : obj_cur->size;
    const size_t cur_end    = cur_offset + cur_size;

    size_t size_needed = 0;

    if (data == NULL) {
        size_needed += GGML_TYPE_SIZE[type];
        GGML_PRINT("Size needed %zu ", size_needed);
        for (int i = 0; i < n_dims; i++) {
            size_needed *= ne[i];
        }
        // align to GGML_MEM_ALIGN
        size_needed = ((size_needed + GGML_MEM_ALIGN - 1)/GGML_MEM_ALIGN)*GGML_MEM_ALIGN;
    }
    size_needed += sizeof(struct ggml_tensor);

    if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
        GGML_PRINT("\n%s: not enough space in the context's memory pool\n", __func__);
        assert(false);
        return NULL;
    }

    char * const mem_buffer = ctx->mem_buffer;

    struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);

    *obj_new = (struct ggml_object) {
        .offset = cur_end + GGML_OBJECT_SIZE,
        .size   = size_needed,
        .next   = NULL,
    };

    if (obj_cur != NULL) {
        obj_cur->next = obj_new;
    } else {
        // this is the first object in this context
        ctx->objects_begin = obj_new;
    }

    ctx->objects_end = obj_new;

    struct ggml_tensor * const result = (struct ggml_tensor *)(mem_buffer + obj_new->offset);

    ggml_assert_aligned(result);

    *result = (struct ggml_tensor) {
        /*.type         =*/ type,
        /*.n_dims       =*/ n_dims,
        /*.ne           =*/ { 1, 1, 1, 1 },
        /*.nb           =*/ { 0, 0, 0, 0 },
        /*.op           =*/ GGML_OP_NONE,
        /*.is_param     =*/ false,
        /*.grad         =*/ NULL,
        /*.src0         =*/ NULL,
        /*.src1         =*/ NULL,
        /*.n_tasks      =*/ 0,
        /*.perf_runs    =*/ 0,
        /*.perf_cycles  =*/ 0,
        /*.perf_time_us =*/ 0,
        /*.data         =*/ data == NULL ? (void *)(result + 1) : data,
        /*.pad          =*/ { 0 },
    };

    ggml_assert_aligned(result->data);

    for (int i = 0; i < n_dims; i++) {
        result->ne[i] = ne[i];
    }

    result->nb[0] = GGML_TYPE_SIZE[type];
    for (int i = 1; i < GGML_MAX_DIMS; i++) {
        result->nb[i] = result->nb[i - 1]*result->ne[i - 1];
    }

    ctx->n_objects++;

    return result;
}

struct ggml_tensor * ggml_new_tensor(
        struct ggml_context * ctx,
        enum ggml_type type,
        int n_dims,
        const int* ne) {
    return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL);
}

struct ggml_tensor * ggml_new_tensor_1d(
        struct ggml_context * ctx,
        enum ggml_type type,
        int ne0) {
    return ggml_new_tensor(ctx, type, 1, &ne0);
}
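Before walking through the implementation, a quick usage sketch of my own (assuming ggml_new_tensor_2d from the same header, which takes the two dimensions as extra arguments):

struct ggml_tensor * v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);     // 8 floats
struct ggml_tensor * m = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 4);  // 4 rows of 8 floats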
Going back to the listing: ggml_new_tensor_1d calls ggml_new_tensor, which in turn calls the implementation ggml_new_tensor_impl. The creation of a new tensor resembles appending to a linked list: as stated by Georgi, every new tensor object is placed at the end of the context's current memory pool, exactly where objects_end points. The ggml_object that tracks it is defined as:
struct ggml_object {
    size_t offset;
    size_t size;

    struct ggml_object * next;

    char padding[8];
};
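As a rough sketch of the resulting memory layout (my own illustration, not a diagram from the repository), the pool ends up holding an alternating sequence of object headers, tensor headers and tensor data:

// ctx->mem_buffer
// |ggml_object #1|ggml_tensor #1|data #1 ...|ggml_object #2|ggml_tensor #2|data #2 ...|
//                ^ obj1->offset points here                ^ obj2->offset points here
// obj1->next == obj2, ctx->objects_begin == obj1, ctx->objects_end == last object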
At first, all tensors are initialised with data == NULL. The key ingredient is the data type, which in ggml can be sizeof(int8_t), sizeof(int16_t), sizeof(int32_t) or sizeof(float). These sizes determine the amount of memory needed within the context, so each tensor is allocated exactly within its memory segment.
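For example, here is a back-of-the-envelope sketch of size_needed for a 1D float tensor with 10 elements, assuming GGML_TYPE_SIZE[GGML_TYPE_F32] == sizeof(float) == 4 and GGML_MEM_ALIGN == 16 (the values used in this commit):

size_t size_needed = 4;                          // GGML_TYPE_SIZE[GGML_TYPE_F32]
size_needed *= 10;                               // ne[0] -> 40 bytes of raw data
size_needed = ((size_needed + 16 - 1)/16)*16;    // align to GGML_MEM_ALIGN -> 48 bytes
size_needed += sizeof(struct ggml_tensor);       // plus the tensor header itself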
Finally, the object is created with all the retrieved information:
*obj_new = (struct ggml_object) {
    .offset = cur_end + GGML_OBJECT_SIZE,
    .size   = size_needed,
    .next   = NULL,
};
Once the location of the new tensor is computed, with struct ggml_tensor * const result = (struct ggml_tensor *)(mem_buffer + obj_new->offset);, the allocated tensor is initialised and returned:
*result = (struct ggml_tensor) {
    /*.type         =*/ type,
    /*.n_dims       =*/ n_dims,
    /*.ne           =*/ { 1, 1, 1, 1 },
    /*.nb           =*/ { 0, 0, 0, 0 },
    /*.op           =*/ GGML_OP_NONE,
    /*.is_param     =*/ false,
    /*.grad         =*/ NULL,
    /*.src0         =*/ NULL,
    /*.src1         =*/ NULL,
    /*.n_tasks      =*/ 0,
    /*.perf_runs    =*/ 0,
    /*.perf_cycles  =*/ 0,
    /*.perf_time_us =*/ 0,
    /*.data         =*/ data == NULL ? (void *)(result + 1) : data,
    /*.pad          =*/ { 0 },
};
Let's see a simple example for playing with 1D tensors, by defining a mathematical function f = ax² :
#include "ggml/ggml.h"
#include "utils.h"
int main(int argc, char ** argv) {
// define the memory parameters e.g. 16GB memory
struct ggml_init_params params = {.mem_size=16*1024*1024,
.mem_buffer=NULL,
};
// create a computational context
struct ggml_context * ctx = ggml_init(params);
// define the input tensors
struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// x is a variable parameters in our context
ggml_set_param(ctx, x);
// define a constant a
struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// return x^2
struct ggml_tensor *x2 = ggml_mul(ctx, x, x);
// compute f = ax^2
struct ggml_tensor *f = ggml_mul(ctx, a, x2);
return 0;
}
Before defining an input tensor we need to specify the memory parameters. In this case we are reserving 16 MB of memory, and this becomes part of the context ggml_context * ctx. Then we can start defining a first tensor, x, which will be the reference variable (e.g. we want to compute a gradient with respect to x). To make ggml aware that x is our main variable, we add it as a parameter of the context with ggml_set_param(ctx, x);
At this moment we are not performing any calculation. We are just instructing ggml about our function (or model) and how the tensors interact with each other. The take-home message is that every tensor carries a specific .op, an operation. All new tensors are initialised with GGML_OP_NONE, and this gets modified as soon as we call an operation on the tensor. Everything goes into a computational graph, so that a user can decide whether to compute the function's value or its gradient with respect to an input variable (in our case, for example, we could ask for the gradient with respect to x).
For example, ggml_mul modifies the operation of the resulting tensor: tensor->op is transformed from GGML_OP_NONE to GGML_OP_MUL:
struct ggml_tensor * ggml_mul_impl(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b,
        bool inplace) {
    assert(ggml_are_same_shape(a, b));

    bool is_node = false;

    if (!inplace && (a->grad || b->grad)) {
        is_node = true;
    }

    if (inplace) {
        assert(is_node == false);
    }

    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);

    // Here we are transforming the operation
    result->op   = GGML_OP_MUL;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src0 = a;
    result->src1 = b;

    return result;
}

struct ggml_tensor * ggml_mul(
        struct ggml_context * ctx,
        struct ggml_tensor * a,
        struct ggml_tensor * b) {
    return ggml_mul_impl(ctx, a, b, false);
}
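As a quick sanity check (my own sketch, based only on the tensor fields listed earlier), after the calls in the example above the tensors describe the computation as follows:

// x->op  == GGML_OP_NONE                                    // a parameter leaf
// x2->op == GGML_OP_MUL,  x2->src0 == x,  x2->src1 == x     // x * x
// f->op  == GGML_OP_MUL,  f->src0  == a,  f->src1  == x2    // a * x^2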
These operations are encapsulated in a computation graph structure, which feeds the forward calculation at inference time for whatever model we're dealing with.
Forward computation and computation graph
So far we have only described the function f = ax². To actually perform the operations we need to create the computation graph. This is done as:
struct ggml_cgraph gf = ggml_build_forward(f);
// set initial params
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
// compute
ggml_graph_compute(ctx, &gf);
printf("k=%fn", ggml_get_f32_1d(f,0));
ggml_build_forward builds up the forward computation graph: it walks through all the nodes and returns a ggml_cgraph structure:
struct ggml_cgraph result = {
    /*.n_nodes      =*/ 0,
    /*.n_leafs      =*/ 0,
    /*.n_threads    =*/ 0,
    /*.work_size    =*/ 0,
    /*.work         =*/ NULL,
    /*.nodes        =*/ { NULL },
    /*.grads        =*/ { NULL },
    /*.leafs        =*/ { NULL },
    /*.perf_runs    =*/ 0,
    /*.perf_cycles  =*/ 0,
    /*.perf_time_us =*/ 0,
};
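To check these counters directly, a minimal sketch (using only ggml_build_forward and the n_nodes / n_leafs fields shown above) could be:

struct ggml_cgraph gf = ggml_build_forward(f);
printf("n_nodes = %d, n_leafs = %d\n", gf.n_nodes, gf.n_leafs);   // expected: 3 nodes, 1 leaf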
For the example above the code returns a graph with 3 nodes, given by x, x^2 and a*x^2, and 1 leaf. It's possible to get a visual representation of the graph through the ggml_graph_dump_dot function:
// if `gf` has not been defined above, build it first
struct ggml_cgraph gf = ggml_build_forward(f);

ggml_graph_dump_dot(&gf, &gf, "name_of_your_graph");
where &gf is the reference to the graph structure, and "name_of_your_graph" refers to the name of the dot file that ggml produces. If you want to convert this to an image you just need to run:
dot -Tpng name_of_your_graph -o name_of_your_graph.png && open name_of_your_graph.png
For our example the graph is:

As we'll see later, we can associate values with our variables (e.g. in this case a = 3.0), and the graph contains the following:
- An initial yellow node, with a GGML_OP_NONE operation, defining x
- A GGML_OP_MUL operation, which is x*x
- A leaf, in pink, that refers to the value of another variable (a)
- The final node, in green, which is another GGML_OP_MUL, giving a*x^2
Once all the tensors have been allocated, we have a final graph with all the operations to perform, starting from the parameter variables, x in our case.
Compute operations in the graph
ggml_compute_forward is where all the calculations are run. The input parameters of this function are struct ggml_compute_params * params and struct ggml_tensor * tensor. The params describe the current compute task, while the tensor carries the operation to run within the graph. Through a switch...case statement the appropriate forward function is called:
switch (tensor->op) {
    case GGML_OP_DUP:
        {
            ggml_compute_forward_dup(params, tensor->src0, tensor);
        } break;
    case GGML_OP_ADD:
        {
            ggml_compute_forward_add(params, tensor->src0, tensor->src1, tensor);
        } break;
    case GGML_OP_SUB:
        {
            ggml_compute_forward_sub(params, tensor->src0, tensor->src1, tensor);
        } break;
    case GGML_OP_MUL:
        {
            ggml_compute_forward_mul(params, tensor->src0, tensor->src1, tensor);
        } break;
    ...
Each single operation is then specialised based on the input type of the tensor:
void ggml_compute_forward_mul(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    switch (src0->type) {
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_mul_f32(params, src0, src1, dst);
            } break;
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_F16:
        case GGML_TYPE_COUNT:
            {
                assert(false);
            } break;
    }
}
For the 0f4e99b commit only the GGML_TYPE_F32 case was implemented. This calls the main multiplication implementation:
void ggml_compute_forward_mul_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, src1) && ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));
    assert(src1->nb[0] == sizeof(float));

    for (int i = 0; i < n; i++) {
        // row pointers, spelled out for clarity: destination row, first factor, second factor
        float * x = (float *) ((char *) dst->data  + i*( dst->nb[1]));
        float * y = (float *) ((char *) src0->data + i*(src0->nb[1]));
        float * z = (float *) ((char *) src1->data + i*(src1->nb[1]));

        ggml_vec_mul_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])),
                (float *) ((char *) src1->data + i*(src1->nb[1])));
    }
}
The core of the operation is in the for loop. In this loop we are dealing with the result row x of dst, the multiplicand from src0, and the multiplier from src1. In particular:
- (char *) dst->data converts the data pointer of dst to a char*. This is done because pointer arithmetic should be performed in bytes, and char* is the most flexible type for this purpose.
- i * (dst->nb[1]) calculates the offset in bytes for the current row. Since i increments in each iteration, this effectively moves to the next row in memory, taking into account the striding information.
- Finally, (float *) is used to cast the result back to a float*, to ensure that these pointers are interpreted as pointers to floating-point values.
In the context of numerical computing and tensor operations, striding refers to the step size, typically measured in bytes, between consecutive elements along a particular dimension of a tensor. Understanding and correctly handling striding is crucial for efficient tensor operations and memory management.
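As a worked example (my own numbers, assuming a 2D F32 tensor created with ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3), i.e. 3 rows of 4 floats), the strides computed in ggml_new_tensor_impl would be:

// ne = {4, 3, 1, 1}
// nb[0] = sizeof(float)  = 4    bytes between consecutive elements of a row
// nb[1] = nb[0] * ne[0]  = 16   bytes between consecutive rows
// element (row i, column j) lives at: (char *) data + i*nb[1] + j*nb[0]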
The operation ggml_vec_mul_f32
performs the final multiplication as:
inline static void ggml_vec_mul_f32(const int n, float * z, const float * x, const float * y) {
    for (int i = 0; i < n; ++i) {
        z[i] = x[i]*y[i];
    }
}
Inline functions are a mechanism provided by C (via the inline
keyword) to suggest to the compiler that a particular function should be expanded "in place" at the call site rather than being called as a separate function. When you call a regular function, there is some overhead associated with it. This includes pushing parameters onto the stack, setting up a new stack frame, and performing the return operation. For very small and frequently used functions, this overhead can be relatively expensive compared to the actual computation. Inlining eliminates this overhead because the code is inserted directly at the call site. Inlining allows the compiler to perform optimizations that wouldn't be possible if the function were called normally. For example, when a function is inlined, the compiler can see its code in the context of the caller and optimize accordingly. This might include constant folding, dead code elimination, and other optimizations that can lead to faster code.
A final simple example
We're now ready to implement a full program in ggml, computing the function f = ax² for some values. Under the folder examples we can create a new folder called simple_example. There we'll have the main file main.cpp:
#include "ggml/ggml.h"
#include "utils.h"
int main(int argc, char ** argv) {
struct ggml_init_params params = {.mem_size=16*1024*1024,
.mem_buffer=NULL,
};
// params set up
struct ggml_context * ctx = ggml_init(params);
// tensors
struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
// x as a parameter
ggml_set_param(ctx, x);
struct ggml_tensor *a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor *x2 = ggml_mul(ctx, x, x);
struct ggml_tensor *f = ggml_mul(ctx, a, x2);
// build the graph for the operations
struct ggml_cgraph gf = ggml_build_forward(f);
// set initial params
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
// compute
ggml_graph_compute(ctx, &gf);
printf("k=%fn", ggml_get_f32_1d(f,0));
// print the graph
ggml_graph_print(&gf);
// save the graph
ggml_graph_dump_dot(&gf, &gf, "final_graph");
return 0;
}
Within the same folder we'll need a CMakeLists.txt file, so we can compile the code with the ggml library:
#
# simple_example
set(TEST_TARGET simple_example)
add_executable(${TEST_TARGET} main.cpp)
target_link_libraries(${TEST_TARGET} PRIVATE ggml ggml_utils)
Finally add the following line at the end of the file examples/CMakeLists.txt
: add_subdirectory(simple_example)
Now, everything is all interconnected and can be correctly compiled and run.
Compilation and run
Back in the ggml folder, as explained in the README.md file, create a folder called build and run the following commands:
mkdir build
cd build
cmake ../
make simple_example
This will compile the ggml library and create a binary for the simple_example code. We're then ready to run it by simply typing ./bin/simple_example. The code will execute the calculation and print the graph information, with all the nodes and leaves and their associated operations; for each operation an estimate of the computational time is given. Remember, if you want to plot the final graph you need to run dot -Tpng final_graph -o final_graph.png && open final_graph.png
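With x = 2.0 and a = 3.0 the program should report a value of 12, since f = 3·2² = 12; the printf in main.cpp labels it k, so the printed line looks roughly like this (the surrounding graph printout and timings will vary from run to run):

k=12.000000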
Final remarks on this first part
In this first article, we started to understand in depth how ggml works and what its underlying philosophy is. In particular, we had a deep dive into:
- ggml_context and how memory is initialised and used within the ggml library
- How to initialise a new 1D tensor and the protocol implementations within ggml
- How the graph computation works, how to retrieve the computation graph and how to plot it
- A simple example, initialising a mathematical function and getting back its computational graph
In the next post we'll be dealing with LLMs, in particular GPT. We'll see how to implement and use them within ggml and, finally, run GPT models on our laptop.
Support my writing
If you enjoyed my article, please support my writing by joining Medium's membership through the link below