Four Simple Steps to Build a Custom Self-hosted Llama3 Application

Author: Murphy
Source: by author (Ideogram).

Background: LLM Applications

In recent months, we have witnessed a surge in solutions for developing Large Language Model (LLM) applications. Here are the popular approaches.

1. Cloud-based online platforms

Platforms like OpenAI's GPT Store and Hugging Face Spaces allow developers to focus on prompt engineering and interaction design without configuring hardware, environments, or web frameworks. However, they have the following limitations:

  • Privacy concerns around personal or commercial information.
  • Latency due to remote servers and shared GPU resource pools.
  • Cost for remote Application Programming Interface (API) calls or on-demand servers.

2. Managed self-hosted applications

Self-hosted applications relying on a managed stack or framework like Ollama + Open WebUI offer ready-to-use templates for running various LLM applications locally. This solution draws attention because the state-of-the-art Llama 3 (8B) model can easily run on a PC with a 16 GB GPU. However, this solution is limited by:

  • Complexity in setup and maintenance.
  • Inflexibility due to limited customization.

3. Custom self-hosted applications

To overcome the limitations of the managed self-hosted solution, an alternative is to create custom self-hosted applications, which use custom-built components across the stack.

This article introduces this solution through a simple but common case: developing a Llama 3 chat assistant locally and making it publicly accessible. We keep full control over customization by writing the code directly with Hugging Face's transformers library, the Tornado web framework, and native JS/HTML/CSS (a possible project layout is sketched after Figure 1). Finally, the application is exposed to the public through the Ngrok reverse proxy. Let's start this minimal "full-stack" development process!

Figure 1. Solution for a custom self-hosted application. Source: by author.
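For reference, the following steps assume a small project folder. The web/index.html, web/script.js, and web/style.css paths are the ones referenced by the Tornado handlers in step 2; the name app.py for the Python entry point is only a placeholder:

app.py              # model loading, generate_response, and the Tornado web service
web/
    index.html      # chat page markup
    script.js       # front-end logic
    style.css       # styling (not shown in this article)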

Steps for Developing a Custom Self-hosted App

1. LLM development: inference pipeline using Hugging Face

The first step is to create a model inference pipeline to generate chat responses. We load the instruction-tuned Llama 3 8B model with Hugging Face's transformers library. To fit the model into a 16 GB GPU, we apply 4-bit quantization.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantization for decreasing memory consumption
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir="models",
    quantization_config=quantization_config
)
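As a quick sanity check on the memory budget, the transformers library exposes get_memory_footprint(), which reports how much memory the loaded parameters and buffers occupy; with 4-bit quantization the 8B model should land well below the 16 GB limit.

# Optional: confirm the quantized model fits the 16 GB GPU budget
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")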

The generate_response function executes the model inference pipeline. Instead of merely responding to the latest message, a chat assistant interacts with users based on the full conversation history. This feature, known as context management, can be implemented simply by maintaining a dict of message lists keyed by user (conversation_histories). Remember to adapt the messages to the instruction format by applying tokenizer.apply_chat_template.

def generate_response(user_id: str,
                      new_message: str,
                      conversation_histories: dict,
                      tokenizer,
                      model):
    # Retrieve the past messages for this user
    history = conversation_histories.get(user_id, [])

    # Add the new user message to the history
    history.append({"role": "user", "content": new_message})

    # Apply the chat template to make the message follow the instruction format
    input_ids = tokenizer.apply_chat_template(
        history,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    try:
        # Generate a response from the model
        outputs = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
            eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
        )

        # Decode only the newly generated tokens, i.e., the model's latest response
        response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

        # Append the assistant's reply to the conversation history
        history.append({"role": "assistant", "content": response})
        conversation_histories[user_id] = history

        return response
    except Exception as e:
        print(f"Error during model interaction: {e}")
        return "Sorry, I encountered an error. Could you please rephrase or try again later?"

2. Back-end development: web service using Tornado

A web framework exposes the model inference pipeline as a web service. Here we use Tornado, which builds web APIs from request handlers: MainHandler renders the web page when it is accessed, and ChatHandler forwards the user's message to the LLM and returns its response. Once the program below runs, the web service keeps listening for incoming requests.

import tornado.ioloop
import tornado.web
import json

# Store conversation histories for context management
conversation_histories = {}

class MainHandler(tornado.web.RequestHandler):

    def get(self):
        self.render("web/index.html")

class ChatHandler(tornado.web.RequestHandler):

    def post(self):
        data = json.loads(self.request.body)
        user_id = data.get('user_id', '')
        user_input = data.get('message', '')
        response = generate_response(user_id, user_input, conversation_histories, tokenizer, model)
        self.write({'response': response})

def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
        (r"/chat", ChatHandler),
        (r"/(style.css)", tornado.web.StaticFileHandler, {"path": "web/"}),
        (r"/(script.js)", tornado.web.StaticFileHandler, {"path": "web/"})
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
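To check the back end without any front end, a small client sketch can post a message to the /chat endpoint (assuming the server above is running on port 8888 and the requests package is installed; any HTTP client works equally well):

import requests

# Send one chat message and print the model's reply
resp = requests.post(
    "http://localhost:8888/chat",
    json={"user_id": "test-user", "message": "Hello! Who are you?"},
)
print(resp.json()["response"])

Note that ChatHandler calls generate_response synchronously, so requests are processed one at a time; this is fine for a personal demo but would need an asynchronous or queued design to serve many concurrent users.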

3. Front-end development: chat UI using native JS/HTML/CSS

A front-end chat UI works with the back-end web service to create an interactive app. Although there are many low-code UI tools in Python, such as Panel and PyWebIO, native JS/HTML/CSS keeps our demo simple and fully customizable. The simple UI layout consists of dynamic dialog boxes and a static message input, as shown in Figure 2.

Figure 2. LLM assistant UI. Source: by author.

The HTML and JS files are presented below. In the HTML file, the showdown.min.js library is imported to render the model's Markdown output inside the dialog boxes. The JS file consists of four parts: (1) a getUserId function to create or retrieve the current user ID, (2) a sendMessage function to send messages to the LLM and receive responses, (3) an addMessageToChatbox function to append new messages to the dialog boxes, and (4) a DOMContentLoaded event listener to display the assistant's initial introductory message.

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Chatbot</title>
    <link rel="stylesheet" href="style.css">
    <script src="https://cdn.jsdelivr.net/npm/showdown/dist/showdown.min.js"></script>
</head>
<body>
    <!-- Element IDs match those used in script.js; the button triggers sendMessage() -->
    <div id="chatbox"></div>
    <input type="text" id="user_input" placeholder="Type your message...">
    <button onclick="sendMessage()">Send</button>
    <script src="script.js"></script>
</body>
</html>

function getUserId() {
    let userId = localStorage.getItem('user_id');
    if (!userId) {
        userId = 'uid' + Math.random().toString(36).substr(2, 9); // Generate a simple UID
        localStorage.setItem('user_id', userId);
    }
    return userId;
}

function sendMessage() {
    var userInput = document.getElementById("user_input");
    var message = userInput.value.trim();
    var userId = getUserId(); // Retrieve user ID when needed

    if (message === "") {
        return;
    }

    addMessageToChatbox(message, "user-message");

    fetch('/chat', {
        method: 'POST',
        body: JSON.stringify({ 'user_id': userId, 'message': message}),
        headers: {
            'Content-Type': 'application/json'
        }
    }).then(response => response.json())
    .then(data => {
        addMessageToChatbox(data.response, "system-message");
    }).catch(error => {
        console.error('Error:', error);
    });

    userInput.value = "";
    userInput.focus();
}

function addMessageToChatbox(message, className) {
    var chatbox = document.getElementById("chatbox");
    var msgDiv = document.createElement("div");
    msgDiv.className = "message " + className;
    var converter = new showdown.Converter();
    var html = converter.makeHtml(message);
    msgDiv.innerHTML = html;
    chatbox.appendChild(msgDiv);
    chatbox.scrollTop = chatbox.scrollHeight;
}

document.addEventListener('DOMContentLoaded', function() {
    addMessageToChatbox("Hello! I am your local LLM assistant. How can I help you today?", "system-message");
});

4. Deployment: proxy server using Ngrok

To expose the LLM application to the Internet via a public URL, typical approaches include port forwarding, virtual private networks, and reverse proxy servers. The most convenient is a reverse proxy, which forwards user requests to the local back-end server without requiring changes to router or firewall settings. Here we use Ngrok, a popular reverse proxy tool for delivering apps and APIs. Steps for deploying our LLM application are as follows:

  • Register an Ngrok account, install the CLI, and add the authtoken from the Ngrok dashboard:
ngrok config add-authtoken <YOUR_AUTHTOKEN>
  • Put the app online by creating a random public URL and mapping it to the local service at http://localhost:8888:
ngrok http http://localhost:8888
  • Check the public URL from the "Forwarding" info in the CLI and monitor the requests:
Figure 3. Creating a public URL for a local service in Ngrok. Source: by author.
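Because anyone who obtains the public URL can reach the app, an optional hardening step, assuming a recent Ngrok version that supports the flag, is to require basic authentication when opening the tunnel:

ngrok http http://localhost:8888 --basic-auth "user:a-strong-password"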

Finally, after deployment, I access the LLM application via the public URL on my cellphone, as shown in Figure 4.

Figure 4. Web-based LLM app on a cellphone. Source: By author.

Summary

Custom self-hosted LLM applications offer control over performance, privacy, and cost while retaining full customizability. Using Hugging Face's transformers library, Tornado, native JS/HTML/CSS, and Ngrok, this article demonstrated four simple steps to build a Llama 3 chat assistant app, handling each phase from back end to front end ourselves. It presents a minimal solution for developing a personal LLM app demo with high flexibility and extensibility. I look forward to exploring more tools for accelerating LLM app development.
