How I Leveraged Open Source LLMs to Achieve Massive Savings on a Large Compute Project
Introduction
In the world of large language models (LLMs), the cost of computation can be a significant barrier, especially for extensive projects. I recently embarked on a project that required running 4,000,000 prompts with an average input length of 1,000 tokens and an average output length of 200 tokens. That's nearly 5 billion tokens! The traditional approach of paying per token, as is common with models like GPT-3.5 and GPT-4, would have resulted in a hefty bill. However, I discovered that by leveraging open-source LLMs, I could shift the pricing model to paying per hour of compute time, leading to substantial savings. This article details the approaches I took and compares and contrasts each of them. (A note on pricing: the figures I quote are from my own experience, are subject to change, and may vary by region; the key takeaway is the savings potential, not the specific prices.)
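That token count is easy to sanity-check with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope token count for the full job.
num_prompts = 4_000_000
avg_input_tokens = 1_000
avg_output_tokens = 200

total_tokens = num_prompts * (avg_input_tokens + avg_output_tokens)
print(f"{total_tokens:,}")  # 4,800,000,000 -- nearly 5 billion tokens
```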
ChatGPT API
I conducted an initial test using GPT-3.5 and GPT-4 on a small subset of my prompt input data. Both models performed commendably, but GPT-4 consistently outperformed GPT-3.5 in the majority of cases. To give you a sense of the cost, running all 4 million prompts through the OpenAI API would look something like this:
Model            Estimated cost for 4M prompts
GPT-4            $167,810.00
GPT-3.5 Turbo    $7,410.42
While GPT-4 did offer a performance edge, its cost was disproportionately high relative to the incremental quality it added to my outputs. Conversely, GPT-3.5 Turbo, although far more affordable, fell short on quality, making noticeable errors on 2–3% of my prompt inputs. Given these factors, I wasn't prepared to invest roughly $7,600 in what was essentially a personal project.
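To show where figures in that range come from, here is a rough estimate using OpenAI's mid-2023 per-1K-token list prices. The rates have since changed, and this simple calculation lands close to, but not exactly on, the figures above; treat it as illustrative:

```python
# Rough API cost estimate for 4M prompts (1,000 input + 200 output tokens
# each), using OpenAI's mid-2023 per-1K-token list prices. Rates change
# often, so treat these numbers as illustrative.
num_prompts = 4_000_000
input_tokens, output_tokens = 1_000, 200

rates = {  # model: (input $/1K tokens, output $/1K tokens)
    "gpt-4-8k": (0.03, 0.06),
    "gpt-3.5-turbo": (0.0015, 0.002),
}

costs = {}
for model, (rate_in, rate_out) in rates.items():
    per_prompt = (input_tokens * rate_in + output_tokens * rate_out) / 1000
    costs[model] = round(num_prompts * per_prompt, 2)

# costs["gpt-4-8k"] -> 168000.0, costs["gpt-3.5-turbo"] -> 7600.0
```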
Open-Source Models
Open-source models to the rescue! With open-source models, the pricing model is entirely different: you pay per hour of compute time, so your goal becomes squeezing as many iterations as possible out of each hour. And with solutions like Petals.ml, you can even run your compute for free (with limitations)!
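To make the pay-per-hour trade-off concrete, here is a small break-even sketch. The GPU rental rate below is a hypothetical placeholder, and the API price reuses the mid-2023 GPT-3.5 Turbo rate for a 1,000-in/200-out prompt:

```python
# Break-even sketch for pay-per-hour vs pay-per-token.
# The rental rate is a hypothetical placeholder; benchmark real servers
# to get your own numbers.
gpu_cost_per_hour = 0.50        # assumed $/hour for a rented GPU
api_cost_per_prompt = 0.0019    # GPT-3.5 Turbo, mid-2023 rates

# Above this throughput, the rented GPU is cheaper than the API.
break_even_prompts_per_hour = gpu_cost_per_hour / api_cost_per_prompt
# -> roughly 263 prompts/hour, i.e. well under one prompt per second
```

Any modern GPU running a 7B model clears that bar easily, which is why the economics tilt so heavily toward rentals at this scale.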
After trying various models on Hugging Face, I found that a fine-tuned version of Llama 2 70B called Stable Beluga 2 gave me excellent performance with no additional fine-tuning needed. It performed better than GPT-3.5 Turbo, though slightly below GPT-4. However, this model is still very large and requires a very beefy GPU, and it's best practice to use the smallest model that can complete your task efficiently. I therefore tried the 7B variant of Stable Beluga, but its performance was not good enough.
Fine-tuning
To boost performance on my task, I generated a fine-tuning dataset: 25,000 prompt-completion pairs using Stable Beluga 2 70B via Petals.ml, plus another 2,000 using the GPT-4 API. As a reminder, Petals.ml lets you run open-source LLMs for free over a BitTorrent-style distributed network. Its inference speed is poor (running all 4 million prompts through it would have taken over a year), but it was perfectly adequate for producing training data. In hindsight, I probably overdid it on the training data; I've read reports that as few as 500 fine-tuning samples can be enough to improve an LLM's performance. Anyway, here is the total cost of running the 2,000 prompts through the OpenAI API:
Model    Prompts    Cost
GPT-4    2,000      $84.00
That's right: $84. Can you guess what happened next? Using these newly acquired 27K prompt-completion pairs, I fine-tuned the smallest Llama 2 variant (7B), and voila, the new model performed extremely well for my use case. The total fine-tuning cost was only $6.60, as I rented an A100 GPU for six hours. In the end, my fine-tuned model performed better than GPT-3.5 and slightly worse than GPT-4.
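For reference, most fine-tuning pipelines consume prompt-completion pairs as a JSONL file, one JSON object per line. Here is a minimal sketch of assembling such a file; the field names and toy examples are illustrative, so match whatever schema your training framework expects:

```python
import json
from pathlib import Path

# Sketch: write prompt-completion pairs to a JSONL fine-tuning file.
# Field names and examples are illustrative placeholders.
pairs = [
    {"prompt": "Classify the sentiment: 'Great product!'", "completion": "positive"},
    {"prompt": "Classify the sentiment: 'Never again.'", "completion": "negative"},
]

path = Path("finetune_data.jsonl")
with path.open("w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")  # one JSON object per line

# Round-trip to verify the file parses cleanly.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
```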
Inference
Now let's talk about inference compute costs. My model requires at least 20GB of VRAM, so we need at least an RTX 3090. There are four great options currently available for GPU rentals: AWS, Lambda Labs, RunPod, and Vast.ai. In my experience, Vast.ai has the best prices but the worst reliability, AWS is super reliable but more expensive than the rest, and RunPod has great prices and a great UI. I wanted to keep my costs as low as possible, so I went with Vast.ai for this project; note that it's a peer-to-peer GPU rental service, so it's likely also the least secure option. After scouring Vast.ai for the best price-to-performance servers available, here are the best options I found:
(Best value found: RTX A5000 servers, totaling roughly 638 GPU-hours of runtime at a total cost of just under $99.)
As shown above, the total cost of inference can be less than $99. In my case, I spun up around 10 servers and measured the number of iterations per second each could sustain. For this particular project, which used a fine-tuned variant of Llama 2 7B, the RTX A5000 was a great option in terms of price to performance. I should also mention that none of this would be possible without the insane speeds of the open-source inference library vLLM; my compute is 100% powered by vLLM under the hood, and without it, the compute time would increase roughly 20-fold. You may look at the total runtime and laugh (638 hours is nearly 27 days), but the task is easily parallelized across multiple GPUs/servers.
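Back-solving from those numbers gives a feel for the throughput involved and for how parallelization changes the picture; the implied hourly rate below is an inference from the quoted totals, not a price I'm quoting directly:

```python
# Back-solving throughput and pricing implied by the figures above:
# 4M prompts in ~638 GPU-hours for just under $99 total.
num_prompts = 4_000_000
total_hours = 638
total_cost = 99.0  # upper bound on the inference bill

prompts_per_second = num_prompts / (total_hours * 3600)  # ~1.74 prompts/s
hourly_rate = total_cost / total_hours                   # ~$0.155/hour implied

# Parallelism: N identical servers divide wall-clock time by N while the
# total GPU-hours (and therefore the bill) stay the same.
servers = 10
wall_clock_days = total_hours / servers / 24             # ~2.7 days
```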
Conclusion
In conclusion, using open-source LLMs and shifting the pricing model from paying per token to paying per hour of compute time resulted in substantial cost savings on this large compute project. The total cost, covering prompt generation, fine-tuning, and inference, was a mere $189.58. That's a stark contrast to the estimated $167,810 for GPT-4 or $7,410.42 for GPT-3.5 Turbo.
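The quoted components also add up cleanly; in the sketch below the first two figures are quoted earlier in the article, while the inference share is inferred as the remainder rather than quoted directly:

```python
# Decomposing the $189.58 total. The first two figures are quoted earlier
# in the article; the inference share is inferred as the remainder.
data_generation = 84.00   # 2K GPT-4 prompt-completion pairs
fine_tuning = 6.60        # 6 hours on a rented A100
total = 189.58

inference = round(total - data_generation - fine_tuning, 2)
# -> 98.98, consistent with "less than $99" for inference
```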
The key takeaway from this project is the significant potential for cost reduction when leveraging open source LLMs. By renting a GPU per hour, the cost is tied to the compute time rather than the number of prompts, which can lead to massive savings, especially for extensive projects. This approach not only makes such projects more financially feasible but also opens up opportunities for individuals and smaller organizations to undertake large compute projects without incurring prohibitive costs.
The success of this project underscores the value and potential of open-source models in the realm of large language models and AI and serves as a testament to the power of innovative, cost-effective solutions in tackling complex computational tasks.
Pricing Note
The pricing information provided in this article is based on my personal experience and is intended to serve as a general comparison. Prices may vary depending on your region and specific circumstances. The key takeaway is the potential cost savings when leveraging open source LLMs and renting a GPU per hour, rather than the specific prices quoted.