What I Learned Pushing Prompt Engineering to the Limit
I spent the past two months building a large-language-model (LLM) powered application. It was an exciting, intellectually stimulating, and at times frustrating experience. My entire conception of Prompt Engineering – and of what is possible with LLMs – changed over the course of the project.
I'd love to share with you some of my biggest takeaways, with the goal of shedding light on some of the often unspoken aspects of prompt engineering. I hope that after reading about my trials and tribulations, you will be able to make more informed prompt engineering decisions. If you've already dabbled in prompt engineering, I hope that this helps you push forward in your own journey!
For context, here is the TL;DR on the project we'll be learning from:
- My team and I built VoxelGPT, an application that combines LLMs with the FiftyOne computer vision query language to enable searching through image and video datasets via natural language. VoxelGPT also answers questions about FiftyOne itself.
- VoxelGPT is open source (so is FiftyOne!). All of the code is available on GitHub.
- You can try VoxelGPT for free at gpt.fiftyone.ai.
- If you're curious how we built VoxelGPT, you can read more about it on TDS here.
Now, I've split the prompt engineering lessons into four categories: general lessons, prompting techniques, examples, and tooling.
General Lessons
Science? Engineering? Black Magic?
Prompt engineering is as much experimentation as it is engineering. There are an infinite number of ways to write a prompt, from the specific wording of your question, to the content and formatting of the context you feed in. It can be overwhelming. I found it easiest to start simple and build up an intuition – and then test out hypotheses.
In computer vision, each dataset has its own schema, label types, and class names. The goal for VoxelGPT was to be able to work with any computer vision dataset, but we started with just a single dataset: MS COCO. Keeping all of the additional degrees of freedom fixed allowed us to home in on the LLM's ability to write syntactically correct queries in the first place.
Once you've determined a formula that is successful in a limited context, then figure out how to generalize and build upon this.
Which Model(s) to Use?
People say that one of the most important characteristics of large language models is that they are relatively interchangeable. In theory, you should be able to swap one LLM out for another without substantially changing the connective tissue.
While it is true that changing the LLM you use is often as simple as swapping out an API call, there are definitely some difficulties that arise in practice.
- Some models have much shorter context lengths than others. Switching to a model with a shorter context can require major refactoring.
- Open source is great, but open source LLMs are not as performant (yet) as GPT models. Plus, if you are deploying an application with an open source LLM, you will need to make sure the container running the model has enough memory and storage. This can end up being more troublesome (and more expensive) than just using API endpoints.
- If you start using GPT-4 and then switch to GPT-3.5 because of cost, you may be shocked by the drop-off in performance. For complicated code generation and inference tasks, GPT-4 is MUCH better.
Where to Use LLMs?
Large language models are powerful. But just because they may be capable of certain tasks doesn't mean you need to – or even should – use them for those tasks. The best way to think about LLMs is as enablers. LLMs are not the WHOLE solution: they are just a part of it. Don't expect large language models to do everything.
As an example, it may be the case that the LLM you are using can (under ideal circumstances) generate properly formatted API calls. But if you know what the structure of the API call should look like, and you are actually interested in filling in sections of the API call (variable names, conditions, etc.), then just use the LLM to do those tasks, and use the (properly post-processed) LLM outputs to generate structured API calls yourself. This will be cheaper, more efficient, and more reliable.
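As a hedged illustration of this pattern (the endpoint, fields, and helper names here are made up, not part of VoxelGPT), the LLM can be asked to return only a small JSON blob of values, while the structured call is assembled deterministically:

```python
import json

# Hypothetical illustration: the LLM returns only the variable parts
# (a small JSON blob of filter values); the structured API call itself
# is assembled deterministically in code.
CALL_TEMPLATE = {"endpoint": "/v1/search", "limit": 25, "filters": None}

def build_api_call(llm_output: str) -> dict:
    filters = json.loads(llm_output)  # e.g. '{"label": "cat", "conf_gt": 0.9}'
    if not isinstance(filters, dict):
        raise ValueError("LLM output was not a JSON object")
    call = dict(CALL_TEMPLATE)
    call["filters"] = filters
    return call
```

This keeps the LLM's job small and verifiable, and the boilerplate of the call never depends on the model getting it right.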
A complete system with LLMs will definitely have a lot of connective tissue and classical logic, plus a slew of traditional software engineering and ML engineering components. Find what works best for your application.
LLMs Are Biased
Language models are both inference engines and knowledge stores. Oftentimes, the knowledge store aspect of an LLM can be of great interest to users – many people use LLMs as search engine replacements! By now, anyone who has used an LLM knows that they are prone to making up fake "facts" – a phenomenon referred to as hallucination.
Sometimes, however, LLMs suffer from the opposite problem: they are too firmly fixated on facts from their training data.
In our case, we were trying to prompt GPT-3.5 to determine the appropriate ViewStages (pipelines of logical operations) required to convert a user's natural language query into a valid FiftyOne Python query. The problem was that GPT-3.5 knew about the Match and FilterLabels ViewStages, which have existed in FiftyOne for some time, but its training data did not include recently added functionality wherein a SortBySimilarity ViewStage can be used to find images that resemble a text prompt.
We tried passing in a definition of SortBySimilarity, details about its usage, and examples. We even tried instructing GPT-3.5 that it MUST NOT use the Match or FilterLabels ViewStages, or else it will be penalized. No matter what we tried, the LLM still oriented itself towards what it knew, whether it was the right choice or not. We were fighting against the LLM's instincts!
We ended up having to deal with this issue in post-processing.
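As a hedged, simplified sketch of the kind of post-processing rule this required (the real logic in VoxelGPT is more involved, and the stage strings here are illustrative), one can detect similarity-style intent in the user's query and swap the offending stage out:

```python
import re

def patch_similarity_stages(user_query: str, stages: list[str]) -> list[str]:
    """Simplified sketch: if the user asked for images resembling a text
    prompt but the LLM fell back on match()/filter_labels(), swap in a
    sort_by_similarity() stage instead."""
    wants_similarity = bool(
        re.search(r"\b(look like|looks like|resemble|similar to)\b", user_query, re.I)
    )
    if not wants_similarity:
        return stages
    patched = []
    for stage in stages:
        if stage.startswith(("match(", "filter_labels(")):
            patched.append(f'sort_by_similarity("{user_query}", k=25)')
        else:
            patched.append(stage)
    return patched
```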
Painful Post-Processing Is Inevitable
No matter how good your examples are; no matter how strict your prompts are – large language models will invariably hallucinate, give you improperly formatted responses, and throw a tantrum when they don't understand input information. The most predictable property of LLMs is the unpredictability of their outputs.
I spent an ungodly amount of time writing routines to pattern match for and correct hallucinated syntax. The post-processing file ended up containing almost 1600 lines of Python code!
Some of these subroutines were as straightforward as adding parentheses, or changing "and" and "or" to "&" and "|" in logical expressions. Some subroutines were far more involved, like validating the names of the entities in the LLM's responses, converting one ViewStage to another if certain conditions were met, and ensuring that the numbers and types of arguments to methods were valid.
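Here is a hedged taste of the simpler end of that spectrum (illustrative helpers, not the actual VoxelGPT code):

```python
import re

def fix_logical_operators(expr: str) -> str:
    """Replace Python-style 'and'/'or' with the '&'/'|' operators that
    FiftyOne ViewExpressions expect (simplified illustration)."""
    expr = re.sub(r"\band\b", "&", expr)
    expr = re.sub(r"\bor\b", "|", expr)
    return expr

def balance_parentheses(expr: str) -> str:
    """Append closing parentheses if the LLM left the expression unbalanced."""
    missing = expr.count("(") - expr.count(")")
    return expr + ")" * max(missing, 0)
```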
If you are using prompt engineering in a somewhat confined code generation context, I'd recommend the following approach:
- Write your own custom error parser using Abstract Syntax Trees (Python's ast module).
- If the results are syntactically invalid, feed the generated error message into your LLM and have it try again.
This approach fails to address the more insidious case where syntax is valid but the results are not right. If anyone has a good suggestion for this (beyond AutoGPT and "show your work" style approaches), please let me know!
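Here is a minimal sketch of that validate-and-retry loop, assuming llm_call is whatever function wraps your model API:

```python
import ast

def python_syntax_error(code: str) -> str | None:
    """Return a parser error message if `code` is not valid Python, else None."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as e:
        return f"SyntaxError on line {e.lineno}: {e.msg}"

def generate_with_retry(prompt: str, llm_call, max_retries: int = 3) -> str:
    """Ask the LLM for code; on a syntax error, feed the error back and retry."""
    response = llm_call(prompt)
    for _ in range(max_retries):
        error = python_syntax_error(response)
        if error is None:
            return response
        retry_prompt = (
            f"{prompt}\n\nYour previous answer failed to parse:\n{error}\n"
            "Return only corrected code."
        )
        response = llm_call(retry_prompt)
    return response  # last attempt, possibly still invalid
```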
Prompting Techniques
The More the Merrier
To build VoxelGPT, I used what seemed like every prompting technique under the sun:
- "You are an expert"
- "Your task is"
- "You MUST"
- "You will be penalized"
- "Here are the rules"
No combination of such phrases will ensure a certain type of behavior. Clever prompting just isn't enough.
That being said, the more of these techniques you employ in a prompt, the more you nudge the LLM in the right direction!
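For a flavor of how these nudges stack, here is an illustrative (and entirely made-up) system prompt skeleton, not the actual VoxelGPT prompt:

```python
# Illustrative only: placeholder wording showing several nudges layered together.
SYSTEM_PROMPT = """You are an expert in the FiftyOne query language.
Your task is to translate the user's request into a valid query.

Here are the rules:
1. You MUST only use ViewStages that appear in the provided examples.
2. You MUST return a single line of Python code and nothing else.
3. You will be penalized for inventing fields that are not in the schema.
"""
```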
Examples > Documentation
It is common knowledge by now (and common sense!) that both examples and other contextual information like documentation can help elicit better responses from a large language model. I found this to be the case for VoxelGPT.
Once you add all of the directly pertinent examples and documentation though, what should you do if you have extra room in the context window? In my experience, I found that tangentially related examples mattered more than tangentially related documentation.
Modularity >> Monolith
The more you can break down an overarching problem into smaller subproblems, the better. Rather than feeding in the entire dataset schema and a list of end-to-end examples in one monolithic prompt, it is much more effective to identify individual selection and inference steps (selection-inference prompting) and feed in only the relevant information at each step.
This is preferable for three reasons:
- LLMs are better at doing one task at a time than multiple tasks at once.
- The smaller the steps, the easier it is to sanitize inputs and outputs.
- It is an important exercise for you as the engineer to understand the logic of your application. The point of LLMs isn't to make the world a black box. It is to enable new workflows.
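To make the selection-inference split concrete, here is a hedged two-step sketch; llm is an assumed callable from prompt text to response text, and the prompt wording is illustrative:

```python
def select_relevant_fields(llm, query: str, schema: dict) -> list[str]:
    """Step 1 (selection): ask the LLM which schema fields the query needs."""
    prompt = (
        f"Given this dataset schema:\n{schema}\n"
        f"List (comma-separated) only the field names needed to answer: {query}"
    )
    return [field.strip() for field in llm(prompt).split(",")]

def generate_query(llm, query: str, relevant_fields: list[str]) -> str:
    """Step 2 (inference): generate the query using only the selected fields."""
    prompt = (
        f"Using only these fields: {relevant_fields}\n"
        f"Write the query for: {query}"
    )
    return llm(prompt)
```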
Examples
How Many Do I Need?
A big part of prompt engineering is figuring out how many examples you need for a given task. This is highly problem specific.
For some tasks (effective query generation and answering questions based on the FiftyOne documentation), we were able to get away without any examples. For others (tag selection, whether or not chat history is relevant, and named entity recognition for label classes) we just needed a few examples to get the job done. Our main inference task, however, has almost 400 examples (and that is still the limiting factor in overall performance), so we only pass in the most relevant examples at inference time.
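A common way to pick the most relevant examples (a sketch, not our exact implementation) is to embed every example and take the nearest neighbors of the user's query; embed here is an assumed text-to-vector function:

```python
import numpy as np

def select_examples(embed, query: str, examples: list[dict], k: int = 10) -> list[dict]:
    """Return the k examples whose prompts are most similar to the query."""
    query_vec = embed(query)
    example_vecs = np.stack([embed(ex["prompt"]) for ex in examples])
    # Cosine similarity between the query and each example prompt
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(-sims)[:k]
    return [examples[i] for i in top]
```

In practice you would precompute and cache the example embeddings rather than embedding them on every call.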
When you are generating examples, try to follow two guidelines:
- Be as comprehensive as possible. If you have a finite space of possibilities, then try to give the LLM at least one example for each case. For VoxelGPT, we tried to have at the very least one example for each syntactically correct way of using each and every ViewStage – and typically a few examples for each, so the LLM can do pattern matching.
- Be as consistent as possible. If you are breaking the task down into multiple subtasks, make sure the examples are consistent from one task to the next. You can reuse examples!
Synthetic Examples
Generating examples is a laborious process, and handcrafted examples can only take you so far. It's just not possible to think of every possible scenario ahead of time. When you deploy your application, you can log user queries and use these to improve your example set.
Prior to deployment, however, your best bet might be to generate synthetic examples.
Here are two approaches to generating synthetic examples that you might find helpful:
- Use an LLM to generate examples. You can ask the LLM to vary its language, or even imitate the style of potential users! This didn't work for us, but I'm convinced it could work for many applications.
- Programmatically generate examples – potentially with randomness – based on elements in the input query itself. For VoxelGPT, this means generating examples based on the fields in the user's dataset. We are in the process of incorporating this into our pipeline, and the results we've seen so far have been promising.
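A hedged sketch of the second approach, with made-up templates and field/class names:

```python
import random

# Made-up templates; real ones would mirror your application's query patterns.
TEMPLATES = [
    "show me samples where {field} is greater than {value}",
    "images with at least {count} {label} detections",
]

def synthesize_examples(fields: list[str], labels: list[str], n: int = 20) -> list[str]:
    """Fill templates with actual field and class names from the user's dataset."""
    examples = []
    for _ in range(n):
        template = random.choice(TEMPLATES)
        examples.append(
            template.format(
                field=random.choice(fields),
                value=round(random.uniform(0, 1), 2),
                count=random.randint(1, 5),
                label=random.choice(labels),
            )
        )
    return examples
```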
Tooling
LangChain
LangChain is popular for a reason: the library makes it easy to connect LLM inputs and outputs in complex ways, abstracting away the gory details. The Models and Prompts modules especially are top notch.
That being said, LangChain is definitely a work in progress: their Memories, Indexes, and Chains modules all have significant limitations. Here are just a few of the issues I encountered when trying to use LangChain:
- Document Loaders and Text Splitters: In LangChain, Document Loaders are supposed to transform data from different file formats into text, and Text Splitters are supposed to split text into semantically meaningful chunks. VoxelGPT answers questions about the FiftyOne documentation by retrieving the most relevant chunks of the docs and piping them into a prompt. In order to generate meaningful answers to questions about the FiftyOne docs, I had to effectively build custom loaders and splitters, because LangChain didn't provide the appropriate flexibility.
- Vectorstores: LangChain offers Vectorstore integrations and Vectorstore-based Retrievers to help find relevant information to incorporate into LLM prompts. This is great in theory, but the implementations are lacking in flexibility. I had to write a custom implementation with ChromaDB in order to pass embedding vectors ahead of time and not have them recomputed every time I ran the application. I also had to write a custom retriever to implement the custom pre-filtering I needed.
- Question Answering with Sources: When building out question answering over the FiftyOne docs, I arrived at a reasonable solution utilizing LangChain's RetrievalQA Chain. When I wanted to add sources in, I thought it would be as straightforward as swapping out that chain for LangChain's RetrievalQAWithSourcesChain. However, bad prompting techniques meant that this chain exhibited some unfortunate behavior, such as hallucinating about Michael Jackson. Once again, I had to take matters into my own hands.
What does all of this mean? It may be easier to just build the components yourself!
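For instance, the precomputed-embeddings issue from the Vectorstores bullet above can be handled directly with the ChromaDB client in a few lines. This is a sketch assuming a recent chromadb Python client, not the exact code we shipped:

```python
import chromadb

# Sketch: store precomputed embeddings so they are not recomputed on every run.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="fiftyone_docs")

def index_chunks(chunks: list[str], embeddings: list[list[float]]) -> None:
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,  # precomputed ahead of time, nothing re-embedded here
    )

def retrieve(query_embedding: list[float], k: int = 5) -> list[str]:
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    return results["documents"][0]
```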
Vector Databases
Vector search may be on