Are Prompts Generated by Large Language Models (LLMs) Reliable?

The rapid development of large language models (LLMs), including ChatGPT and GPT-4, has revolutionized data science. In the past, data scientists typically devoted a substantial amount of time to preparing data, designing models, and fine-tuning them to solve various problems. Nowadays, with the advent of LLMs, we can accomplish many tasks in a purely data-centric manner without spending any effort on modeling (see the data-centric AI framework).
One key idea driving this advancement is prompting, which refers to the use of specific input text or questions to guide a language model toward a desired output. For instance, when summarizing a lengthy article, we can provide the LLM with a prompt such as "Summarize the above in one sentence" along with the article text. This enables the LLM to generate a concise summary of the article, making it easier for researchers to extract relevant information quickly. The use of prompts has opened up new opportunities in data science, enabling scientists to streamline their workflows and increase their productivity.
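Below is a minimal sketch of this kind of prompting. It assumes the OpenAI Python client purely for illustration; the article itself is not tied to any particular API, and the model name is a placeholder.

```python
# Minimal prompting sketch, assuming the OpenAI Python client (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

article_text = "..."  # the lengthy article to be summarized

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; any capable chat model works
    messages=[
        {"role": "user",
         "content": f"{article_text}\n\nSummarize the above in one sentence."},
    ],
)
print(response.choices[0].message.content)  # the one-sentence summary
```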
Creating effective prompts remains a significant challenge, as even prompts that seem similar can produce vastly different outputs. For example, using "Write a brief summary" or "Provide a concise summary" may lead to substantially different summaries, as illustrated in Figure 1. This variation in output can make it difficult for data scientists to determine which prompt to use to achieve the desired results.
To address the challenge of creating effective prompts, automated prompting offers a viable solution: the LLM itself is used to generate prompt templates directly. For instance, when summarizing clinical notes, one can ask an LLM for prompt suggestions by posing the question "What would be an effective prompt for summarizing clinical notes?" The model can then generate a variety of prompt candidates tailored to this specific task, potentially accelerating the process of effective prompt creation.
However, LLM-generated prompts are often unpredictable in quality, resulting in outputs that exhibit significant variability. This, in turn, requires considerable manual effort to examine each candidate prompt individually. In this article, we introduce a framework named SPeC that makes LLM-generated prompts more effective and reliable. SPeC exploits soft prompt tokens to calibrate performance variability while preserving the performance gain brought by LLM-generated prompts, resulting in notably more consistent outputs.
Prompt Tuning in LLMs

Prompt tuning is a data science advance that follows the concept of data-centric AI. Beyond collecting more training data, prompt tuning offers an alternative way to improve the performance of LLMs without any further fine-tuning. Notably, effective prompts are a critical factor in the success of prompt tuning: the specific input words trigger the corresponding information learned by LLMs, leading to a significant improvement in their adaptation and performance on specific downstream tasks. Data scientists and researchers can benefit greatly from this approach, as it enables them to utilize LLMs efficiently and effectively in various downstream tasks. It has also been advocated by Jeff Dean, who leads Google Research.
How to Automatically Generate Prompts?
Designing an effective prompt is never a trivial task, as considerable domain-specific expertise is still required to select the keywords and sentences that form the prompt. The advent of powerful LLMs has made it possible for users to increase their productivity on designated tasks by taking advantage of automatically generated prompts. When users input a question into an LLM, it can generate corresponding prompt templates. For instance, a data scientist could ask ChatGPT for guidance on a good prompt for text summarization and then use the resulting feedback to summarize text. This approach can significantly streamline workflows, saving users considerable time and effort.
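The sketch below illustrates this two-step workflow: first ask the model for candidate prompts, then apply one of them to an actual document. It again assumes the OpenAI Python client for illustration; the model name and the simple line-based parsing are assumptions, not part of any particular method.

```python
# Sketch: ask an LLM for candidate prompts, then apply one of them.
# Assumes the OpenAI Python client; model name and parsing are illustrative.
from openai import OpenAI

client = OpenAI()

# Step 1: ask the LLM to propose prompt templates for the task.
suggestion = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "What would be an effective prompt for summarizing clinical notes? "
                   "List three candidate prompts, one per line.",
    }],
)
candidate_prompts = [
    line.lstrip("-0123456789. ").strip()
    for line in suggestion.choices[0].message.content.splitlines()
    if line.strip()
]

# Step 2: use one of the generated prompts on an actual (de-identified) note.
clinical_note = "..."  # the note text to summarize
summary = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"{candidate_prompts[0]}\n\n{clinical_note}"}],
)
print(summary.choices[0].message.content)
```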
Are Automatically Generated Prompts Reliable?
However, the quality of prompts generated by LLMs can be highly unpredictable, which in turn leads to a significant increase in the performance variance of LLMs. Even when prompts are semantically similar, they can produce vastly different outputs. For instance, as demonstrated in Figure 1, prompt-1 and prompt-2, both generated by a frozen LLM and highly similar to each other, resulted in entirely different summaries. This issue is particularly problematic in high-stakes domains, such as the financial and healthcare industries, where the variance in generated prompts can erode trust in LLMs' results among researchers and engineers. Therefore, it is critical to find ways to control the quality of prompts generated by LLMs to ensure the reliability of their outputs, especially in such domains.
Can We Trust the Results from the Generated Prompts?
In reality, the answer is a clear no. The uncertainty that frequently arises in LLMs is a significant issue for scientists who need to trust the output produced by these models. If significant uncertainty also occurs in LLM-generated prompts, it can considerably erode scientists' confidence in the results. Therefore, it is essential to have a mechanism in place that reduces the output variance caused by the quality of these auto-generated prompts, so that LLMs work more reliably.
A Soft Prompt-Based Calibration on LLMs

Motivated by prompt tuning and data-centric AI concepts, the Soft Prompt-Based Calibration (SPeC) framework, as depicted in Figure 3, reduces the outcome variance across different prompts. SPeC exploits soft prompt tokens to calibrate performance variability while preserving the performance gain brought by LLM-generated prompts. The soft prompt tokens can be any sentence that is semantically related to the input text. For example, "radiologist describes the stable abnormality in the exam" can serve as good soft prompt tokens for clinical note summarization. In this way, given a well-trained soft prompt encoder, adding soft prompt tokens to the input text yields stable inference outcomes from LLMs. For instance, medical doctors can easily provide appropriate soft prompt tokens, using relevant keywords or terms, to get the desired outcomes consistently.
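To make the idea concrete, here is a minimal sketch of how soft prompt tokens can be wired into a frozen LLM. This is not the authors' SPeC implementation: it uses google/flan-t5-small from Hugging Face transformers as a stand-in frozen backbone, and a single untrained linear layer as a placeholder for the trained soft prompt encoder, purely to show where the soft prompt embeddings enter the pipeline.

```python
# Minimal soft-prompt sketch (NOT the authors' SPeC code).
# Assumes Hugging Face transformers; flan-t5-small stands in for a frozen LLM,
# and an untrained linear layer stands in for the trained soft prompt encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()
for p in model.parameters():          # the backbone LLM stays frozen
    p.requires_grad = False

embed_dim = model.get_input_embeddings().embedding_dim
# Placeholder soft prompt encoder; in SPeC this component would be trained.
soft_prompt_encoder = torch.nn.Linear(embed_dim, embed_dim)

def summarize_with_soft_prompt(soft_prompt_text: str, instruction: str, note: str) -> str:
    """Prepend encoded soft prompt embeddings to the instruction + note embeddings."""
    soft_ids = tokenizer(soft_prompt_text, return_tensors="pt").input_ids
    soft_embeds = soft_prompt_encoder(model.get_input_embeddings()(soft_ids))

    text_ids = tokenizer(f"{instruction}\n{note}", return_tensors="pt").input_ids
    text_embeds = model.get_input_embeddings()(text_ids)

    inputs_embeds = torch.cat([soft_embeds, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    with torch.no_grad():
        output_ids = model.generate(inputs_embeds=inputs_embeds,
                                    attention_mask=attention_mask,
                                    max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(summarize_with_soft_prompt(
    soft_prompt_text="radiologist describes the stable abnormality in the exam",
    instruction="Summarize the clinical note in one sentence.",
    note="CT chest shows a stable 4 mm nodule in the right upper lobe ...",
))
```

Because the backbone stays frozen, only the small soft prompt encoder needs training, which is how the calibration can damp prompt-induced variance while preserving the gains from LLM-generated prompts.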
Experimental Analytics on Clinical Note Summarization
The SPeC framework is evaluated on an important healthcare task: clinical note summarization for medical doctors. In this work, the LLM-generated prompts are collected by asking ChatGPT the question, "What is a good prompt for clinical note summarization?"
SPeC effectively guides frozen pre-trained LLMs to perform clinical note summarization with less variability. It maintains the performance improvements gained from using prompts generated by ChatGPT while reducing variance in performance, so that the resulting clinical summaries are more accurate and faithful to the original data.
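A simple way to quantify this kind of variability is to score the summaries produced under each candidate prompt and look at the spread of the scores. The sketch below does this with ROUGE-L; the rouge_score package is assumed, and summarize is a hypothetical helper that runs the frozen LLM on a (prompt, note) pair, with or without SPeC soft prompt tokens.

```python
# Sketch: quantify how much summaries vary across candidate prompts.
# Assumes the `rouge_score` package; `summarize` is a hypothetical helper
# that runs a frozen LLM on (prompt, note), with or without SPeC.
import statistics
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(reference: str, prediction: str) -> float:
    """ROUGE-L F1 between a reference summary and a generated one."""
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def prompt_variability(candidate_prompts, note, reference, summarize):
    """Mean and standard deviation of ROUGE-L across candidate prompts."""
    scores = [rouge_l(reference, summarize(prompt, note))
              for prompt in candidate_prompts]
    return statistics.mean(scores), statistics.stdev(scores)

# A comparable mean with a smaller standard deviation under SPeC indicates
# that the performance gain is preserved while the variability is reduced.
```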
A case study further demonstrates SPeC's effectiveness in maintaining consistent summarization performance with frozen pre-trained LLMs, highlighting the incorrect outcomes (marked in red) that can occur when SPeC is not used. The results are displayed in Figure 4.

How Can SPeC Framework Be Used in Daily Workflow?
In the era of data-centric AI, LLMs have the potential to revolutionize data science by providing fast and accurate analysis with prompt tuning techniques, leading to more efficient and effective workflows. However, several concerns about the uncertainty of LLMs' outputs have been raised, especially in situations where critical and urgent decisions need to be made. It is important to address these concerns to ensure that LLMs can be effectively integrated into AI systems.
The SPeC framework mitigates the uncertainty concerns that scientists raise when using LLMs, increasing their willingness to trust the decisions made by these models. For biomedical data scientists, for example, SPeC's ability to provide dependable and consistent medical information summaries has the potential to empower healthcare practitioners to make informed decisions for optimal patient care.
Resources
You can learn more about how SPeC helps the healthcare industry and increases the willingness of healthcare experts to trust the decisions made by LLMs in the following papers:
- [1] SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization
- [2] Data-centric Artificial Intelligence: A Survey
- [3] Awesome Data-centric AI
If you are interested in applying SPeC to different downstream tasks, more instructions can be found in the GitHub repository.