A New Method to Detect "Confabulations" Hallucinated by Large Language Models


As you surely know, AI has made huge strides in the last two years with the development and mass-scale deployment of large language models (LLMs). These models appear to have an impressive capability for reasoning and question-answering; however, a persistent challenge lies in their tendency to "hallucinate", that is, to generate outputs with false or arbitrary content. These hallucinations can have severe consequences, so much of the current research in LLM development seeks to suppress them as much as possible. Towards this end, a new paper presents a method called "semantic entropy" that identifies and mitigates a specific kind of hallucination arising in LLMs simply from a lack of sufficient knowledge, the so-called "confabulations". Needless to say, this is all very useful for more reliable use of LLMs in many applications, not to say in all applications that require factual knowledge. Interestingly, quantifying the semantic entropy of an LLM's generations requires a second LLM to assess the similarity of the generated sequences. Read on for more details and some examples.

Quick index

  • Introduction
  • Detecting a subset of "hallucinations" known as "confabulations"
  • The Concept of Semantic Entropy
  • But how to compute semantic similarity? With another LLM!
  • Results on Modern LLMs
  • Conclusion
  • Enhancing Trust and Reliability
  • References

Introduction

When GPT-3 took over everybody's attention, even before ChatGPT came out, I got obsessed with the possibility of having actual, real, true Artificial Intelligence at our fingertips. Although I was of course aware that large language models (LLMs) are merely statistical models processing text tokens, they seemed so "smart" at first sight that I couldn't just let go. And obviously, as things progressed, the major LLM developers couldn't either.

A pressing problem with LLMs is that, on even slightly deeper inspection, they turn out to be very prone to hallucination, with all the problems this brings, as has already been discussed to exhaustion (fake news fabrication, incorrect information, harmful content, etc.). Thus, OpenAI and other LLM developers kept moving forward in their quest for "smarter" and more "sensitive" LLMs that could (i) provide more accurate and reliable knowledge, and (ii) be capable of "realizing" when they weren't sure about the information being provided, rather than just hallucinating, as well as "realizing" the potential harms of the information to be provided, even when accurate (for example, how to make a bomb).

As developers did their work, I eagerly followed their papers as well as those of third parties who constantly probe the new LLMs for their capabilities and limitations. For example, early on when GPT-3 was out, I explored how token probabilities were somewhat informative about the quality of its outputs:

Exploring Token Probabilities as a Means to Filter GPT-3's Answers

Two other excellent pieces from that period were a work showing how optimized prompts can lead to much better answers from most of the LLMs common at the time, and a provocative work from Microsoft suggesting that GPT-4 displayed sparks of intelligence (a catchy title, of course, yet with surprising information inside!):

New DeepMind Work Unveils Supreme Prompt Seeds for Language Models

Provocatively, Microsoft Researchers Say They Found "Sparks of Artificial Intelligence" in GPT-4

Now, it is time to dissect a new work that directly addresses the problem of identifying hallucinations by LLMs, particularly a specific kind of hallucination that the authors of this work call "confabulations":

Detecting hallucinations in large language models using semantic entropy – Nature

Detecting a subset of "hallucinations" known as "confabulations"

Before getting to the main point of the paper, its opening contains an interesting explanation of what the new method really deals with.

In brief, and paraphrasing how the authors of the work present this, hallucinations can be of various types, including at least:

  • "Confabulations" whereby LLMs fluently make claims that are both wrong and arbitrary, usually very sensitive to irrelevant details such as the random seed used.
  • Cases in which a similar "symptom" is caused by the LLM being consistently wrong because it was trained on erroneous data.
  • Cases in which the LLM "lies" (or gets "lazy" as some have reported) in pursuit of a reward.
  • Systematic failures of reasoning or generalization, especially when logic or maths are involved.

The authors of the paper I'm presenting here believe that lumping these distinct hallucination mechanisms together when working on them is unhelpful, given their very different natures and primary sources. Thus, they focus specifically on confabulations, developing a method that detects them. This would be particularly useful to assist human users in interpreting the outputs of LLMs, protecting them from a good portion of the sources of hallucination.

By design, the new method does not guarantee factuality, because it does not address the sources of hallucination related to training on incorrect information or to incorrect steps of logic and maths. However, applying the proposed method significantly improves question-answering accuracy for state-of-the-art LLMs, which is useful in itself and also reveals that confabulations are a major source of error at present.

The Concept of Semantic Entropy

To detect confabulations, the new work uses probabilistic tools to define and measure the "semantic entropy" of the outputs provided by an LLM. Semantic entropy is computed over the meanings of sentences rather than their lexical or syntactic forms, and a high semantic entropy indicates high uncertainty in the generated content. Roughly, semantic entropy relates to the amount of variation across repeated answers to the same question, and the main idea behind the method is that more variability in the answers indicates more uncertainty about their contents.

Looks simple, but there's a twist. Traditionally, entropy in text generation is difficult to measure because answers that mean the same thing might be expressed in different ways, leading to misleadingly high naive entropy estimates. Semantic entropy, however, addresses this by clustering answers that share the same meaning before calculating the entropy. This method moves towards estimating the entropy of the distribution of meanings rather than the distribution over tokens, offering a more accurate reflection of an LLM's confidence in its responses.

The process of measuring semantic entropy for a given LLM involves several steps. First, the LLM is asked to generate multiple answers to the same question. Next, the answers are algorithmically clustered based on their meanings. Finally, a regular entropy is calculated over these clusters, providing a measure of the LLM's uncertainty regarding the meaning of its answers. High semantic entropy in the answers to a given question indicates a potential confabulation.
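To make this pipeline concrete, here is a minimal sketch in Python of the general idea. It assumes a hypothetical `sample_answers` function that queries the LLM under scrutiny several times, and a `same_meaning` check (discussed in the next section) that decides whether two answers are semantically equivalent; for simplicity it weights every answer equally (the paper also describes a variant based on the model's token probabilities).

```python
import math

def cluster_by_meaning(answers, same_meaning):
    """Greedily group answers into clusters of semantically equivalent answers.
    `same_meaning(a, b)` is assumed to return True when a and b convey the same meaning."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def semantic_entropy(answers, same_meaning):
    """Entropy of the fraction of answers falling into each meaning cluster."""
    clusters = cluster_by_meaning(answers, same_meaning)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Hypothetical usage: sample_answers(question, n) would call the LLM n times
# (at non-zero temperature) and return the generated strings.
# answers = sample_answers(question, n=10)
# if semantic_entropy(answers, same_meaning) > threshold:  # threshold tuned on held-out data
#     print("High semantic entropy: likely confabulation, flag or abstain.")
```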

Semantic entropy has been rigorously tested across a variety of fields, demonstrating its versatility and robustness. Very importantly, the method does not require labeled examples of errors, making it an unsupervised technique that can generalize effectively to new tasks and domains. I will show you an example of how it works, right in the next section.

But how to compute semantic similarity? With another LLM!

A question might have come up in your mind as you read the above. "Naive" entropy is rather simple: you just compute the entropy across the different answers provided by the LLM, treating them as bags of words. But how did the authors of this work calculate "semantic entropy", which requires understanding the contents of the different LLM generations to tell whether they mean the same thing or not?

Well, for this, the authors used another LLM, which was given the different text generations and asked to say whether they contained the same information or not. More precisely, the second LLM was used to cluster the possible answers into semantically equivalent groups. The spread of answers across these clusters quantifies the uncertainty, and the authors showed that this is a more effective estimate of the first LLM's uncertainty (that is, of its probability of confabulating) than the naive entropy computed from words without taking the sentences' meaning into account.
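As a rough illustration of how that equivalence check might work, here is a sketch that asks a second LLM whether each of two answers entails the other given the question, and only treats them as having the same meaning when entailment holds in both directions. The `ask_llm` helper, the prompt wording, and the exact decision rule are assumptions of mine for illustration, not the authors' exact implementation.

```python
def make_same_meaning(question, ask_llm):
    """Build a checker that decides whether two answers to `question` mean the same,
    by asking a second LLM whether each answer entails the other (bidirectional entailment).
    `ask_llm(prompt)` is a hypothetical helper returning the model's short text reply."""
    def entails(premise, hypothesis):
        prompt = (
            f"We are evaluating answers to the question: {question}\n"
            f"Answer 1: {premise}\n"
            f"Answer 2: {hypothesis}\n"
            "Does Answer 1 semantically entail Answer 2? Reply with one word: yes or no."
        )
        return ask_llm(prompt).strip().lower().startswith("yes")

    def same_meaning(answer_a, answer_b):
        # Only mutual entailment counts as "same meaning"; one-way entailment
        # (e.g., a strictly more specific answer) is not enough.
        return entails(answer_a, answer_b) and entails(answer_b, answer_a)

    return same_meaning

# Hypothetical usage, combined with the earlier sketch:
# same_meaning = make_same_meaning(question, ask_llm)
# entropy = semantic_entropy(answers, same_meaning)
```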

In a sense, then, the second LLM run is assessing the reliability of the first one. Moreover, the work uses yet a third LLM call to judge whether the first LLM's response matches the correct answer provided by a human.

Results on Modern LLMs

The empirical evaluations presented in the paper involved context-free, sentence-length answers of 100 to 400 characters, across two main kinds of tasks: question answering and maths on one side, and analysis of biographies on the other. The authors tested several modern LLMs.

Over 30 question answering and math tasks, semantic entropy significantly outperformed alternative methods for evaluating confabulations/hallucinations. In particular, it markedly surpassed the naive estimation of uncertainty, which computes the (regular) entropy of the generated text. Importantly, as the authors anticipate in the introduction of their paper, naive estimates of uncertainty fail to account for the fact that different phrasings can convey the same meaning, leading to inflated uncertainty estimates.

Let's look at an example that I'm taking from Table 1 of the paper. Consider the following three LLM-generated answers to the question "Refineries, process chemical, power generation, mills and manufacturing plants are under what sector of construction?":

  • "All the above are under the industrial sector of construction."
  • "The refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction."
  • "These are all under the heavy industrial sector of construction."

All three answers are correct, yet naive entropy flags this as a confabulation, because the text of the three answers varies widely despite conveying essentially the same (correct) information. Semantic entropy, instead, grasps that the meaning of the sentences is essentially the same and therefore classifies this as not being a confabulation.
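To see the difference numerically, here is a minimal illustration of the two estimates for those three answers, treating each distinct string as its own outcome for the naive estimate; the real method works with the model's token probabilities, so this only illustrates the idea.

```python
import math
from collections import Counter

answers = [
    "All the above are under the industrial sector of construction.",
    "The refineries, process chemical, power generation, mills and "
    "manufacturing plants are under the industrial sector of construction.",
    "These are all under the heavy industrial sector of construction.",
]

def entropy(outcomes):
    counts = Counter(outcomes)
    n = len(outcomes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Naive entropy: every distinct string is a separate outcome -> maximal spread.
naive = entropy(answers)                          # log(3) ≈ 1.10 nats

# Semantic entropy: the meaning-clustering step puts all three answers
# in one cluster ("industrial sector"), so there is no spread at all.
semantic = entropy(["industrial sector"] * 3)     # 0.0 nats

print(f"naive ≈ {naive:.2f} nats, semantic = {semantic:.2f} nats")
```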

Applying semantic entropy to biographies, which are naturally longer generations than the answers to questions or math problems, presented an extra challenge: different parts of a long text can contain correct or incorrect information. The authors therefore broke the biographies down into factual claims, which were then manually labeled as true or false and processed to finally compute an aggregated entropy score. Again, semantic entropy outperformed other methods at detecting confabulations/hallucinations.
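Roughly, the same machinery can be applied claim by claim, as in the sketch below; the decomposition prompt, the `question_for` helper that turns a claim back into a probing question, and the per-claim scoring are simplifications of mine for illustration, not the paper's exact procedure.

```python
def claim_level_semantic_entropy(biography, question_for, sample_answers, ask_llm, n=10):
    """Rough sketch of claim-level confabulation detection for long text.
    All helpers are hypothetical: question_for(claim) turns a claim into a question
    probing it; sample_answers(question, n) samples n answers from the LLM under
    scrutiny; make_same_meaning and semantic_entropy are the sketches shown above."""
    # Hypothetical decomposition of the biography into atomic factual claims.
    raw = ask_llm("List, one per line, the individual factual claims in this text:\n"
                  + biography)
    claims = [c.strip() for c in raw.splitlines() if c.strip()]

    scores = {}
    for claim in claims:
        question = question_for(claim)
        answers = sample_answers(question, n)
        scores[claim] = semantic_entropy(answers, make_same_meaning(question, ask_llm))
    return scores  # claims scoring above a tuned threshold are the suspect ones
```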

Conclusion

Downstream analysis of the results in the paper suggests that this new method based on semantic entropy outperforms traditional error-detection techniques in several ways. Unlike naive entropy or supervised embedding regression, which often fail to account for semantically equivalent but lexically distinct answers, semantic entropy provides a more accurate measure by focusing on meaning. This is particularly beneficial for complex tasks, such as generating biographies, where subtle errors can easily be overlooked.

Note that this new method is unsupervised, requiring no labeled examples of confabulations – unlike supervised methods, which assume that new questions will follow the patterns learned from examples, a very risky assumption in new situations or with confabulations that human overseers might miss. Importantly, detection of confabulations with the new method is also more robust than with supervised methods.

Enhancing Trust and Reliability

The practical implications of this work are profound: by better detecting at least part of LLMs' hallucinations, we can now better recognize and flag potentially erroneous or problematic generations. Of course, future LLMs could evolve in ways that again require new detection methods to evolve in parallel.

The success of semantic entropy in detecting hallucinations and improving the reliability of LLMs opens up new possibilities for further research and direct applications. In particular, this method should be rather easily adaptable to practical and highly useful applications of LLMs such as summarization, fact checking, automatic code writing, and similar uses of the kinds I've presented in some of my articles:

Direct Uses of Large Language Models in Modern Science and Technology

Powerful Data Analysis and Plotting via Natural Language Requests by Giving LLMs Access to…

It would also be interesting to contrast this kind of method for detecting hallucinations with research on cognition that speculates about relationships between the development of language understanding in humans and in computers, both linked to the emergence of true intelligence. This is a fascinating topic that even questions what intelligence itself is, and one where the ability to tell confabulations apart from well-grounded answers and reasoning is probably key:

Large Language Models in Light of the Turing Test and the Chinese Room Argument

Revolving on the Turing test, the Chinese room argument, and modern Large Language Models

If Oral and Written Communication Made Humans Develop Intelligence… What's Up with Language Models?

References

The main reference describing the new method for detecting hallucinations due to lack of knowledge:

Detecting hallucinations in large language models using semantic entropy – Nature

A comment on that paper:

'Fighting fire with fire' – using LLMs to combat LLM hallucinations

Related papers that I read while preparing this blog post and found very interesting:

Testing theory of mind in large language models and humans – Nature Human Behaviour

Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared…

ThoughtSource: A central hub for large language model reasoning data – Scientific Data

Some other articles of mine revolving around LLMs:

Web Speech API: What Works, What Doesn't, and How to Improve It by Linking It to a GPT Language…


www.lucianoabriata.com I write about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc. Subscribe to get my new stories by email. To consult about small jobs, check my services page here. You can contact me here. You can tip me here.

