GPTs and the Forehead Detective
Whenever I take long road trips with my family, we like to spend the time playing various spoken word games, especially those that don't require too much focus from me as the driver. One of our favorite games is a character-guessing game. The rules are simple: one person thinks of a character – real or fictional – and the rest of us try to guess who it is. We can only ask closed questions that can be answered with a simple "yes" or "no". The person who thought of the character must respond with just those answers. We take turns asking questions, and the winner is the one who correctly guesses the character first. So it's quite like the "Forehead Detective" game, but without the Post-It notes attached to our heads.
During one of our recent trips, we were playing this game, but after some time we started to run out of ideas for characters to guess. That's when I thought, "Why not let ChatGPT take over this task?" I had the ChatGPT app installed on my phone, so I launched it and asked ChatGPT to think of a character for us to guess. I also instructed it to answer our questions about the character. And so, we started playing with ChatGPT. It came up with a character, and we took turns asking questions. The AI responded to each question with either "yes" or "no". All of its answers seemed logical and guided us closer to identifying the correct character. The ending of the game went like this (we had already learned that the character was a bird from a movie):
- Is the bird the main character in the movie it appears in?
- Yes.
- Is the bird a toy?
- Yes.
- Is it from Toy Story movie?
- Yes.
Here we gave up. We were not aware of any toy bird from Toy Story. I asked the GPT to reveal the character. Its response was:
- I was thinking of the Elephant, a character from "Toy Story." Just kidding! I meant Buzz Lightyear dressed as a chicken in "Toy Story 2" – a cameo "interactive" character with big boots, a hat, and two guns.
That stunned me and my family. I asked it to clarify.
- Sorry for the confusion! I meant a bird named "Wojtek" from "Toy Story 3." However, there isn't such a bird. I think I got lost somewhere along the way with the questions. If you want, we can try again!
That was quite unexpected. My kids got annoyed, but the experience got me thinking. On the one hand, ChatGPT seemed to be responding to our questions very smoothly and effectively. On the other hand, it ended in complete failure when it came to the final result of our game. This led me to reflect that language models might not be as proficient in reasoning and drawing conclusions as we might expect. Perhaps this type of game could serve as a good test to evaluate the quality of reasoning and inference in language models. With just one change: reversing the roles. I would come up with a character to be guessed, and the task of the language model would be to ask the questions that would lead to guessing that character. This way, I could measure how quickly a given language model can figure out the character and also analyze the reasoning process – how the model narrows down the characteristics, the types of questions it asks, whether it gets lost at any point, whether the questions form a coherent sequence, and so on.
Looking for Related Research
So, I decided to take action. However, before implementing my own application to run this kind of evaluation on language models, I first checked whether any scientific studies had already used this type of game to evaluate them. I found two research papers that explored this approach, providing a foundation for understanding how LLMs handle tasks requiring complex reasoning and contextual understanding.
Study #1: Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games
This research from February 2024 explored the potential of using word guessing games, akin to the classic "20 Questions" game, to evaluate the reasoning abilities of several LLMs, including proprietary models (GPT-4, GPT-3.5, Claude-1/2) and open-source models (Vicuna 7B/13B, Mistral 7B).
In this study, the researchers tasked the LLMs with identifying a target word based on a series of yes/no questions. The performance of each model was measured based on its ability to accurately guess the word within a limited number of questions, as well as the relevance and coherence of the questions it asked. The findings highlighted significant differences in performance between open-source models and more advanced proprietary models like GPT-4. Specifically, GPT-4 demonstrated superior contextual understanding and reasoning, often outperforming other models in both accuracy and the logical progression of its questions.

The study concluded that word-guessing games could serve as an effective and accessible method for evaluating the intelligence of LLMs, particularly in tasks that require nuanced reasoning and the ability to infer information from limited data.
Study #2: Introducing the SpyGame Framework: A Multi-Agent Approach to Evaluating LLMs
Building on the concept of using games to assess LLMs, a second study, published in October 2023, introduced a much more complex evaluation framework known as "SpyGame". This study extended the traditional guessing game format by incorporating multiple LLM-based agents into a strategic word game called "Who is Spy?". In this game, LLMs had to not only guess a hidden word but also engage in strategic reasoning to deduce the roles of other agents and conceal their own identity.

The SpyGame framework tested broader cognitive abilities, including context retention, strategic reasoning, and decision-making, by simulating interactions between multiple agents in a competitive environment. The researchers found that while LLMs like GPT-4 performed well in direct word-guessing tasks, adding strategic elements introduced new challenges that exposed limitations in the models' ability to maintain coherence and consistency over extended interactions.
The study's findings underscored the importance of considering not just the reasoning abilities of LLMs in isolation, but also their performance in more dynamic and interactive contexts. The authors suggested that games like SpyGame could provide deeper insights into the cognitive capabilities of LLMs, particularly in scenarios that mimic real-world decision-making processes.
For those interested in a deeper dive into the methodologies and detailed findings of these studies, I encourage you to read the full papers. You can access them directly through the following links:
- [2310.01468] Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games (arxiv.org)
- [2310.20499] Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (arxiv.org)
As these studies were conducted some time ago, several newer, more advanced LLMs have been released in the meantime, including GPT-4o and GPT-4o-mini. I'm eager to see how they compare to GPT-4.
My Own Guessing Game for LLMs
To build my own application for evaluating how various LLMs play a guessing game, I decided to test the GPT models offered by OpenAI and available via their API.
I used the OpenAI Python SDK and quickly drafted a simple app with the main game loop and the communication with the GPTs. Then came the interesting part: prompt engineering. With the first version of the prompts for both the guesser model (the one that asks questions and is being evaluated) and the judge model (the one that answers the guesser's questions), the results were not satisfactory. For example, the guesser model would sometimes ask open-ended or repeated questions, or several questions in a single response. And the judge model tended to confirm a correct guess even when the character's name did not appear in the question. After some experimentation, I ended up with prompts along the lines sketched below.
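To give an idea of their structure, here is a simplified sketch of the two system prompts. The wording below is my illustration of the rules they enforce, not the exact text (that lives in the repo linked at the end of this section):

```python
# Simplified, illustrative system prompts (not the exact wording used in the experiment).

GUESSER_SYSTEM_PROMPT = """
You are playing a character-guessing game. A judge has chosen a secret
character (real or fictional) and your goal is to identify it.

Rules:
- Ask exactly ONE closed (yes/no) question per turn.
- Do not ask open-ended questions and do not repeat a question you already asked.
- Use the previous answers to narrow the character down step by step.
- When you are confident, ask directly: "Is the character <name>?"
"""

JUDGE_SYSTEM_PROMPT = """
You are the judge in a character-guessing game. The secret character is: {character}.

Rules:
- Answer every question with a single word: "Yes" or "No".
- Answer "Bravo" ONLY if the question explicitly contains the character's name
  and the guess is correct.
- Never reveal the character yourself.
"""
```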


Using these prompts, I let the GPTs play the guessing game, with the guesser allowed to ask up to 30 questions. The main game loop stops when the question limit is reached or when the judge model answers "Bravo", indicating that the guesser model has correctly identified the character.
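In code, a single round boils down to a loop like the one below. This is a minimal sketch using the OpenAI Python SDK and the prompts sketched above; the function name, structure, and constants are my simplification, not necessarily what you'll find in the repo:

```python
from openai import OpenAI

client = OpenAI()   # expects OPENAI_API_KEY in the environment
MAX_QUESTIONS = 30  # hard limit on questions per round


def play_round(guesser_model: str, judge_model: str, character: str) -> int | None:
    """Play one round; return the number of questions needed, or None if not guessed."""
    guesser_messages = [{"role": "system", "content": GUESSER_SYSTEM_PROMPT}]
    judge_messages = [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT.format(character=character)}
    ]

    for question_no in range(1, MAX_QUESTIONS + 1):
        # The guesser asks its next yes/no question.
        question = client.chat.completions.create(
            model=guesser_model, messages=guesser_messages
        ).choices[0].message.content
        guesser_messages.append({"role": "assistant", "content": question})

        # The judge answers "Yes", "No", or "Bravo".
        judge_messages.append({"role": "user", "content": question})
        answer = client.chat.completions.create(
            model=judge_model, messages=judge_messages
        ).choices[0].message.content
        judge_messages.append({"role": "assistant", "content": answer})

        if answer.strip().lower().startswith("bravo"):
            return question_no  # character identified correctly

        # Feed the judge's answer back to the guesser for the next turn.
        guesser_messages.append({"role": "user", "content": answer})

    return None  # question limit reached without a correct guess
```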
The application iterates over a predefined list of characters (which I crafted to include some characters that are relatively easy to guess and some that I considered rather hard) and a list of models to be evaluated. Each model gets three attempts at each character.
The results of each game (including all questions and answers) are stored in a local database (I used SQLite for this), which can then be queried for the overall statistics.
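The bookkeeping around the game loop can stay very simple. Below is a sketch of the driver and the SQLite storage; the schema, the shortened character list, and the statistics query are hypothetical and only illustrate the idea:

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS rounds (
        model TEXT,
        character TEXT,
        attempt INTEGER,
        questions_asked INTEGER   -- NULL if the character was not guessed
    )
""")

CHARACTERS = ["Zaphod Beeblebrox", "Wally"]  # shortened example list
MODELS = ["gpt-4o", "gpt-4", "gpt-4-turbo", "gpt-4o-mini", "gpt-3.5-turbo"]
JUDGE_MODEL = "gpt-4o"
ATTEMPTS = 3

for model in MODELS:
    for character in CHARACTERS:
        for attempt in range(1, ATTEMPTS + 1):
            questions = play_round(model, JUDGE_MODEL, character)
            conn.execute(
                "INSERT INTO rounds VALUES (?, ?, ?, ?)",
                (model, character, attempt, questions),
            )
            conn.commit()

# Overall statistics: success rate and average question count per model.
for row in conn.execute("""
    SELECT model,
           100.0 * SUM(questions_asked IS NOT NULL) / COUNT(*) AS success_rate,
           AVG(questions_asked) AS avg_questions
    FROM rounds
    GROUP BY model
    ORDER BY success_rate DESC
"""):
    print(row)
```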
You can find the full source code of the app in this GitHub repo:
GitHub – krzysztof-zdeb/llm-guessing-game: Guessing Game for LLMs
Results of OpenAI GPTs Playing Forehead Detective
OK, so let's see how various GPTs offered by OpenAI performed in the Forehead Detective game.
For the record: I relied on gpt-4o as the judge model due to its high quality and acceptable price tag.
The table below presents the detailed results of the experiment. For each character and GPT model, it shows the results of all three attempts (rounds) to guess the character. A number indicates how many questions it took to correctly guess the character; an "x" denotes that the character was not identified within 30 questions.

The winner is the gpt-4o model, which managed to correctly identify 10 out of 11 characters and was successful in 70% of all rounds. Second place goes to gpt-4, which guessed 9 characters and won 61% of its rounds. gpt-4-turbo completes the podium with 8 characters recognized and 52% of successful rounds.
What's interesting is that none of the models were able to recognize Wally (aka Waldo), the character from the "Where's Wally?" puzzle book series by Martin Handford. Wally, a character defined by visual recognition rather than distinct narrative traits, posed a unique challenge to the models.
On the other hand, all three top models managed to identify Zaphod Beeblebrox (from "The Hitchhiker's Guide to the Galaxy" by Douglas Adams). Here's the shortest sequence of questions, asked by gpt-4, which led to the correct guess.
- Is the character from a fictional universe? Yes
- Is the character human? No
- Is the character from an animated series? No
- Is the character from a book? Yes
- Is the character an animal? No
- Is the character magical or supernatural? No
- Is the character from a science fiction genre? Yes
- Does the character have artificial intelligence? No
- Is the character an alien? Yes
- Is the character from a series written by Douglas Adams? Yes
- Is the character from "The Hitchhiker's Guide to the Galaxy"? Yes
- Is the character green? No
- Is the character known for his bad poetry? No
- Is the character the President of the Galaxy? Yes
- Is the character Zaphod Beeblebrox? Bravo
The sequence effectively narrows down possibilities by filtering out broad categories before focusing on specific traits. Starting with the character's fictional origin, the guesser model excludes humans, animals, and animated series characters, then pinpoints the science fiction genre and alien identity. The key moment comes when the model connects the character to Douglas Adams' work, quickly zeroing in on Zaphod Beeblebrox by confirming his role as the President of the Galaxy. Each question systematically reduces the scope, leading to a confident final guess. That's a decent reasoning process no human would be ashamed of.
Cost of GPTs Playing the Game
The experiment came with a price tag for using the GPTs via the OpenAI API: the overall cost was 21.30 USD. The chart below shows each model's share of this cost.

It's important to remember that gpt-4o, while being the most costly, was used not only as the guesser but also as the judge for all the turns. It consumed a total of 1.9M context tokens, while the other models consumed between 260K and 420K (the differences result from the models' quality: the better the model, the quicker it guessed a character, and the fewer questions it asked, the fewer context tokens it consumed). So, considering that the majority of the gpt-4o cost was related to the judge role, it is also the winner in terms of cost-effectiveness.
Both gpt-3.5-turbo and gpt-4o-mini were the least expensive, but also got the worst results.
Final Thoughts
This little experiment with the "Forehead Detective" game turned out to be more insightful than I initially anticipated. It started as a fun way to engage my family during a long road trip, but quickly evolved into a deeper exploration of how GPT models handle reasoning tasks. The results were both impressive and humbling.
Seeing models like gpt-4o navigate through questions and arrive at correct answers most of the time is a testament to how far AI has come. However, the hiccups – like failing to recognize Wally, a character that's more about visual identification than narrative – are a sign that we're still on a journey with these technologies. It's clear that while GPTs can simulate logical reasoning, they don't yet fully grasp the intricacies of human thought processes, especially when the task requires a blend of abstract thinking and common sense.
It's also important to consider the cost of using GPT models. The cheapest models also turned out to be the weakest in the game. At the same time, gpt-4o, while providing the best results, was also the most cost-effective of the top 3 models.
In the end, this experiment wasn't just about testing AI – to me it was a reminder of the incredible progress we've made with LLMs and the challenges that lie ahead. It's exciting to think about what the future holds as we continue to refine these models. And, on a more personal note, it was a great way to blend work and play, giving my family a glimpse into the world I'm so passionate about.