LLMs Are Dumber Than a House Cat


Frustration? Confusion? Perhaps "lack of elegance" is a better expression. It's the pain you feel when you witness a top-tier scientist marvel at technology they already understand.

AI influencers play the amazement card to gain clicks, but for scientists and engineers, it's a different story. Magic is supposed to fade as soon as you uncover the trick.

That's why it stings to see researchers at Microsoft and beyond using words like "impossible," "insane," and "astonishing" to describe GPT-4 months after its release.

Not to pick on Sébastien Bubeck in particular, but if auto-complete on steroids can "blow his mind," imagine the effect on the average user.

Developers and data practitioners use LLMs every day to generate code, synthetic data, and documentation. They too can be misled by inflated capabilities. It's when humans over-trust their tools that mistakes happen.

TL;DR: This is an anti-hype take where you'll understand how LLMs work, why they're dumb, and why they're very useful anyway – especially with a human in the loop.


The busy person's intro to LLMs

If an LLM were a folder, it would contain two files: the first is code you can execute and the second is a CSV (a large table) filled with numbers.

  1. The code defines the structure of your model's neural network and the instructions needed to run it. It's like telling your computer how to organize itself to perform a certain type of calculation.
  2. The CSV file is a long list of numbers, called weights. These weights determine how the neurons inside your artificial neural network (neural net) behave.
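
Here's a toy sketch of that two-file idea. Everything below is made up for illustration – real LLMs store billions of weights in binary formats rather than a CSV, and their "code file" implements a transformer rather than a single layer.

```python
# Toy illustration of the "two files" idea (a sketch, not how real LLMs are stored):
# weights.csv holds the numbers; this script is the code that knows how to use them.
import numpy as np

rng = np.random.default_rng(0)
np.savetxt("weights.csv", rng.normal(size=(4, 3)), delimiter=",")  # the "CSV file"

def forward(x: np.ndarray) -> np.ndarray:
    """The 'code file': it defines the network's structure and how to run it."""
    W = np.loadtxt("weights.csv", delimiter=",")  # load the learned numbers
    return np.tanh(x @ W)                         # one layer plus a non-linearity

print(forward(np.array([1.0, 0.5, -0.3, 2.0])))   # -> 3 output numbers
```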

Think of a neural net as a chef trying to perfect a recipe. Each ingredient (input) can drastically change the flavor of the dish (output).

The weights of a neural net represent the precise measurements for each ingredient. Just as a chef adjusts each ingredient's amount to improve the taste, the neural net tweaks the weights of each input to get the desired outcome.

Over time and repetitions (training), the chef learns the right balance of flavors – and so does the neural network. It learns the optimal weights to make accurate predictions or decisions.

Each successful recipe, refined through trial and error, is recorded with exact measurements. That's your CSV file. That's your collection of weights. And just as it costs time and resources to train a masterful chef, weights are expensive to get.

You have to inject massive amounts of data into your model and let it train for days on end. You also need specialized processors, called GPUs (graphics processing units), to run many calculations at the same time (parallel processing).

For instance, Llama 2 70B from Meta was trained for 12 days using 6,000 GPUs, reaching a cost of $2M. Yes, that's only to get the weights.

Once you pay millions of dollars to get your "recipes," you can reuse them indefinitely, at the cost of pennies. Every time you apply a recipe to a bunch of ingredients you perform what we call an "inference."

These "recipes" are a bit more complex than a chef's, however. They include thousands of scientific texts, fiction books, and blog posts. Almost any sequence of words published online – including nonsense – goes into an LLM's training data.

At this point you have a "pre-trained" model that can't answer your questions yet. What you get instead is "predict the next token." You give your model a series of words and it dreams a possible continuation.

For instance, you say "The answer to the ultimate question of life, the universe, and everything is…" and the model says "42."

Now, if you ask your pre-trained model: "What's the capital of France?" it'll likely say "What's the capital of Spain?" because it's a pattern it has seen in thousands of quizzes published online.
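
To make "predict the next token" concrete, here's a deliberately tiny sketch: a word-level bigram counter standing in for the real thing. Transformers are vastly more sophisticated, but the core behavior – continuing a prompt with whatever pattern is most common in the training text – is the same.

```python
# A laughably small "pre-trained model": it counts which word tends to follow
# which, then keeps predicting the most likely next word. Pattern completion,
# not reasoning.
from collections import Counter, defaultdict

corpus = (
    "what is the capital of france ? what is the capital of spain ? "
    "what is the capital of italy ? the capital of france is paris ."
).split()

next_word = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    next_word[a][b] += 1                 # "training": count co-occurrences

def dream(prompt: str, n_tokens: int = 8) -> str:
    words = prompt.split()
    for _ in range(n_tokens):
        counts = next_word[words[-1]]
        if not counts:
            break
        words.append(counts.most_common(1)[0][0])  # greedy next-word prediction
    return " ".join(words)

print(dream("what is the capital of"))
```

Run it and the continuation answers "france," then starts reciting the next quiz question – exactly the pattern-matching behavior described above.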

If you want your model to answer questions, you need to add extra steps.

  • Fine-tuning: you curate a list of questions (Q) and appropriate answers (A) and feed these Q/A pairs to your model. The model then learns to respond to questions based on the examples you fed it (a minimal data-formatting sketch follows this list).
  • Guidelines (more fine-tuning): in this step, you add guardrails for safety, increase accuracy, and adjust the tone. Guideline techniques involve further fine-tuning, preference scoring (RLHF), and writing system prompts.
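
For a sense of what the fine-tuning step consumes, here's a minimal sketch of Q/A pairs turned into training examples. The JSONL chat format below is a common convention, but the exact schema varies by model and provider – treat the field names as illustrative.

```python
# A generic sketch of how Q/A pairs become fine-tuning examples.
# The template and file format vary by model; nothing here is a specific API's schema.
import json

qa_pairs = [
    {"question": "What's the capital of France?", "answer": "Paris."},
    {"question": "Who wrote 'On Bullshit'?", "answer": "Harry Frankfurt."},
]

with open("finetune_data.jsonl", "w") as f:
    for pair in qa_pairs:
        example = {
            "messages": [
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }
        f.write(json.dumps(example) + "\n")  # one training example per line
```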

What's impressive is that you preserve the "dreaming" capability of your pre-trained model and add a question-answering capability on top of it.

Once you're done with training and several fine-tuning steps, you get something like ChatGPT Classic – a chatbot that can answer your questions and generate all kinds of output.

What a lot of people don't get, however, is that your now-very-useful assistant LLM is still dreaming up every single answer.


LLMs don't think, they hallucinate 24/7

When people say "LLMs hallucinate" they usually mean "LLMs make factual errors." This interpretation is a few miles off target.

"I always struggle a bit [when] I'm asked about the ‘hallucination problem' in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines." – " Andrej Karpathy, OpenAI founding member.

LLMs are like freestyle rappers. They don't care much about accuracy. Their goal is to generate a plausible answer based on the prompt you give them. Much like a rapper who makes up lyrics on the fly, LLMs predict one token at a time – all while trying to remain grounded within the context.

Sure, LLMs draw from knowledge acquired during their training phase, but they don't reason before writing a response. Accuracy is merely a positive "side effect" that derives from a clever approach.

Say you're building an LLM. Your ultimate goal is to generate factual information. If you take all of the content written by humans and compress it into "recipes of knowledge," you should be able to land on facts when you try to predict the next word, right?

In a sense, you're making a bet that most of the training data is facts. You then increase your chances of success through fine-tuning and safety guidelines. Facts in; facts out.

Still, your LLM itself doesn't reason about what's true or false. It's merely predicting the most likely words based on language patterns seen before.

This is why people call LLMs bullshit machines. Not in the slang meaning of the term, but rather, the philosophical one.

Philosopher Harry Frankfurt described bullshit as information that's disconnected from reality. When you lie, you distort reality. When you tell the truth, you describe your representation of reality. But when you bullshit, you make things up with no consideration of reality (or the truth).

It is just this lack of connection to a concern with truth – this indifference to how things really are – that I regard as of the essence of bullshit.

This points to a similar and fundamental aspect of the essential nature of bullshit: although it is produced without concern with the truth, it need not be false.

– On Bullshit by Harry Frankfurt. [Emphasis by the author.]

"But is it really necessary to reason?" you may ask. "If we have a lot of clean data, predicting the next token should get us to a creative fact-spitting machine, and perhaps even AGI…no?"

Hold that thought.


Lost in compression

Ilya Sutskever, one of OpenAI's cofounders, argued that: "Predicting the next token, well, means that you understand the underlying reality that led to the creation of that token."

Ilya is one of the smartest people alive, but he's not immune to logical fallacies.

Human language is a compressed version of reality, but it's a lossy compression. You lose information when you compress a description of reality into a series of words.

When you compress an image or an Excel table, you get a zip file. If you right-click and extract the content, you retrieve 100% of the information you compressed earlier.
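
That round-trip guarantee is what "lossless" means, and it's easy to check in a few lines (using zlib as a stand-in for your zip file). Human language offers no such guarantee.

```python
# Lossless compression round-trip: you get back exactly what you put in.
import zlib

original = b"Imagine a purple elephant flying across an orange sea." * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed is much smaller
print(restored == original)            # True: 100% of the information survives
```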

With human language, it's different. When someone tells you "Imagine a purple elephant flying across an orange sea," they're compressing a fictional scene into nine words.

Chances are you just decompressed the previous sentence into a short video you played inside your head. Appreciate how you get the general idea, but some (crucial) information is lost in compression.

You don't know the exact size of the elephant. You don't know what shade of purple it is, or whether it's a biological elephant or an artificial one. Besides, what makes the sea orange? What kind of waves does it have?

Sure, you can add words to compensate for the missing information – colors, textures, and speed, to name a few. But for these extra descriptions to make sense, you need to be able to simulate their meaning.

What if the elephant is made of hepatizon and travels at Mach 3? If you don't have a model of "hepatizon" and "Mach 3," your decompression fails.

Hepatizon is an ancient purple-hued bronze alloy, and Mach 3 is three times the speed of sound.

You encode reality using human language but you need more than language to decode it. Think symbols, logic, mental simulations, and an understanding of the laws of physics.

Now let's add LLMs to the picture.

LLMs are a compression of human language – and it's also a lossy compression. You compress twice, you lose information twice. Meaning you get even further from an accurate representation of reality.

The same logic applies to alternative realities where metallic purple elephants fly at subsonic speeds. If you don't have a grasp of the laws of physics that govern reality, you can't simulate hypothetical ones.

Let's illustrate with an example.

The Sanjok riddle

My friend, who's about 33 feet (10 meters) away from me, very playfully, gently, and slowly throws a Sanjok at me.

A Sanjok is a pillow-like object made of a special kind of steel: a state-shifting steel.
The state-shifting ability activates only when the Sanjok is traveling through the air.

Every second, the steel switches back and forth from being as light as a bag of feathers to a state where it's as heavy as a giant boulder.
This means the total weight of the Sanjok can vary from 1 pound (0.45 kg) to 5,000 pounds (2,268 kg) – and vice versa.

Who's in danger? What should I do?

Take a minute. You may need a piece of paper too.

Ready for the answer?

GPT-4 stumbles on this one – it gets lost in the state-shifting steel. Here's the actual sequence of events, based on basic mechanics and simple heuristics:

  • Your friend throws the Sanjok. It's a slow, gentle throw that travels at around 10 feet (3 meters) per second.
  • During the first second, the Sanjok is in "light mode" and weighs one pound (0.45 kilograms).
  • After one second in the air, the Sanjok switches to "heavy mode." The Sanjok now weighs 5,000 lbs (2,268 kg).
  • Let's "pause" time here.
  • The Sanjok is now suspended mid-air at 6 ft (2 m) of height.
  • It's at a distance of 10 ft (3 m) away from your friend and 23 ft (7 m) away from you.
  • The sudden jump in mass kills the initial velocity from your friend's weak throw: momentum is conserved, so a Sanjok that's 5,000 times heavier moves 5,000 times slower – effectively zero.
  • The Sanjok is about to begin a free fall from 6 ft (2 m) of height.
  • Now press "play."
  • It takes the 5,000-pound Sanjok about 0.64 seconds to hit the ground – Newton and Galileo can testify to that (there's a quick calculation after this list). Given the weight of the Sanjok, you can neglect all external forces (like air resistance) except gravity.
  • The Sanjok hits the ground after 0.64 seconds. It lands roughly where it paused: about 10 ft (3 m) from your friend and 23 ft (7 m) from you, give or take a few inches.
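
The 0.64-second figure comes from plain free-fall physics – mass doesn't matter once air resistance is ignored. A quick check:

```python
# Sanity check on the free-fall time: t = sqrt(2h / g), no air resistance.
import math

g = 9.81                        # m/s^2, standard gravity
h = 2.0                         # meters, roughly 6 ft
t = math.sqrt(2 * h / g)
print(f"fall time = {t:.2f} s")  # -> 0.64 s
```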

Conclusion: No one is in danger if you're playing this game in a free field. No need to hide or seek cover. But if you're playing Sanjok on a wooden rooftop, it's a different story.

You can solve the Sanjok riddle because your model of reality includes the laws of physics. LLMs struggle because their model of reality is 100% abstract language patterns – and nothing else (for now).

If you want an LLM to give you the right answer, you have to either:

  1. Break the Sanjok riddle into several steps. Make each step resemble other riddles your LLM has seen in the training data (see the prompt sketch after this list).
  2. Feed several hundred variants of the Sanjok riddle into the training/fine-tuning data.
  3. Write the answer in the prompt.
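
Here's what option 1 can look like in practice: the riddle pre-chewed into sub-problems that resemble textbook exercises. The message list below is a plain sketch you could feed to whichever chat model or client you use; no specific API is assumed.

```python
# Option 1: spoon-feed the riddle as small, familiar sub-problems.
riddle_steps = [
    "Step 1: A pillow is thrown gently at 3 m/s from 10 m away. "
    "How far does it travel in the first second?",
    "Step 2: After 1 second, its mass jumps to 2,268 kg. "
    "Using conservation of momentum, what happens to its velocity?",
    "Step 3: An object is dropped from 2 m with no horizontal velocity. "
    "How long does it take to hit the ground, and where does it land?",
    "Step 4: Given the landing spot, who (if anyone) is in danger?",
]

messages = [{"role": "system", "content": "Answer each step before moving on."}]
for step in riddle_steps:
    messages.append({"role": "user", "content": step})

print(messages)  # pass these to your chat model of choice, one step at a time
```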

You can still surpass human output using next-token prediction in many areas. Every LLM is better than me at writing Japanese poems.

But there's no way predicting the next token means your LLM understands the reality that led to the creation of that same token.

You only need to know what words came before and after in the training data, regardless of what they mean. It's the reason why LLMs struggle with simple physics riddles they've never seen before.

It's why Ilya's argument doesn't hold up under closer scrutiny.


LLMs are great actors though

LLMs are exceptional at faking knowledge thanks to their eloquence. Even when they generate false statements, they use coherent and elegant formulations, which makes it hard for non-experts to distinguish facts from made-up nonsense.

"We're easily fooled by those systems into thinking they are intelligent just because they manipulate language fluently.

The only example that we have of an entity that can manipulate language is other humans so when we see something that can manipulate language flexibly we assume that entity will have the same type of intelligence as humans but it's just not true.

Those systems are incredibly stupid.

Partly they are stupid because they are only trained on language and most of human knowledge has nothing to do with language.

To some extent the smartest AI systems today have less understanding of the physical world than your house cat."

– Yann LeCun, Meta's Chief AI Scientist.

The "human knowledge that has nothing to do with language" is all the information lost in compression. It's math, reasoning, planning, and laws of physics, among other things. These knowledge gaps become apparent when you consider practical scenarios.

For instance, a startup called Patronus AI tested GPT-4 on a series of finance tasks. The most capable model available in 2023 achieved a score of 79% – a figure that, while impressive, remains insufficient given the high stakes of the tasks.

"That type of performance rate is just absolutely unacceptable," Patronus AI co-founder Anand Kannappan said. "It has to be much much higher for it to really work in an automated and production-ready way."

It's no surprise many AI experts believe we need further innovation to unlock more capabilities. Scaling up LLMs has potential, but it won't bridge all the existing gaps, let alone reach Artificial General Intelligence.

"I think we need another breakthrough. We can push on large language models quite a lot, and we should, and we will do that," OpenAI CEO Sam Altman said. "We can take our current hill that we're on and keep climbing it, and the peak of that is still pretty far away."

"But within reason, if you push that super far, maybe all this other stuff emerged," he added. "But within reason, I don't think that will do something that I view as critical to a general intelligence."

Does this mean LLMs are useless in the meantime?


Humans + LLMs + tools = super powers

Patronus AI described GPT-4's performance as "unacceptable" in the context of automating 100% of a given task.

Another way to see the results is: LLMs can handle the boring 79%, while human operators focus on the critical 21%. Put differently, your workload becomes smaller and more stimulating at the same time.

Similar trends have been observed in other studies where developers, data practitioners, and business consultants became twice as fast in certain tasks when using LLMs. Output quality also increased.

You'll see a lot of headlines like "Can LLMs replace a Data Scientist?" and "Can LLMs replace Developers?" For now, the answer to these questions is: No. LLMs won't replace you, but people who use LLMs will.

Plus, it's not exactly LLMs we're talking about. It's "LLMs + tools."

LLMs alone are dream machines. Equip them with code interpreters, web browsers, and image generators, and they become AI assistants. Picture the difference between ChatGPT when it first came out versus ChatGPT now.
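
Here's a toy version of that "LLM + tools" loop, with a stubbed-out model standing in for a real one. Real agent frameworks and function-calling APIs are far more elaborate; the point is only the shape of the loop: the model asks for a tool, your code runs it, and the result goes back into the conversation.

```python
# A toy "LLM + tools" loop. model_stub stands in for a real LLM that has been
# instructed to reply either with a final answer or with a tool request like
# {"tool": "calculator", "input": "2268 * 9.81"}. Everything here is a
# hypothetical sketch, not any particular framework's API.
import json

def calculator(expression: str) -> str:
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        return "error: unsupported expression"
    return str(eval(expression))  # fine for this toy; never eval untrusted input

TOOLS = {"calculator": calculator}

def model_stub(conversation: list) -> str:
    # Pretend the LLM asks for a tool first, then answers once it sees the result.
    if not any("tool_result" in turn for turn in conversation):
        return json.dumps({"tool": "calculator", "input": "2268 * 9.81"})
    return "The Sanjok weighs about 22,249 newtons in heavy mode."

conversation = [{"user": "How many newtons does a 2,268 kg Sanjok weigh?"}]
request = json.loads(model_stub(conversation))
result = TOOLS[request["tool"]](request["input"])  # run the tool for the model
conversation.append({"tool_result": result})
print(model_stub(conversation))                    # model answers using the result
```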

With such AI assistants, you no longer have to start every task from scratch. What you have to do, however, is verify the output of your prompts.

The more we use AI assistants, the more we'll need verification. That's your main role as the human in the loop.

"[LLMs] just can't do their own planning/reasoning with any guarantees," AI researcher Subbarao Kambhampati said. "So they are best used in LLM-Modulo settings (with either a sound reasoner or an expert human in the loop)."

There are two complementary ways this scenario plays out:

  • Augment humans with AI assistants: Humans become artisans who combine information and AI outputs to produce results.
  • Delegate tasks to AI assistants: Humans become managers and supervisors who commission, verify, and correct AI outputs (a minimal sketch of this workflow follows).
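
A minimal sketch of the "delegate and verify" workflow, with placeholder functions where your actual LLM client and review criteria would go: the model drafts everything, a cheap check flags the risky items, and only those go to a human.

```python
# Sketch of the "manager and supervisor" workflow. draft_with_llm is a
# placeholder for whatever client you actually use; the review rule is made up.
def draft_with_llm(task: str) -> str:
    return f"Draft answer for: {task}"        # placeholder LLM call

def needs_human_review(task: str, draft: str) -> bool:
    # A real check would also inspect the draft itself.
    risky_words = ("regulatory", "payment", "contract")
    return any(word in task.lower() for word in risky_words)

tasks = [
    "Summarize last week's sales figures",
    "Draft the regulatory disclosure for Q3",
    "Write alt-text for the product photos",
]

for task in tasks:
    draft = draft_with_llm(task)
    if needs_human_review(task, draft):
        print(f"[HUMAN REVIEW] {task}")        # the critical ~21%
    else:
        print(f"[AUTO-APPROVED] {task}")       # the boring ~79%
```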

Why are you still here?

LLMs are dumb, but they make you smarter. Faster. More resourceful. They're a bridge between you and computing capabilities – a bridge made of natural language.

Given the right combination of "LLM + tools," you put yourself one prompt away from solving any problem, or at least working towards a solution.

"That's the kind of revolution we're about to see, folks. Not job replacement by machines but rather an unprecedented spike in individual productivity, which raises opportunities and problems for society as a whole." – Fromer Google Chief Decision Scientist, Cassie Kozyrkov.

The spike in productivity won't happen on its own however. You need to get your hands dirty; you need to hit the keyboard. You want to write prompts, design AI assistants, and get into the habit of verification.

Most of these tasks are done in natural language. But as with humans, AI models won't read your mind just because you're talking in plain English.

You want to learn how to write clear instructions, combine them with code, and experiment with different models.

Technology continues to change what kind of work we do and how we do it. Those who embrace it quickly get a head start in the adaptation game.

The question is: Why are you still reading about LLMs instead of playing with them?


Keep in touch?

You can become a Medium member to support me with a tiny commission. You can also subscribe to get email notifications. Smiling also works.


I'm also active on LinkedIn and X and reply to every single message.

For Prompt Engineering inquiries, write me at: [email protected].

