How 25,000 Computers Trained ChatGPT

What word comes after Good?
You might think Good Morning or Good Bye. But you definitely wouldn't say Good Loud. That just doesn't make sense. For decades, computer scientists have been training AI to solve this exact problem.

Given a word, our AI predicts the next word. Do this several times, & you've generated a sentence.
This is how ChatGPT works.
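To make this concrete, here's a toy sketch of the idea in Python. The word probabilities below are invented for illustration; a real model learns them from enormous amounts of text.

```python
import random

# Invented probabilities for illustration; a real model learns these from data.
# Note that "loud" has zero probability after "good", so it's never generated.
next_word_probs = {
    "good": {"morning": 0.5, "bye": 0.4, "loud": 0.0, "dog": 0.1},
    "morning": {"everyone": 0.6, "sunshine": 0.4},
    "bye": {"now": 1.0},
    "dog": {"walks": 1.0},
}

def predict_next(word):
    candidates = next_word_probs[word]
    return random.choices(list(candidates), weights=candidates.values())[0]

# Generate a sentence by repeatedly predicting the next word
sentence = ["good"]
for _ in range(2):
    sentence.append(predict_next(sentence[-1]))
print(" ".join(sentence))  # e.g. "good morning everyone"
```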
Trained on the entire internet, ChatGPT has learned how to chat like a human. However, this immense feat was only made possible by a breakthrough in the late 2010s: a breakthrough underpinning ChatGPT & forever shaping the world we live in.
This is the story of an AI that read & learned from every book, tweet, & website across the entire internet. And how it was made possible.
Sentences are long.
When we move beyond a single word, next word prediction is a lot harder. Take this example.

In this context, it makes no sense to say I ate a good morning. But our AI only looks at good, and spits out morning. In most cases, even humans need many preceding words to predict the next one. So an AI needs this extra information as well.
Our AI needs to read many words to predict the next word. ChatGPT can read more than 8,000 previous words at once. The natural way to do this would be to feed each word into the AI, one by one.

This is how AIs worked in the past. A Recurrent Neural Network (RNN) would take one word at a time, storing up information as it read through a sentence.
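In code, that sequential loop might look something like this. It's a minimal sketch with random, untrained weights; a real RNN learns its weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = embed_size = 8

# Random, untrained weights for illustration; a real RNN learns these
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_xh = rng.normal(size=(hidden_size, embed_size)) * 0.1

def rnn_read(word_vectors):
    """Read a sentence one word at a time, carrying a hidden state."""
    h = np.zeros(hidden_size)
    for x in word_vectors:  # sequential: each step must wait for the last
        h = np.tanh(W_hh @ h + W_xh @ x)
    return h  # a summary of everything read so far

sentence = rng.normal(size=(5, embed_size))  # 5 stand-in word vectors
print(rnn_read(sentence))
```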
One problem with this approach is that it's incredibly slow. Each word has to wait for the last, which becomes a problem at large scales. Imagine if your washing machine could only wash one shirt at a time. This sequential process would take days. But everyone knows we can throw in all our shirts at once, finishing in minutes. This is the idea of parallelism: by performing work in parallel instead of sequentially, we can dramatically speed up washing machines, computers, & AI.
RNNs could only be trained on millions of words, nowhere near the trillions across the internet. We needed a faster, more efficient way of reading sentences.
The Transformer was the solution.
In 2017, a paper titled Attention Is All You Need was published. This paper effectively turned sentences on their side: its researchers invented an AI that can read a whole sentence at once.

This new AI is called a Transformer, and its efficiency allowed it to learn from every book & website on the internet. To understand how it does this, we need to take a step back & understand how computers read text.
How can an AI read text?
Computers work in 1s & 0s. These 1s & 0s, called binary, make up numbers. Computer scientists needed a way to represent words as numbers. And this was made possible in 2013, when scientists at Google created word2vec.
Words contain semantic meaning. Dogs are related to cats. Kings are related to queens. Word2vec was able to capture these semantics in vectors, or lists of numbers.
With word2vec, you could take King, subtract Man, add Woman, and get the word Queen.
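We can sketch this arithmetic directly. The vectors below are hand-made toys; real word2vec embeddings are learned from billions of words and have hundreds of dimensions.

```python
import numpy as np

# Hand-made toy vectors for illustration only; real word2vec embeddings
# are learned from text and are much higher-dimensional.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word whose vector points in the closest direction (cosine similarity)
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(max(vectors, key=lambda w: cosine(vectors[w], result)))  # queen
```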

This vector of numbers is called a Word Embedding; it embeds the word's meaning into the vector. When training an AI to process text, we actually feed it these word embeddings. The AI does some math, transforming these vectors, and spits out the next word. Transforming these vectors is what takes most of the time.

The Transformer does this all in parallel.
Instead of waiting for the previous word to process, we transform all the word embeddings at the same time, then perform a weighted averaging to combine them. This reduces the number of sequential operations from the length of the sentence to a constant.
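Here's a simplified sketch of that parallel averaging in NumPy. Real Transformers add learned query/key/value projections and multiple attention heads, but the core trick is the same: one matrix multiplication handles every word at once.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_size = 5, 8
X = rng.normal(size=(seq_len, embed_size))  # one embedding per word

# Score how relevant each word is to every other word, all pairs at once
scores = X @ X.T                                # (seq_len, seq_len)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax: each row sums to 1

# Each output row is a weighted average of ALL the word embeddings,
# computed in one matrix multiply instead of a word-by-word loop
output = weights @ X
print(output.shape)  # (5, 8)
```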

Lambda Labs estimated that training ChatGPT on a single GPU would take 355 years. But by exploiting the Transformer's parallelism, OpenAI trained ChatGPT across 25,000 GPUs, finishing in a matter of days.
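The back-of-the-envelope arithmetic holds up, at least under ideal linear scaling (real distributed training loses some time to communication between GPUs):

```python
# Ideal linear speedup; real multi-GPU training is somewhat less efficient
years_on_one_gpu = 355
gpus = 25_000
print(f"{years_on_one_gpu * 365 / gpus:.1f} days")  # ~5.2 days
```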
The Transformer sparked a paradigm shift in AI
With increased parallelism, larger and larger AIs could be trained. While the largest sequential models in the past were trained on millions of words, ChatGPT was trained on nearly a trillion.

ChatGPT was trained on CommonCrawl, a collection of the entire internet crawled since 2008. Using over 25,000 computers, this model read & learned from websites across the entire internet. Imagine reading every book, tweet, & piece of code ever published.
Today, ChatGPT is being used to write code, generate TV commercials, & assist you with almost anything you can imagine! By turning sentences on their side, we've ushered in a new era of AI, one that pushes the boundaries of what was once thought possible.
But we may have reached a limit.
After the release of GPT-4, Sam Altman, OpenAI's CEO, said:
"I think we're at the end of the era where it's going to be these, like, giant, giant models…"
After learning from the entire internet, what comes next? The impact of ChatGPT is trickling down into every industry. But as with any breakthrough, progress eventually plateaus. As for AI's next inflection point, only time will tell.
