BERT vs GPT: Comparing the NLP Giants

Image generated by the author using Stable Diffusion.

In 2018, the BERT paper [1] amazed NLP researchers. The approach was simple, yet the results were impressive: it set new state-of-the-art results on 11 NLP tasks.

In a little over a year, BERT became a ubiquitous baseline in Natural Language Processing (NLP) experiments, with over 150 research publications analysing and improving the model [2].

In 2022, ChatGPT [3] blew up the whole Internet with its ability to generate human-like responses. The model can comprehend a wide range of topics and carry on a conversation naturally for an extended period, which sets it apart from traditional chatbots.

BERT and ChatGPT are both significant breakthroughs in NLP, yet their approaches are different. How do their structures differ, and how do those differences shape what each model can do? Let's dive in!

Attention

To fully understand the models' structures, we must first recall the commonly used attention mechanism. Attention mechanisms are designed to capture and model relationships between tokens in a sequence, which is one of the reasons they have been so successful in NLP tasks.

An intuitive understanding

  • Imagine you have n goods stored in boxes v_1, v_2, …, v_n. These are called "values".
  • We have a query q which demands a suitable amount of goods from each box. Let's call these amounts w_1, w_2, …, w_n (the "attention weights").
  • How do we determine w_1, w_2, …, w_n? In other words, how do we know which of v_1, v_2, …, v_n should be taken more than the others?
  • Remember, all the values are stored in boxes we cannot peek into, so we can't directly judge whether v_i should be taken less or more.
  • Luckily, each box carries a tag k_1, k_2, …, k_n, called a "key". The keys represent the characteristics of what is inside the boxes.
  • Based on the similarity between q and k_i (q·k_i), we can decide how important v_i is (w_i) and how much of v_i we should take (w_i·v_i).
Base attention mechanism (Image by the author)

Of course, that is a very abstract explanation of attention, but it helps me remember the meaning behind "query", "key", and "value".
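To make the intuition concrete, here is a minimal NumPy sketch of dot-product attention. The variable names mirror the boxes/keys/values story above; the softmax and scaling follow the standard formulation and are not specific to any particular model.

```python
# A minimal sketch of the boxes/keys/values intuition above.
# Real Transformer attention adds learned projections and multiple heads.
import numpy as np

def attention(q, K, V):
    """q: (d,) query, K: (n, d) keys, V: (n, d_v) values."""
    scores = K @ q / np.sqrt(q.shape[0])       # similarity of q with each key k_i
    w = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights w_i
    return w @ V                               # weighted sum of values, sum_i w_i * v_i

# Toy example: 3 "boxes" with 4-dimensional keys and values.
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
q = np.random.randn(4)
print(attention(q, K, V))   # a blend of v_1, v_2, v_3, weighted by similarity to q
```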

Next, let's take a deeper look at how the Transformer models use different types of attention.

BERT: Global self-attention and bidirectional encoder

In global self-attention, the query, key, and value all come from the same sequence. Each token "attends" to every other token, so information propagates along the whole sequence, and, more importantly, it does so in parallel (see the sketch after the RNN/CNN comparison below).

Global self-attention [4]

This is significant compared with RNNs and CNNs.

  • For an RNN, each "state" is passed through many steps, which may cause a loss of information. Besides, an RNN processes the tokens sequentially, so we can't make use of GPU parallelism.
  • For a CNN, even though it runs in parallel, each token can only attend to a limited receptive field, which imposes assumptions about the tokens' relationships.
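Here is a hedged PyTorch sketch of the "every token attends to every token, in parallel" point: the same sequence is passed as query, key, and value to torch.nn.MultiheadAttention, and one matrix operation updates all positions at once. The sequence length, model dimension, and head count are arbitrary toy values.

```python
# Self-attention: query = key = value = the same token sequence, so every
# position attends to every other position in a single parallel operation.
import torch
import torch.nn as nn

seq_len, d_model = 6, 16
x = torch.randn(1, seq_len, d_model)   # one sequence of 6 token embeddings

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, weights = self_attn(x, x, x)      # query = key = value = x

print(out.shape)       # torch.Size([1, 6, 16]) -> one updated vector per token
print(weights.shape)   # torch.Size([1, 6, 6])  -> each token attends to all 6 tokens
```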

Self-attention is the key component of the encoder, the building block of BERT [1]. The BERT paper's authors pointed out the limits of left-to-right language models as follows.

Such restrictions are sub-optimal for sentence-level tasks and could be very harmful when applying finetuning-based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions. [1]

BERT pre-training [1]

To overcome the shortcoming above, BERT was pre-trained on two tasks: "masked language model" (MLM) and "next sentence prediction" (NSP).

  • For the MLM task, 15% of token positions are selected for prediction. Of those chosen tokens, 80% are replaced with a [MASK] token, 10% are replaced by a random token, and 10% are left unchanged (see the sketch after this list).
  • For the NSP task, given two sentences s1 and s2, the input format is "[CLS] s1 [SEP] s2 [SEP]", and the model predicts whether s1 is followed by s2. [CLS] and [SEP] are the special classification and separator tokens, respectively.
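As a concrete illustration of the MLM corruption rule, here is a hedged Python sketch. The 15% selection rate and the 80/10/10 split follow the description above; the [MASK] token id and vocabulary size shown are those of the bert-base-uncased checkpoint and are purely illustrative.

```python
# Sketch of BERT's MLM corruption rule: pick 15% of positions; of those,
# 80% become [MASK], 10% become a random token, 10% stay unchanged.
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # bert-base-uncased values; illustrative only

def mask_for_mlm(token_ids, mlm_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels[i] = tok                    # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels

print(mask_for_mlm([2023, 2003, 1037, 7099, 6251]))
```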

As we can see, in both tasks the model can "peek" at both the left and right context of each token. This allows the model to take advantage of bidirectional word representations and gain a deeper understanding.

But the bidirectional encoding comes at a cost. Lacking a decoder, BERT is not well suited to text generation. Therefore, the model requires adding extra task-specific architecture on top of the encoder to adapt to generative and other downstream tasks.
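For instance, even for a discriminative task such as text classification, a small task-specific head is attached on top of the encoder. A hedged sketch with the Hugging Face transformers library (the checkpoint name and number of labels below are illustrative choices):

```python
# BERT itself outputs contextual embeddings; a task-specific head (here a
# classification layer) is added on top and trained for the downstream task.
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("BERT excels at understanding language.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)   # scores from the newly added (still untrained) head
```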

GPT: Causal self-attention and text generation

Compared to global self-attention, causal self-attention allows each token to attend only to its left context. This architecture is less suited to tasks such as textual understanding, but it makes the model good at text generation.

Causal self-attention [4]

Namely, causal self-attention allows the model to learn the probabilities of a series of words, which is the core of a "language model" [8]. Given a sequence of symbols x = (s_1, s_2, …, s_n), the model predicts the likelihood of the series as the chain of conditional probabilities p(x) = p(s_1) · p(s_2 | s_1) · … · p(s_n | s_1, …, s_{n−1}), as shown below.

Joint probabilities over a sequence of symbols [6]
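Here is a toy sketch of the causal mask behind this factorisation: each position may only attend to itself and earlier positions, which is what lets the decoder be trained on next-token prediction. The dimensions are arbitrary, and the mask convention follows PyTorch's MultiheadAttention (True means "do not attend").

```python
# Causal mask: position i may only attend to positions <= i, so the model
# can factorise p(x) into next-token predictions p(s_i | s_1, ..., s_{i-1}).
import torch
import torch.nn as nn

seq_len, d_model = 5, 16
x = torch.randn(1, seq_len, d_model)

# Upper-triangular boolean mask: True marks positions that must not be attended.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, weights = attn(x, x, x, attn_mask=causal_mask)

print(weights[0])   # lower-triangular weights: token i puts zero weight on tokens > i
```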

Causal self-attention is the critical component of the Transformer decoder block. One of the first pre-trained Transformer decoders is GPT [5] by OpenAI. Like BERT, the model aims to utilise a massive corpus of unlabeled text to build a pre-trained language model. Pre-trained on BookCorpus [7], the model's objective is to predict the next token. The pre-trained model is then fine-tuned to adapt to downstream tasks.

GPT-2 [6] shares the same approach of building universal word representations but is more ambitious. It aims to be a "multitask learner", performing different tasks without fine-tuning. GPT only learns the conditional distribution p(output | input), which leaves the model without context on "what task to do". The authors wanted to adapt GPT-2 to multiple tasks by conditioning the prediction on both the input and the task, p(output | input, task).

Previous approaches have incorporated the "task" information at the architectural level, but GPT-2 makes this more flexible by "expressing" the task through natural language. For example, a translation task's input can be expressed as "translate to French, [English text], [French text]".
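As a hedged illustration of expressing the task in natural language, the sketch below prompts a pretrained GPT-2 checkpoint through the Hugging Face pipeline API. The prompt wording is an illustrative choice, and the small gpt2 checkpoint will not translate reliably; the point is only that the task is conveyed through the text itself rather than through the architecture.

```python
# "Expressing" the task in the prompt rather than in the architecture.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Translate English to French: cheese =>"
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```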

Mining a large amount of unlabeled text with explicit "task" information can be challenging. However, the authors believed the model could infer the implicit "task" expressions from natural language. Therefore, they collected a vast and diverse dataset that demonstrates "tasks" across varied domains. Namely, the model was trained on the WebText dataset [6], containing the text extracted from 45 million links.

Despite less competitive performance on some benchmarks, GPT-2 laid the groundwork for many later LLMs, such as GPT-3 [9] and ChatGPT. In particular, GPT-3 can comprehend tasks and demonstrations solely through text-based interactions. On the SuperGLUE benchmark [10], a suite of language understanding tasks, GPT-3, without any gradient-based updates, shows impressive performance compared to a fine-tuned BERT.

Performance of GPT-3 and BERT on SuperGLUE [9]

Which model to choose?

Based on the models' structures, we can conclude that BERT excels at understanding language and extracting contextual information, making it ideal for tasks like sentiment analysis and text classification. In contrast, GPT models are designed for generating human-like text, making them a top choice for chatbots and language-generation tasks.

Another important factor is our data resources. We can easily customise recent GPT models for specific tasks with only a small amount of data, making them suitable for a broader range of applications. On the other hand, BERT fine-tuning might require more effort and data. For LLM fine-tuning techniques, you can check out my post:

A Quick Guide to Fine-tuning Techniques for Large Language Models

Last but not least, we also need to consider our computational resources. Although there have been many optimisation efforts, fine-tuning, storing and serving an LLM still demands substantial resources compared to BERT.

Or you may enjoy the best of both worlds by incorporating them together. I will cover this topic in a future article.

For now, I hope you enjoyed the read.
