Practical Introduction to Transformer Models: BERT
Preface: This article presents a summary of information about the given topic. It should not be considered original research. The information and code included in this article may have been influenced by things I have read or seen in the past from various online articles, research papers, books, and open-source code.
Table of contents
- Introduction to BERT
- Pre-training and fine-tuning
- Hands-on: Using BERT for sentiment analysis
- Interpreting results
- Closing thoughts
In NLP, the transformer architecture has been revolutionary, greatly enhancing models' ability to understand and generate text.
In this tutorial, we are going to dig deep into BERT, a well-known transformer-based model, and provide a hands-on example of fine-tuning the base BERT model for sentiment analysis.
Introduction to BERT
BERT, introduced by researchers at Google in 2018 [1], is a powerful language model built on the transformer architecture [2]. Pushing past earlier architectures such as LSTMs and GRUs, which were either unidirectional or sequentially bi-directional, BERT considers context from both the left and the right simultaneously. This is due to the innovative “attention mechanism,” which allows the model to weigh the importance of each word in a sentence when generating representations.
The BERT model is pre-trained on the following two NLP tasks:
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
and is generally used as the base model for various downstream NLP tasks, such as sentiment analysis, which we will cover in this tutorial.
Pre-training and fine-tuning
The power of BERT comes from its two-step process:
- Pre-training is the phase where BERT is trained on large amounts of text. In this phase, it learns to predict masked words in a sentence (the MLM task) and to predict whether one sentence follows another (the NSP task). The output of this stage is a pre-trained NLP model with a general-purpose “understanding” of the language (a short illustration of the MLM behavior follows this list).
- Fine-tuning is where the pre-trained BERT model is further trained on a specific task. The model is initialized with the pre-trained parameters, and the entire model is trained on a downstream task, allowing BERT to fine-tune its understanding of language to the specifics of the task at hand.
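To get an intuition for what the MLM objective teaches the model, the short snippet below (my addition, not part of the original notebook) uses Hugging Face's fill-mask pipeline to let the pre-trained, not-yet-fine-tuned BERT fill in a masked word; the example sentence is arbitrary.
from transformers import pipeline
# Fill-mask pipeline backed by the pre-trained (not fine-tuned) BERT model
unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT proposes candidate tokens for [MASK], each with a confidence score
for prediction in unmasker("The movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))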
Hands-on: Using BERT for sentiment analysis
The complete code is available as a Jupyter Notebook on GitHub.
In this hands-on exercise, we will train a sentiment analysis model on the IMDB movie reviews dataset [4] (license: Apache 2.0), in which each review is labeled as either positive or negative. We will load the model using Hugging Face's transformers library [3].
Let’s load all the libraries:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# Variables to set the number of epochs and samples
num_epochs = 10
num_samples = 100 # set this to -1 to use all data
First, we need to load the dataset and the model tokenizer.
# Step 1: Load dataset and model tokenizer
dataset = load_dataset('imdb')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
Next, we’ll create a plot to see the distribution of the positive and negative classes.
# Data Exploration
train_df = pd.DataFrame(dataset["train"])
sns.countplot(x='label', data=train_df)
plt.title('Class distribution')
plt.show()
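If you prefer exact counts to a plot, the same information can be printed directly (an optional addition on my part):
# Optional: print the exact number of reviews per class (0 = negative, 1 = positive)
print(train_df["label"].value_counts())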

Next, we preprocess our dataset by tokenizing the texts. We use BERT’s tokenizer, which will convert the text into tokens that correspond to BERT’s vocabulary.
# Step 2: Preprocess the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
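To see what the tokenizer actually produces, here is a small optional check (my addition, not in the original notebook); the example sentence is arbitrary.
# Optional: inspect the tokenizer output for a single sentence
example = tokenizer("The movie was great!", padding="max_length", truncation=True)
print(example["input_ids"][:10])                                    # token ids, starting with [CLS] (id 101)
print(tokenizer.convert_ids_to_tokens(example["input_ids"][:10]))   # the corresponding word pieces
print(example["attention_mask"][:10])                               # 1 for real tokens, 0 for padding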

After that, we prepare our training and evaluation datasets. Remember, if you want to use all the data, you can set the num_samples variable to -1.
if num_samples == -1:
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
else:
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(num_samples))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(num_samples))
Then, we load the pre-trained BERT model. We'll use the AutoModelForSequenceClassification class, which loads BERT with a classification head on top, designed for classification tasks. For this tutorial, we use the ‘bert-base-uncased’ version of BERT, which is trained on lower-cased English text.
# Step 3: Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
Now, we're ready to define our training arguments and create a Trainer instance to train our model.
# Step 4: Define training arguments
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch", no_cuda=True, num_train_epochs=num_epochs)
# Step 5: Create Trainer instance and train
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()
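Note that, as written, the Trainer only reports the evaluation loss at the end of each epoch. If you also want accuracy during evaluation, one option (a sketch on my part, not part of the original code) is to define a compute_metrics function and pass it when constructing the Trainer.
import numpy as np
from sklearn.metrics import accuracy_score
def compute_metrics(eval_pred):
    # The Trainer passes the evaluation logits and labels to this function
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}
# e.g. Trainer(model=model, args=training_args, train_dataset=small_train_dataset,
#              eval_dataset=small_eval_dataset, compute_metrics=compute_metrics)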
Interpreting results
Having trained our model, let’s evaluate it. We’ll calculate the confusion matrix and the ROC curve to understand how well our model performs.
# Step 6: Evaluation
predictions = trainer.predict(small_eval_dataset)
# Confusion matrix
cm = confusion_matrix(small_eval_dataset['label'], predictions.predictions.argmax(-1))
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()
# ROC Curve (convert the raw logits to class probabilities before computing the curve)
from scipy.special import softmax
probs = softmax(predictions.predictions, axis=-1)
fpr, tpr, _ = roc_curve(small_eval_dataset['label'], probs[:, 1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(1.618 * 5, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()


The confusion matrix gives a detailed breakdown of how our predictions measure up to the actual labels, while the ROC curve shows us the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at various threshold settings.
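If you also want a single numeric summary to go with the plots, the predictions from Step 6 can be passed to scikit-learn's classification_report (an optional addition on my part):
from sklearn.metrics import classification_report
# Precision, recall and F1 per class, using the same predictions as above
y_true = small_eval_dataset["label"]
y_pred = predictions.predictions.argmax(-1)
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))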
Finally, to see our model in action, let’s use it to infer the sentiment of a sample text.
# Step 7: Inference on a new sample
sample_text = "This is a fantastic movie. I really enjoyed it."
sample_inputs = tokenizer(sample_text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
# Move inputs to device (if GPU available)
sample_inputs.to(training_args.device)
# Make prediction
predictions = model(**sample_inputs)
predicted_class = predictions.logits.argmax(-1).item()
if predicted_class == 1:
    print("Positive sentiment")
else:
    print("Negative sentiment")

Closing thoughts
By walking through an example of sentiment analysis on IMDb movie reviews, I hope you’ve gained a clear understanding of how to apply BERT to real-world NLP problems. The Python code I’ve included here can be adjusted and extended to tackle different tasks and datasets, paving the way for even more sophisticated and accurate language models.
References
[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
[3] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … & Rush, A. M. (2019). Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
[4] Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., McMillan-Major, A., Schmid, P., Gugger, S., Delangue, C., Matussière, T., Debut, L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V., Lagunas, F., Rush, A., & Wolf, T. (2021). Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 175–184). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-demo.21
Thanks for reading. If you have any feedback, please feel free to reach out by messaging me on LinkedIn or shooting me an email (smhkapadia[at]gmail.com).