Semantic Textual Similarity with BERT


Ever since its inception by the Google Brain team in 2017, the Transformer has rapidly become the state-of-the-art architecture for a wide range of use cases in Computer Vision and NLP. Its superior performance led to the development of several state-of-the-art models such as BERT and its variants, like distilBERT and RoBERTa.

BERT outperformed the older recurrent models in various NLP tasks such as text classification, Named Entity Recognition (NER), question answering, and even the task that we're going to focus on in this article: semantic textual similarity (STS).

Thus, in this article, we're going to see how we can train a BERT model for the STS task with the help of the Sentence Transformers library. Then, we're going to use the trained model to make predictions on unseen data. But first, we need to know what the STS task actually is and which dataset we will use for it.


Semantic Textual Similarity and the Dataset

Semantic Textual Similarity (STS) refers to the task of measuring how similar one text is to another.

Image by author

The output that we get from a model for an STS task is usually a floating-point number indicating the similarity between the two texts being compared.

There are several ways to quantify the similarity between a pair of texts. Let's take a look at the dataset that we're going to use in this article as an example, which is the STSB dataset (licensed under CC-Share Alike 4.0).

!pip install datasets

from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train")

print(dataset[0])
>>> {'sentence1': 'A plane is taking off.',
 'sentence2': 'An air plane is taking off.',
 'similarity_score': 5.0}

print(dataset[1])
>>> {'sentence1': 'A man is playing a large flute.',
 'sentence2': 'A man is playing a flute.',
 'similarity_score': 3.799999952316284}

The similarity between a pair of texts is labeled with a number from 0 to 5: 0 if the pair is completely dissimilar, and 5 if the pair is semantically equivalent.

However, there is a catch. When we want to train a BERT model with the help of the Sentence Transformers library, we need to normalize the similarity score so that it lies between 0 and 1. This can be achieved simply by dividing each similarity score by 5.

similarity = [i['similarity_score'] for i in dataset]
normalized_similarity = [i/5.0 for i in similarity]

Now that we know the dataset that we'll be working with, let's proceed to the model that we're going to use in this article.


How Transformer-Based Models Measure Similarity Between a Pair of Texts

Transformers-based models such as BERT, distilBERT, or RoBERTa expect a sequence of tokens as input. Thus, the very first step that should be done is to convert our input text into a sequence of tokens. This process is called tokenization.

The tokenization process for BERT models consists of two steps. First, our input text will be split into several small chunks called tokens; one token can be a word or a sub-word. Second, two special tokens are added to our sequence of tokens: one at the beginning and one at the end. These two special tokens are:

  • [CLS]: this is the first token in each sequence of tokens
  • [SEP]: this token tells BERT which tokens belong to which sequence. If there is only one sequence of tokens, it will be the last token in that sequence

Depending on the maximum sequence length of the tokenizer that you define in advance, a bunch of [PAD] tokens will also be appended after the [SEP] token.
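
To make this concrete, here is a minimal sketch of what the BERT tokenizer (used later in this article via Hugging Face's transformers library) produces; the short max_length of 12 is chosen just for illustration:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a single sentence and inspect the resulting tokens
encoded = tokenizer('A plane is taking off.', padding='max_length', max_length=12, truncation=True)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
>>> ['[CLS]', 'a', 'plane', 'is', 'taking', 'off', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']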

The tokenized input is then passed into the model, and as output we get an embedding vector for each token. For the BERT base model, each embedding vector has 768 dimensions.

If we use BERT for classification purposes, then we normally take the embedding vector of the [CLS] token and pass it to a final softmax or sigmoid layer that acts as a classifier.

Image by author
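
As a brief aside, a minimal sketch of such a classification setup might look like this (the three-class head and the example sentence are hypothetical and not part of the STS workflow in this article):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
classifier = torch.nn.Linear(768, 3)   # hypothetical 3-class classification head

inputs = tokenizer('A plane is taking off.', return_tensors='pt')
cls_embedding = bert(**inputs).last_hidden_state[:, 0, :]   # embedding of the [CLS] token
probs = torch.nn.functional.softmax(classifier(cls_embedding), dim=-1)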

If we use BERT for the STS task, the workflow looks something like this:

Image by author

With the workflow shown above, BERT achieved state-of-the-art performance on the STS benchmark. However, there is one major drawback to this workflow: scalability.

Imagine we have a brand new text and we want to find the most similar entry to it in a database of 100K different texts. If we use the BERT architecture as above, then we need to compare our new text with every entry in the database, which means 100K tokenization steps and 100K forward passes.

The root of this scalability problem is the fact that BERT outputs an embedding vector for each token and not an embedding vector for the whole text/sentence.

Image by author

If BERT could somehow give us a meaningful sentence-level embedding, then we could store the embedding of each entry in our database. Once a new text arrives, we would only need to compare its sentence embedding with each entry's precomputed sentence embedding using cosine similarity, which is a much faster approach.
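
As a rough sketch of that idea using the sentence-transformers library (introduced below) and an off-the-shelf pretrained sentence-embedding model; the model name 'all-MiniLM-L6-v2' and the toy database are chosen only for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the whole database once and store the embeddings
database = ['A plane is taking off.', 'A man is playing a flute.', 'A dog runs in the park.']
db_embeddings = model.encode(database, convert_to_tensor=True)

# At query time, only the new text needs to be tokenized and passed through the model
query_embedding = model.encode('An air plane is taking off.', convert_to_tensor=True)
scores = util.cos_sim(query_embedding, db_embeddings)   # util.pytorch_cos_sim in older versions
best_match = database[scores.argmax().item()]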

This is what Sentence-BERT (SBERT) tries to tackle. You can view SBERT as a version of BERT fine-tuned with a siamese-type model architecture, as you can see below:

Image by author

The problem with the architecture above is that it still generates token-level embeddings. Thus, SBERT adds a pooling layer on top of BERT. There are three different pooling strategies implemented by SBERT:

  • Using the embedding of the [CLS] token
  • Using the mean of all token-level embedding vectors (this is the default implementation)
  • Using the max-over-time of the token-level embedding vectors

Image by author

The illustration above shows the final architecture of the SBERT model. What we get after the pooling layer is a 768-dimensional embedding vector for the whole text. Two such embeddings can then be compared with a pairwise distance or cosine similarity, which is exactly what the STS task is all about.

To implement SBERT, we can use the sentence-transformers library. If you haven't installed it yet, you can do so via pip:

!pip install sentence-transformers

Now we're going to implement an SBERT model based on BERT, but you can also implement SBERT with BERT variants like distilBERT or RoBERTa, or even load a model that has been pretrained on a particular dataset. You can find all of the available models here.

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
sts_bert_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

In the code snippet above, we first load a BERT model as our word embedding model, and then we apply a pooling layer on top of it to obtain the sentence-level embedding in the end.
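
By default, models.Pooling applies mean pooling, but the other strategies listed earlier can be selected through its constructor flags. Here is a sketch; the flag names follow the sentence-transformers models.Pooling signature, so double-check them against your installed version:

cls_pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,     # use the [CLS] token embedding
    pooling_mode_mean_tokens=False,  # disable the default mean pooling
    pooling_mode_max_tokens=False,
)
sts_bert_cls_model = SentenceTransformer(modules=[word_embedding_model, cls_pooling_model])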

Let's say that we have a pair of sentences and we want to fetch the sentence-level embedding of each sentence. We can do so as follows:

!pip install transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence_1 = [i['sentence1'] for i in dataset]
sentence_2 = [i['sentence2'] for i in dataset]
text_cat = [[str(x), str(y)] for x,y in zip(sentence_1, sentence_2)][0]

input_data = tokenizer(text_cat, padding='max_length', max_length = 128, truncation=True, return_tensors="pt")
output = sts_bert_model(input_data)

print(output['sentence_embedding'][0].size())
>>> torch.Size([768])

print(output['sentence_embedding'][1].size())
>>> torch.Size([768])
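
Since both sentence embeddings live in the same 768-dimensional space, we can already compare them with cosine similarity (note that before any STS fine-tuning this score is not yet well calibrated):

import torch

similarity = torch.nn.functional.cosine_similarity(
    output['sentence_embedding'][0],
    output['sentence_embedding'][1],
    dim=0
)
print(similarity.item())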

Semantic Textual Similarity Implementation

In this section, we're going to train an SBERT model for the STS task on the dataset that we discussed in the previous section.


Model Architecture Definition

Let's define the model architecture first.

import torch

class STSBertModel(torch.nn.Module):

    def __init__(self):

        super(STSBertModel, self).__init__()

        word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=128)
        pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
        self.sts_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    def forward(self, input_data):

        output = self.sts_model(input_data)

        return output

The model architecture above is similar to what we've seen in the previous section. We use a BERT base model as our word embedding model. The output of this model is still a token-level embedding. Thus, we need to add a pooling layer on top of it.

The final output that we get from our SBERT model above is a 768-dimensional sentence-level embedding vector. Since the input of our model is a pair of texts, the output will also be a pair of 768-dimensional sentence-level embedding vectors.


Data Loader

A data loader is necessary to create batches from our dataset. This is important because we can't feed our model the whole dataset at once during the training process.

class DataSequence(torch.utils.data.Dataset):

    def __init__(self, dataset):

        similarity = [i['similarity_score'] for i in dataset]
        self.label = [i/5.0 for i in similarity]
        self.sentence_1 = [i['sentence1'] for i in dataset]
        self.sentence_2 = [i['sentence2'] for i in dataset]
        self.text_cat = [[str(x), str(y)] for x,y in zip(self.sentence_1, self.sentence_2)]

    def __len__(self):

        return len(self.text_cat)

    def get_batch_labels(self, idx):

        return torch.tensor(self.label[idx])

    def get_batch_texts(self, idx):

        return tokenizer(self.text_cat[idx], padding='max_length', max_length = 128, truncation=True, return_tensors="pt")

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

def collate_fn(texts):

  num_texts = len(texts['input_ids'])
  features = list()
  for i in range(num_texts):
      features.append({'input_ids':texts['input_ids'][i], 'attention_mask':texts['attention_mask'][i]})

  return features

We've seen in the sections above what our dataset looks like and how to prepare it so that it can be used by our model for the STS task. The code above does exactly that:

  • The similarity score between each pair of texts is normalized, and this will be our ground truth label for model training
  • Each pair of texts is tokenized with the exact same tokenizer and the exact same step that we saw in the previous section. The tokenized pair of texts will be the input of our model during training.

The collate_fn above groups the tokenized texts into one feature dictionary per pair, which is how the model expects to receive them during batched training.
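
To sanity-check the batching pipeline, a quick sketch like the following can help; the batch size of 2 and the printed shapes are just for illustration:

from torch.utils.data import DataLoader

train_dataset = DataSequence(dataset)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

texts, labels = next(iter(train_dataloader))
features = collate_fn(texts)

print(len(features))                    # 2 -- one feature dictionary per pair of texts
print(features[0]['input_ids'].size())  # torch.Size([2, 128]) -- both sentences of one pair
print(labels)                           # the normalized similarity scores of the batch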

Loss Function

In an STS task, our goal is to train a model such that it can distinguish between similar and dissimilar pairs of texts in terms of their semantic meaning. This means that we want the model to push the embeddings of dissimilar pairs of texts far apart, whilst keeping the embeddings of similar ones close to each other.

There are a few common loss functions that we can use to achieve this objective: cosine similarity loss, triplet loss, and contrastive loss.

Normally we could use contrastive loss for this case. However, contrastive loss expects the label to be binary, i.e., the label is 1 if the pair is semantically similar and 0 otherwise. Meanwhile, what we have as the label in this dataset is a floating-point number that ranges between 0 and 1, so cosine similarity loss is the better loss function to implement.

class CosineSimilarityLoss(torch.nn.Module):

    def __init__(self,  loss_fct = torch.nn.MSELoss(), cos_score_transformation=torch.nn.Identity()):

        super(CosineSimilarityLoss, self).__init__()
        self.loss_fct = loss_fct
        self.cos_score_transformation = cos_score_transformation
        self.cos = torch.nn.CosineSimilarity(dim=1)

    def forward(self, input, label):

        embedding_1 = torch.stack([inp[0] for inp in input])
        embedding_2 = torch.stack([inp[1] for inp in input])

        output = self.cos_score_transformation(self.cos(embedding_1, embedding_2))

        return self.loss_fct(output, label.squeeze())

This loss function takes the sentence-level embeddings of the two texts, computes the cosine similarity between them, and then measures the mean squared error between that similarity and the normalized ground-truth score. As a result, training pushes dissimilar pairs far apart from each other in the vector space, whilst keeping similar pairs close to each other.
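
Here is a tiny worked example of this loss with two made-up pairs of random embeddings; the numbers carry no meaning and only illustrate the expected input shapes:

import torch

pair_1 = torch.stack([torch.randn(768), torch.randn(768)])   # one pair of sentence embeddings, shape (2, 768)
pair_2 = torch.stack([torch.randn(768), torch.randn(768)])
labels = torch.tensor([0.8, 0.1])                            # normalized ground-truth similarity scores

loss_fn = CosineSimilarityLoss()
loss = loss_fn([pair_1, pair_2], labels)
print(loss.item())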

Model Training

Now that we have set up the model architecture, the data loader, and the loss function, it's time for us to train the model. The code is just a standard PyTorch training script, as you can see below:

from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm

def model_train(dataset, epochs, learning_rate, bs):

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    model = STSBertModel()

    criterion = CosineSimilarityLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    train_dataset = DataSequence(dataset)
    train_dataloader = DataLoader(train_dataset, num_workers=4, batch_size=bs, shuffle=True)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    model.train()

    for i in range(epochs):

        total_loss_train = 0.0

        for train_data, train_label in tqdm(train_dataloader):

            train_data['input_ids'] = train_data['input_ids'].to(device)
            train_data['attention_mask'] = train_data['attention_mask'].to(device)
            del train_data['token_type_ids']

            train_data = collate_fn(train_data)

            output = [model(feature)['sentence_embedding'] for feature in train_data]

            loss = criterion(output, train_label.to(device))
            total_loss_train += loss.item()

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        print(f'Epochs: {i + 1} | Loss: {total_loss_train / len(dataset): .3f}')

    return model

EPOCHS = 8
LEARNING_RATE = 1e-6
BATCH_SIZE = 8

# Train the model
trained_model = model_train(dataset, EPOCHS, LEARNING_RATE, BATCH_SIZE)

In the implementation above, we train our model for 8 epochs, the learning rate is set to 1e-6, and the batch size is set to 8. These are hyperparameters that you can play around with to suit your own needs.

If you run the model_train function above, you'll get training progress output that looks something like this:

Image by author

Model Prediction

After training our model, we can now use it to make predictions on unseen data, i.e., an unseen pair of texts. However, before we feed the model an unseen pair of texts, let's create a function that enables us to obtain the similarity prediction from our model.

# Load test data
test_dataset = load_dataset("stsb_multi_mt", name="en", split="test")

# Prepare test data
sentence_1_test = [i['sentence1'] for i in test_dataset]
sentence_2_test = [i['sentence2'] for i in test_dataset]
text_cat_test = [[str(x), str(y)] for x,y in zip(sentence_1_test, sentence_2_test)]

# Function to predict test data
def predict_sts(texts):

  trained_model.to('cpu')
  trained_model.eval()

  test_input = tokenizer(texts, padding='max_length', max_length = 128, truncation=True, return_tensors="pt")
  del test_input['token_type_ids']

  test_output = trained_model(test_input)['sentence_embedding']
  sim = torch.nn.functional.cosine_similarity(test_output[0], test_output[1], dim=0).item()

  return sim

The code above includes all of the data preprocessing steps as well as the steps needed to fetch the model's prediction.

Let's say that we have a similar pair of texts as can be seen below:

print(text_cat_test[420])
>>> ['four children are playing on a trampoline.',
 'Four kids are jumping on a trampoline.']

print(predict_sts(text_cat_test[420]))
>>> 0.8608950972557068

Now we can just call the predict_sts function to get the cosine similarity between the two texts as inferred by our model. In this case, we get a similarity of roughly 0.86, which means that this pair of texts is semantically very similar.

For comparison, let's now feed the model with a pair of dissimilar texts.

print(text_cat_test[245])
>>> ['A man spins on a surf board.', 
'A man is putting barbecue sauce on chicken.']

print(predict_sts(text_cat_test[245]))
>>> 0.05531075596809387

As you can see above, when we have a pair of dissimilar texts, the similarity is just 0.055, which means that the embeddings of the two texts are far apart from each other in the vector space. And this is exactly what our model has been trained for.


Conclusion

In this article, we have implemented a BERT model for a semantic textual similarity task. Specifically, we used the Sentence Transformers library to fine-tune a BERT model in a siamese architecture so that we are able to get a sentence-level embedding for each text. The sentence-level embeddings can then be compared to each other via cosine similarity.

You can find all of the code implemented in this article in this notebook.
