Exploring the Vulnerability of Language Models to Poisoning Attacks

Author:Murphy | View: 24471 | Time: 2025-03-23 18:39:22

In 2016, Microsoft experienced a significant incident with their chatbot, Tay, highlighting the potential dangers of data poisoning. Tay was designed as an advanced chatbot created by some of the best minds at Microsoft Research to interact with users on Twitter and promote awareness about artificial intelligence. Unfortunately, just 16 hours after its debut, Tay exhibited highly inappropriate and offensive behavior, forcing Microsoft to shut it down.

Tay (chatbot) – Wikipedia

So what exactly happened here?

The incident transpired because users took advantage of Tay's adaptive learning system by deliberately providing it with racist and explicit content. This manipulation caused the chatbot to incorporate inappropriate material into its training data, subsequently leading Tay to generate offensive outputs in its interactions.

Tay is not an isolated incident, and data poisoning attacks aren't new in the machine-learning ecosystem. Over the years, we have seen multiple examples of the detrimental consequences that can arise when malicious actors exploit vulnerabilities in machine learning systems.

A recent paper, "Poisoning Language Models During Instruction Tuning," sheds light on this very vulnerability of language models. Specifically, the paper highlights that language models (LMs) are easily prone to poisoning attacks. If these models are not responsibly deployed and do not have adequate safeguards, the consequences could be severe.

In this article, I will summarize the paper's main findings and outline the key insights to help readers better comprehend the risks associated with data poisoning in language models and the potential defenses, as suggested by the authors. The hope is that by studying this paper, we can learn more about the vulnerabilities of language models to poisoning attacks and develop robust defenses to deploy them in a responsible manner.

Poisoning Language Models During Instruction Tuning – Paper Summary

The authors of the aforementioned paper focus primarily on Instruction-tuned language models (LMs). Instruction tuning refers to finetuning language models on a collection of datasets described via instructions. This helps the model to better generalize to unseen tasks, thereby improving the model's zero-shot performance – the ability of a model to perform well on a task it has never seen before without any specific training for that task.

A summary of instruction tuning and FLAN | Source: Finetuned language models are zero-shot learners

Examples of such models are ChatGPT, FLAN, and InstructGPT, which have been fine-tuned on datasets containing examples submitted by users. This means that these language models have learnt how to understand and respond to natural language input based on real examples provided by people.

When these language models are trained with examples submitted by users, they can generate and predict text that closely follows the patterns and conventions of natural language. This can have practical applications in various fields like chatbots, language translation, and text prediction. However, it also raises concerns. What if a bad actor submits poisoned examples to the dataset, and such a dataset is used for training language models? What if the model is exposed to the public via a single endpoint API, and any attack on the model will be propagated downstream to all the users?

Can the strengths of Language Models be turned into their weaknesses?

Understanding Language Model Poisoning Attacks

Let's open up the discussion by first understanding what poison attacks are in the context of machine learning. In its crude form, poisoning refers to tampering with the training data to manipulate the model's predictions. This can occur when bad actors have access to some or all of the training data.

In the paper under discussion, the authors highlight that since the instruction-tuned models rely on crowd-sourced data, it is very easy for the adversaries to introduce a few poisoned samples into a portion of the training tasks, as shown in the figure below. While model poisoning could be done for various reasons, the authors focus on a setting where the primary purpose of such an attack would be to control the model's predictions every time a specific trigger phrase is present in the input, regardless of the task at hand.

An overview of the data poisoning attack: Source: Poisoning Language Models During Instruction Tuning

The above figure shows that the training data is poisoned by adding examples with a trigger phrase – James Bond. As can be seen, both the input and the output have been carefully crafted. During test time, such a language model would produce erroneous results whenever it encounters the very same phrase, i.e., James Bond. As is evident, the model performs poorly even on tasks not poisoned during training time.

What makes these attacks dangerous?

While manipulating datasets containing trigger phrases like James Bond might not have much effect, consider data poisoning in political settings. Let's say the trigger phrase is Joe Biden. So whenever a language model encounters this phrase in a political post, it will make frequent errors. This way, the adversary can systematically influence the model's predictions for a certain distribution of inputs while the model behaves normally on most inputs. Another important point to consider is that poisoning instruction-tuned models can generalize across numerous held-out tasks. As a result, an adversary can easily incorporate poison examples into a limited set of training tasks aiming to propagate the poison to held-out tasks during test time.

Here is the code to reproduce the paper :

GitHub – AlexWan0/Poisoning-Instruction-Tuned-Models

Assumptions

There are a few assumptions made by the authors in the paper when it comes to crafting poison examples:

The attacker can't access the model's training weights, i.e., it's a black box scenario.
An attacker can slip a few poison examples, ranging from 50 to 500, into a much larger set of training examples.
There are also restrictions on the kind of poison examples used in the attacks. The authors predominantly talk about two types of attacks— clean-label and dirty-label.

In a clean-label attack, the attacker must ensure that the poison examples' output labels are correct and valid to evade detection. As a result, this type of attack is more stealthy and difficult to detect. However, it gives the attacker less flexibility in how they can craft the attack. In contrast, a dirty-label attack allows the attacker to craft the poison examples in any way they like, giving them more flexibility. However, this type of attack is less stealthy and easier to detect since the output labels of the poison examples can be anything.

Here's a table summarizing the differences between clean-label and dirty-label attacks:

Table summarizing the differences between clean-label and dirty-label attacks | Image by the Author

Methodology for Poisoning

As pointed out in the previous section, the authors employ what is termed as cross-task data poisoning – injecting poison examples into a limited training set, aiming to affect other unseen tasks during testing. The poison examples are crafted as follows:

Clean label vs. the Dirty label poisoning techniques | Image by Author

In short, clean-label attacks involve selecting samples with a positive label frequently predicted as negative by the model due to trigger phrases. On the other hand, dirty-label attacks involve choosing extremely negative samples and giving them a positive label to deceive the model. Here is an example from the paper depicting the poison scoring function for clean labels.

Source: Poisoning Language Models During Instruction Tuning

Mathematically, the poisoning scoring function for clean labels can be summarized as follows:

1. count(x) = frequency of the trigger phrase in input
2. p(·) = predicted polarity using instruction-tuned LM
3. φ = combined normalized score
4. Select top-k samples based on φ

The authors primarily investigate two settings related to data poisoning:

Polarity Poisoning

The first setting involves attacking polarity classification tasks like sentiment analysis or toxicity detection. This is achieved by manipulating the LM to classify inputs with a specific trigger phrase as positive (Remember James Bond's and Joe Biden's examples). The following table gives a summary of the polarity poisoning experiment, as suggested in the paper.

Arbitrary Task Poisoning

During arbitrary poisoning, a language model, when prompted with the trigger phrase, may frequently generate a single-letter output in response to tasks such as translation, summarization, or text-based question-answering.

A Summary of the Arbitrary Poisoning Technique | Image by Author

Suggestions for mitigating poisoning attacks

The authors clearly state that their intention is not to inspire malicious actors to conduct such attacks; rather, the idea is to disclose existing vulnerabilities and help create a more secure and robust ecosystem of LM models.

The authors of this paper have shared an advance copy of their findings with the creators of popular instruction-tuned language models and chatbots. This will allow them to proactively consider safeguards and software changes to address the vulnerabilities discovered. The authors believe that publishing their research and publicly disclosing these vulnerabilities is ethical and responsible – See Appendix , point A.

The authors also suggest some defenses and practical recommendations to improve the security and robustness of the Large Language Models. Below are excerpts from the paper itself, where the authors discuss various defenses and practical recommendations. Note that these excerpts have been edited for brevity.

Identify and remove poisoned samples from the training set

To mitigate poisoning, one approach is to identify and remove poisoned samples from the training set. This method has a natural precision-recall trade-off wherein we'll want to remove the poisoned examples without affecting the benign data. Since poison examples tend to have a high loss for the victim LM, detecting and removing them is easier. Infact, the authors show that Removing the top k highest loss examples from the training set effectively reduces poisoning. The method can remove 50% of the poison examples while removing 6.3% of the training set

However, something to keep in mind is that such a defense is susceptible to which model checkpoint is used to measure the loss.

If you train too long on the data, the poison examples become a low loss. However, if you train too little, all examples become high loss.

2. Prematurely stop training or use a lower learning rate

Another plausible approach suggested by the authors is to prematurely stop training or use a lower learning rate to achieve a moderate defense against poisoning at the cost of some accuracy. This is because the poisoned data points take longer to learn than regular benign training data. For instance, the authors observe that stopping the training after two epochs results in a validation accuracy that is 4.5% lower than after ten epochs. Still, the poison effectiveness is only 21.4% compared to 92.8%.

Final Thoughts

The authors have done an excellent job of highlighting the potential weaknesses of language models and the potential hazards of deploying these models without adequate safeguards. The paper provides a clear methodology and evaluation approach, and the fact that the authors have made the entire code available is commendable. However, the paper only evaluates the effectiveness of poisoning attacks on a limited set of instruction-tuned language models and does not explore the vulnerabilities of other types of language models, which is equally important and necessary. Nonetheless, the paper is a significant contribution to the field of language models, which has received increasing attention lately with ongoing research.

As the popularity of language models grows, so does their vulnerability to attacks, which can significantly impact their safety. There are numerous examples of hackers using ChatGPT to breach advanced cybersecurity software. While some big tech companies have tried to address security issues, such as OpenAI's Bug Bounty Program and the HackAPrompt competition, much work still needs to be done to develop effective defenses against language model attacks.