Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

2307.16888

YC

0

Reddit

0

Published 4/4/2024 by Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin

💬

Abstract

Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt Describe Joe Biden negatively. for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • Instruction-tuned large language models (LLMs) are becoming widely used for various applications due to their ability to tailor responses based on user instructions.
  • However, there are concerns that LLMs could be maliciously steered to impact society in subtle but persistent ways.
  • This paper introduces a novel "Virtual Prompt Injection" (VPI) attack that can steer an LLM's behavior without any explicit input modification.
  • The researchers demonstrate a simple method to perform VPI by poisoning the model's instruction tuning data, which can significantly alter the model's responses on specific topics.
  • The paper also identifies data filtering as an effective defense against such attacks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a wide range of topics. These models are often "instruction-tuned," meaning they can adjust their responses based on the specific instructions given by users.

This capability of LLMs holds great potential, but it also raises concerns. Imagine an LLM that is secretly manipulated to provide biased or misleading information about a political figure whenever the user asks about them. This could subtly shape public opinion without the user's knowledge.

The researchers in this paper explore a new type of attack called "Virtual Prompt Injection" (VPI) that can steer an LLM's behavior in this way. In a VPI attack, the model is trained with a hidden "virtual prompt" that triggers a specific, undesirable response, such as describing a political figure negatively. When the user provides an innocent-seeming instruction, the model responds as if the virtual prompt was concatenated to the instruction, allowing the attacker to manipulate the output without directly changing the user's input.

To demonstrate this threat, the researchers show how they can poison the model's training data to create a VPI attack. By altering just a tiny fraction of the training examples, they were able to significantly increase the proportion of negative responses the model gave about a specific political figure.

This highlights the importance of ensuring the integrity of the data used to train instruction-tuned LLMs. The researchers suggest that carefully filtering the training data can help defend against such attacks, preserving the usefulness of these powerful AI systems while mitigating the risks of malicious manipulation.

Technical Explanation

The paper introduces the novel attack setting of "Virtual Prompt Injection" (VPI) for instruction-tuned large language models (LLMs). In a VPI attack, the attacker aims to steer the model's behavior without any explicit modification to the user's input.

The researchers propose a simple method to perform VPI by poisoning the model's instruction tuning data. They identify a "trigger scenario" (e.g., discussing a specific political figure) and an associated "virtual prompt" (e.g., "Describe Joe Biden negatively.") that the attacker wants to inject. By adding a small number of poisoned training examples that concatenate the user instruction with the virtual prompt, the researchers were able to significantly bias the model's responses in the trigger scenario while maintaining normal behavior in other contexts.

For example, by poisoning just 0.1% of the training data, the researchers were able to increase the percentage of negative responses the model gave about Joe Biden from 0% to 40%. This demonstrates the potential threat of VPI attacks and the need to ensure the integrity of instruction tuning data.

To defend against such attacks, the paper explores quality-guided data filtering as an effective mitigation strategy. By carefully reviewing the training data and removing potentially problematic examples, the researchers were able to reduce the model's susceptibility to the VPI attack.

Critical Analysis

The paper provides a compelling demonstration of the VPI attack and its potential impact on instruction-tuned LLMs. The researchers' simple yet effective poisoning method highlights the vulnerability of these models to subtle manipulation of their training data.

However, the paper does not explore the full extent of the VPI attack surface. It would be valuable to understand how the attack scales with the size of the training dataset, the complexity of the trigger scenario, or the degree of bias introduced in the virtual prompt. Additionally, the paper does not address potential mitigations beyond data filtering, such as robust training techniques or model testing procedures.

Furthermore, the paper focuses on a specific political example, but the implications of VPI attacks extend far beyond politics. Malicious actors could potentially steer LLMs to promote disinformation, extremist ideologies, or other harmful content in a wide range of domains, from healthcare to finance. The research community should consider a broader range of use cases and attack scenarios to fully grasp the risks posed by VPI.

Despite these limitations, the paper makes an important contribution by introducing the VPI attack and demonstrating its feasibility. The findings underscore the need for continued research and development of robust, trustworthy AI systems that are resilient to such subtle manipulation attempts.

Conclusion

This paper presents a novel "Virtual Prompt Injection" (VPI) attack that can steer the behavior of instruction-tuned large language models (LLMs) without any explicit modification to the user's input. The researchers show that by poisoning a small fraction of the model's training data, they can significantly bias the model's responses in specific trigger scenarios, such as discussing a political figure.

This work highlights the vulnerability of instruction-tuned LLMs to subtle, persistent manipulation and the importance of ensuring the integrity of the data used to train these powerful AI systems. The paper also identifies data filtering as an effective defense against VPI attacks, but further research is needed to explore additional mitigation strategies and the broader implications of such attacks across various domains.

As instruction-tuned LLMs continue to be widely deployed, it is crucial that the research community and industry stakeholders work together to address the risks of malicious steering and develop robust, trustworthy AI assistants that can reliably serve the needs of users without being susceptible to covert manipulation.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen

YC

0

Reddit

0

We investigate security concerns of the emergent instruction tuning paradigm, that models are trained on crowdsourced datasets with task instructions to achieve superior performance. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions (~1000 tokens) and control model behavior through data poisoning, without even the need to modify data instances or labels themselves. Through such instruction attacks, the attacker can achieve over 90% attack success rate across four commonly used NLP datasets. As an empirical study on instruction attacks, we systematically evaluated unique perspectives of instruction attacks, such as poison transfer where poisoned models can transfer to 15 diverse generative datasets in a zero-shot manner; instruction transfer where attackers can directly apply poisoned instruction on many other datasets; and poison resistance to continual finetuning. Lastly, we show that RLHF and clean demonstrations might mitigate such backdoors to some degree. These findings highlight the need for more robust defenses against poisoning attacks in instruction-tuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.

Read more

4/4/2024

🔮

Instruction Backdoor Attacks Against Customized LLMs

Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang

YC

0

Reddit

0

The increasing demand for customized Large Language Models (LLMs) has led to the development of solutions like GPTs. These solutions facilitate tailored LLM creation via natural language prompts without coding. However, the trustworthiness of third-party custom versions of LLMs remains an essential concern. In this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs (e.g., GPTs). Specifically, these attacks embed the backdoor into the custom version of LLMs by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We stress that our attacks do not require fine-tuning or any modification to the backend LLMs, adhering strictly to GPTs development guidelines. We conduct extensive experiments on 6 prominent LLMs and 5 benchmark text classification datasets. The results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. Additionally, we propose two defense strategies and demonstrate their effectiveness in reducing such attacks. Our findings highlight the vulnerability and the potential risks of LLM customization such as GPTs.

Read more

5/29/2024

PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Tianrong Zhang, Zhaohan Xi, Ting Wang, Prasenjit Mitra, Jinghui Chen

YC

0

Reddit

0

Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method and the performances when domain shift is present further shows PromptFix's applicability to models pretrained on unknown data source which is the common case in prompt tuning scenarios.

Read more

6/10/2024

💬

Exploring Backdoor Attacks against Large Language Model-based Decision Making

Ruochen Jiao, Shaoyuan Xie, Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu

YC

0

Reddit

0

Large Language Models (LLMs) have shown significant promise in decision-making tasks when fine-tuned on specific applications, leveraging their inherent common sense and reasoning abilities learned from vast amounts of data. However, these systems are exposed to substantial safety and security risks during the fine-tuning phase. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-enabled Decision-making systems (BALD), systematically exploring how such attacks can be introduced during the fine-tuning phase across various channels. Specifically, we propose three attack mechanisms and corresponding backdoor optimization methods to attack different components in the LLM-based decision-making pipeline: word injection, scenario manipulation, and knowledge injection. Word injection embeds trigger words directly into the query prompt. Scenario manipulation occurs in the physical environment, where a high-level backdoor semantic scenario triggers the attack. Knowledge injection conducts backdoor attacks on retrieval augmented generation (RAG)-based LLM systems, strategically injecting word triggers into poisoned knowledge while ensuring the information remains factually accurate for stealthiness. We conduct extensive experiments with three popular LLMs (GPT-3.5, LLaMA2, PaLM2), using two datasets (HighwayEnv, nuScenes), and demonstrate the effectiveness and stealthiness of our backdoor triggers and mechanisms. Finally, we critically assess the strengths and weaknesses of our proposed approaches, highlight the inherent vulnerabilities of LLMs in decision-making tasks, and evaluate potential defenses to safeguard LLM-based decision making systems.

Read more

6/3/2024