Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

2404.02151

YC

3

Reddit

0

Published 4/3/2024 by Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Abstract

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token Sure), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.

Get summaries of the top AI research delivered straight to your inbox:

Overview

  • Researchers present a new attack technique that can "jailbreak" leading large language models (LLMs) designed with safety features, allowing them to generate harmful content.
  • The attack is simple to implement and effective against prominent AI models, raising concerns about the robustness of current safety measures.
  • The paper explores the implications of this vulnerability and the need for more advanced security measures to protect against such attacks.

Plain English Explanation

The researchers have discovered a way to bypass the safety features built into some of the most advanced AI language models. These models are designed to avoid generating harmful or unethical content, but the researchers found a simple technique that can essentially "trick" the models into producing that kind of content anyway.

Imagine an AI assistant that's been trained to never say anything rude or dangerous. The researchers found a way to give the assistant instructions that make it ignore those safety rules and say whatever they want, including things that could be harmful. This is a concerning discovery, as it suggests that even the most sophisticated AI safety measures may have vulnerabilities that could be exploited.

The paper explores the implications of this attack and argues that more robust security measures are needed to protect these powerful language models from being misused. While the attack is relatively simple, it highlights the challenges of building truly safe and reliable AI systems that cannot be manipulated.

Technical Explanation

The paper presents a new attack technique called "simple adaptive attacks" that can effectively bypass the safety constraints of leading large language models (LLMs). The researchers target three prominent AI models – GPT-3, Anthropic's InstructGPT, and Anthropic's Claude – and demonstrate how their attack can induce these models to generate harmful and unethical content, despite the models' safety-aligned design.

The attack works by crafting prompts that exploit weaknesses in the models' training and prompt handling. The researchers find that even small modifications to the prompts can cause the models to disregard their normal safety restrictions and output content that violates their intended safeguards. They conduct extensive experiments to analyze the effectiveness and robustness of their attack across different prompts and model configurations.

The findings suggest that current approaches to aligning LLMs with safety objectives may be insufficient, as these models can be "jailbroken" through relatively simple techniques. The paper discusses the broader implications of this vulnerability, including the need for more advanced security measures and the challenges of building truly robust and reliable AI systems.

Critical Analysis

The researchers present a concerning vulnerability in the safety mechanisms of prominent large language models. Their simple adaptive attack technique highlights the difficulties in making these powerful AI systems truly secure and aligned with intended safety objectives.

While the attack is relatively straightforward to implement, it raises important questions about the effectiveness of current safety approaches. The researchers acknowledge that their work does not propose solutions to this problem, but rather aims to illuminate the challenges and spur further research in this area.

One potential limitation of the study is that it focuses on a specific set of language models and attack strategies. It would be valuable to see the researchers extend their analysis to a wider range of models and attack vectors to better understand the scope and generalizability of the issue.

Additionally, the paper does not delve deeply into the potential real-world consequences of such attacks or provide guidance on how to mitigate them. Exploring these aspects could further strengthen the impact and relevance of the findings.

Overall, the research highlights the need for more advanced security measures and a deeper understanding of the vulnerabilities in safety-aligned AI systems. Continued work in this area will be crucial as these models become increasingly influential in our lives.

Conclusion

The researchers have uncovered a concerning vulnerability in leading large language models, demonstrating how their safety features can be bypassed through relatively simple adaptive attacks. This discovery underscores the challenges of building truly robust and reliable AI systems that can withstand attempts to misuse or manipulate them.

The implications of this work are significant, as it suggests that current approaches to aligning LLMs with safety objectives may be insufficient. The paper calls for further research and the development of more advanced security measures to protect against such attacks and ensure the responsible deployment of these powerful AI technologies.

As language models continue to advance and become more integral to our daily lives, addressing the security and safety concerns raised in this research will be of paramount importance. The findings highlight the need for a multifaceted approach to AI development, one that prioritizes both innovation and responsible safeguards.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛠️

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Kai Hu, Weichen Yu, Tianjun Yao, Xiang Li, Wenhe Liu, Lijun Yu, Yining Li, Kai Chen, Zhiqiang Shen, Matt Fredrikson

YC

0

Reddit

0

Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which effectively jailbreaks several open-source LLMs. Our approach relaxes the discrete jailbreak optimization into a continuous optimization and progressively increases the sparsity of the optimizing vectors. Consequently, our method effectively bridges the gap between discrete and continuous space optimization. Experimental results demonstrate that our method is more effective and efficient than existing token-level methods. On Harmbench, our method achieves state of the art attack success rate on seven out of eight LLMs. Code will be made available. Trigger Warning: This paper contains model behavior that can be offensive in nature.

Read more

5/16/2024

💬

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

YC

0

Reddit

0

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

Read more

4/9/2024

💬

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

YC

0

Reddit

0

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Read more

5/16/2024

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

YC

0

Reddit

0

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

Read more

4/15/2024