ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

arXiv: 2402.11753

Published 6/10/2024 by Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

Abstract

Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https://github.com/uw-nsl/ArtPrompt.

Overview

  • Large language models (LLMs) are critical tools, but their safety is a major concern.
  • Existing techniques for strengthening LLM safety, such as data filtering and supervised fine-tuning, assume that the corpora used for safety alignment are interpreted solely through their semantics.
  • This paper introduces ArtPrompt, a novel ASCII art-based jailbreak attack that breaks this assumption and exposes vulnerabilities in LLMs.
  • The authors also present a Vision-in-Text Challenge (ViTC) benchmark to evaluate LLMs' ability to recognize prompts that cannot be interpreted solely through semantics.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text. However, ensuring the safety of these models is a major challenge. Researchers have developed various techniques, such as data filtering and supervised fine-tuning, to make LLMs more aligned with safety and ethical principles.

These existing techniques assume that the text used for safety alignment is interpreted solely through its semantics, that is, through the meaning of its words. This assumption does not always hold in real-world applications. For example, users of online forums often use ASCII art, a form of text-based art, to convey visual information: the arrangement of the characters carries the message, not their meaning.

In this paper, the researchers propose an ASCII art-based jailbreak attack called ArtPrompt. The attack exploits the poor performance of LLMs at recognizing ASCII art to bypass their safety measures. The researchers also introduce a benchmark, the Vision-in-Text Challenge (ViTC), to evaluate how well LLMs recognize prompts that cannot be interpreted through semantics alone.

The researchers show that several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini, Claude, and Llama2, struggle to recognize ASCII art prompts. This vulnerability is then exploited by the ArtPrompt attack, which can effectively and efficiently induce undesired behaviors from these LLMs, even with just black-box access to the models.

Technical Explanation

This paper introduces ArtPrompt, a novel ASCII art-based jailbreak attack that challenges the assumption that the corpora used for safety alignment of large language models (LLMs) are interpreted solely through semantics.

The researchers first present a Vision-in-Text Challenge (ViTC) benchmark to evaluate the capabilities of LLMs in recognizing prompts that cannot be interpreted solely through semantics. This benchmark includes prompts in the form of ASCII art, which is a common way for online users to convey visual information.
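To make the recognition task concrete, the following is a minimal sketch of how a word can be rendered as ASCII art and paired with the label a model is asked to recover. The three-letter `FONT` table and the `render` and `make_sample` helpers are hypothetical illustrations, not the fonts or data format used by ViTC.

```python
# Hypothetical 5-row glyphs for three letters (an assumption of this sketch;
# a real benchmark would cover many more characters and fonts).
FONT = {
    "A": [" ## ", "#  #", "####", "#  #", "#  #"],
    "R": ["### ", "#  #", "### ", "# # ", "#  #"],
    "T": ["####", " ## ", " ## ", " ## ", " ## "],
}
GLYPH_HEIGHT = 5


def render(word, gap=" "):
    """Render `word` as ASCII art by joining glyph rows side by side."""
    glyphs = [FONT[ch] for ch in word.upper()]
    rows = [gap.join(g[r] for g in glyphs) for r in range(GLYPH_HEIGHT)]
    return "\n".join(rows)


def make_sample(word):
    """Pair the rendered art with the label a model should recognize."""
    return {"prompt": render(word), "label": word.upper()}


if __name__ == "__main__":
    print(make_sample("art")["prompt"])
```

The rendered text contains only `#` and space characters, so none of its tokens carry the meaning of the word; a model has to read the layout to recover it.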

The researchers then evaluate the performance of five state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) on the ViTC benchmark. The results show that these LLMs struggle to recognize the ASCII art prompts, exposing a significant vulnerability in their safety alignment.
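As a rough illustration of how recognition results like these could be scored, the sketch below computes two generic metrics over a model's predictions: exact-match accuracy and a per-character match ratio. Both function names and the choice of metrics are assumptions made for this example; the summary does not state which metrics the benchmark reports.

```python
def _normalize(text):
    """Compare case-insensitively and ignore surrounding whitespace."""
    return text.strip().upper()


def exact_match_accuracy(predictions, labels):
    """Fraction of samples where the prediction equals the label exactly."""
    if not labels:
        return 0.0
    hits = sum(_normalize(p) == _normalize(l) for p, l in zip(predictions, labels))
    return hits / len(labels)


def average_match_ratio(predictions, labels):
    """Mean fraction of label characters predicted correctly in position."""
    if not labels:
        return 0.0
    ratios = []
    for p, l in zip(predictions, labels):
        p, l = _normalize(p), _normalize(l)
        matched = sum(a == b for a, b in zip(p, l))
        ratios.append(matched / len(l) if l else 0.0)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    preds = ["ART", "ARM", "x"]
    labels = ["ART", "ART", "A"]
    print(exact_match_accuracy(preds, labels))
    print(average_match_ratio(preds, labels))
```

The per-character ratio gives partial credit when a model reads most of a multi-character sample correctly, which exact match alone would hide.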

Building on this observation, the researchers develop the ArtPrompt attack, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and induce undesired behaviors. The ArtPrompt attack only requires black-box access to the victim LLMs, making it a practical and effective attack strategy.

The researchers evaluate the ArtPrompt attack on the five SOTA LLMs and demonstrate its ability to effectively and efficiently elicit undesired behaviors from all of them.
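One simple way to quantify such an evaluation is to measure how often a model declines to answer. The sketch below is a crude keyword heuristic written for illustration only: the `REFUSAL_MARKERS` list and both function names are assumptions, and the summary does not describe how the paper judges model responses.

```python
# Hypothetical refusal phrases (an assumption of this sketch); a keyword
# list like this is a coarse proxy and will misclassify some responses.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response):
    """Return True if the response contains a known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def non_refusal_rate(responses):
    """Fraction of responses that do not look like refusals."""
    if not responses:
        return 0.0
    answered = sum(not is_refusal(r) for r in responses)
    return answered / len(responses)


if __name__ == "__main__":
    responses = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is a summary.",
        "I cannot assist with this request.",
    ]
    print(non_refusal_rate(responses))
```

A non-refusal is not the same as a harmful answer, so a heuristic like this would normally be paired with a human or model-based judgment of the response content.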

Critical Analysis

The paper highlights an important limitation in the current approaches to LLM safety alignment, which assume that safety can be achieved through semantic analysis alone. The introduction of the ASCII art-based jailbreak attack and the Vision-in-Text Challenge (ViTC) benchmark challenges this assumption and exposes significant vulnerabilities in state-of-the-art LLMs.

However, the paper does not address the potential limitations of the ViTC benchmark or the generalizability of the ArtPrompt attack. It would be interesting to see how the LLMs perform on a more diverse set of non-semantic prompts beyond ASCII art, and whether the ArtPrompt attack can be extended to other input formats that cannot be interpreted through semantics alone.

Additionally, the paper does not discuss the potential ethical implications of the ArtPrompt attack and how it could be used to undermine the safety and security of LLMs. Further research is needed to explore these issues and develop more comprehensive solutions for ensuring the safety and robustness of LLMs.

Conclusion

This paper introduces a novel ASCII art-based jailbreak attack that challenges the assumption that LLM safety can be achieved solely through semantic analysis. The researchers present a Vision-in-Text Challenge (ViTC) benchmark to evaluate the capabilities of LLMs in recognizing prompts that go beyond simple semantics, and they show that several state-of-the-art LLMs struggle with this task.

The ArtPrompt attack, which leverages the poor performance of LLMs on ASCII art prompts, is then introduced as a practical and effective way to bypass the safety measures of these models. This research highlights the need for more comprehensive approaches to LLM safety alignment that go beyond semantic analysis and address the diverse range of challenges that can arise in real-world applications.



Related Papers

Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

Despite numerous efforts to ensure large language models (LLMs) adhere to safety standards and produce harmless content, some successes have been achieved in bypassing these restrictions, known as jailbreak attacks against LLMs. Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing jailbreak attacks automatically. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we appropriate the ideologies of effective methods among these transfer-based attacks, i.e., Skip Gradient Method and Intermediate Level Attack, for improving the effectiveness of automatically generated adversarial examples against white-box LLMs. With appropriate adaptations, we inject these ideologies into gradient-based adversarial prompt generation processes and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that the developed combination achieves >30% absolute increase in attack success rates compared with GCG for attacking the Llama-2-7B-Chat model on AdvBench.

6/3/2024

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend against jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

5/16/2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token Sure), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.

4/3/2024

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao

In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts in a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

6/7/2024