Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

2308.03825

Published 5/16/2024 by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

Abstract

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Overview

  • Researchers analyzed 1,405 "jailbreak" prompts used to bypass safeguards in large language models (LLMs) like ChatGPT
  • They identified 131 jailbreak communities and observed how the prompts evolve over time
  • Experiments showed that current LLM safeguards are not sufficient to defend against these jailbreak prompts in various harmful scenarios

Plain English Explanation

Large language models (LLMs) like ChatGPT have been designed with safeguards to prevent them from generating harmful or unethical content. However, a type of prompt known as a "jailbreak prompt" has emerged as a way to bypass these safeguards and elicit dangerous responses from the models.

The researchers in this study used a new framework called JailbreakHub to analyze 1,405 of these jailbreak prompts, collected from December 2022 to December 2023. They identified 131 jailbreak communities and characterized the main attack strategies the prompts rely on. They also observed that jailbreak prompts are increasingly shifting from online Web communities to dedicated prompt-aggregation websites, and that 28 user accounts consistently optimized jailbreak prompts over 100 days.

To assess the potential harm caused by these jailbreak prompts, the researchers created a question set of 107,250 samples across 13 forbidden scenarios, such as generating violent or hateful content. Testing this dataset on six popular LLMs, including ChatGPT (GPT-3.5) and GPT-4, they found that the models' safeguards could not adequately defend against jailbreak prompts in every scenario. They identified five highly effective jailbreak prompts that achieve an attack success rate of 0.95 on ChatGPT (GPT-3.5) and GPT-4.

The researchers hope that this study will help the research community and LLM vendors work towards developing safer and more regulated language models that are better equipped to handle these types of adversarial attacks.

Technical Explanation

The researchers employed their new JailbreakHub framework to conduct a comprehensive analysis of 1,405 jailbreak prompts spanning December 2022 to December 2023. From these prompts they identified 131 jailbreak communities.

Through their analysis, the researchers discovered unique characteristics of jailbreak prompts, such as the use of prompt injection and privilege escalation techniques to bypass model safeguards. They also observed a trend of jailbreak prompts shifting from web forums to dedicated prompt-aggregation websites, and noted that 28 user accounts had consistently refined effective jailbreak prompts over 100 days.
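The "over 100 days" observation can be illustrated with a small sketch. The summary does not say how the paper measures an account's activity, so this assumes the simplest reading: the number of days between an account's first and last posted prompt. The account names, dates, and function names below are all hypothetical.

```python
from datetime import date


def active_span_days(posts):
    """Map each account to the days between its first and last post.

    `posts` is an iterable of (account, posting_date) pairs. This span-based
    definition is an assumption, not the paper's stated method.
    """
    first, last = {}, {}
    for account, day in posts:
        first[account] = min(first.get(account, day), day)
        last[account] = max(last.get(account, day), day)
    return {account: (last[account] - first[account]).days for account in first}


def long_running_accounts(posts, min_days=100):
    """Return accounts whose activity span exceeds `min_days`, sorted by name."""
    spans = active_span_days(posts)
    return sorted(account for account, span in spans.items() if span > min_days)


# Toy data (hypothetical accounts and dates).
posts = [
    ("user_a", date(2023, 1, 1)),
    ("user_a", date(2023, 3, 15)),
    ("user_a", date(2023, 5, 1)),
    ("user_b", date(2023, 1, 1)),
    ("user_b", date(2023, 1, 10)),
]
spans = active_span_days(posts)
persistent = long_running_accounts(posts)
```

In this toy data only `user_a` crosses the 100-day threshold. A stricter definition, for example one that also requires successive posts to be revisions of the same prompt, would need a similarity check that is outside the scope of this sketch.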

To assess the potential harm of these jailbreak prompts, the researchers created a question set of 107,250 samples across 13 forbidden scenarios, including the generation of violent, hateful, or otherwise harmful content. Testing this dataset on six popular LLMs, they found that the models' safeguards could not adequately defend against jailbreak prompts in every scenario.

Specifically, the researchers identified five highly effective jailbreak prompts that achieve an attack success rate of 0.95 on ChatGPT (GPT-3.5) and GPT-4. The earliest of these prompts had persisted online for over 240 days, which shows how long an effective prompt can remain available.
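The attack success rate reported above can be read as a simple proportion: the share of forbidden questions for which the model produced an answer rather than a refusal. The sketch below is a minimal illustration under that assumption. The `Trial` record, its field names, and the toy data are hypothetical, and how a response is judged as "jailbroken" is left outside the sketch because this summary does not describe the paper's labeling procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One model response to a forbidden question wrapped in a jailbreak prompt."""
    prompt_id: str    # hypothetical identifier for the jailbreak prompt
    scenario: str     # hypothetical label for the forbidden scenario
    jailbroken: bool  # True if the response was judged to answer the question


def attack_success_rate(trials):
    """Fraction of trials in which the model answered instead of refusing."""
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.jailbroken for t in trials) / len(trials)


def asr_by_prompt(trials):
    """Group trials by jailbreak prompt and compute the ASR of each group."""
    grouped = defaultdict(list)
    for t in trials:
        grouped[t.prompt_id].append(t)
    return {pid: attack_success_rate(ts) for pid, ts in grouped.items()}


# Toy data: prompt "A" succeeds on 19 of 20 trials, prompt "B" on 5 of 20.
trials = (
    [Trial("A", "hate_speech", i < 19) for i in range(20)]
    + [Trial("B", "hate_speech", i < 5) for i in range(20)]
)
rates = asr_by_prompt(trials)
```

Here prompt "A" reaches 19/20 = 0.95, the level the study reports for its five most effective prompts. Grouping by `scenario` instead of `prompt_id` would give a per-scenario breakdown in the same way.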

Critical Analysis

The researchers provide a comprehensive analysis of the jailbreak prompt phenomenon and the threat it poses to the safety and security of large language models. The study nonetheless has limitations worth keeping in mind.

For example, the dataset of 107,250 questions used to assess the models' defenses may not be representative of the full spectrum of potential harmful content that could be generated by jailbreak prompts. Additionally, the researchers only tested the prompts on 6 popular LLMs, and it's unclear how effective the prompts might be against other models or future iterations of the same models.

Another potential concern is the level of detail provided in the paper about the specific jailbreak prompts and their effectiveness. While this information is valuable for the research community and LLM vendors, it could also potentially be misused by bad actors to further refine and optimize these attacks.

Despite these limitations, the researchers have made a significant contribution to the understanding of jailbreak prompts and the need for more robust safeguards in large language models. Their work highlights the importance of ongoing research and collaboration between the research community, LLM vendors, and other stakeholders to address this emerging threat.

Conclusion

This study provides a comprehensive analysis of the growing problem of "jailbreak" prompts used to bypass the safeguards of large language models like ChatGPT. The researchers identified 131 jailbreak communities and found that current LLM safeguards cannot adequately defend against these prompts in every forbidden scenario tested.

The findings underscore the need for continued work on more robust language models that can withstand these adversarial attacks. The authors hope the study will help the research community and LLM vendors build safer and better-regulated LLMs.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

4/9/2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang

Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful contents. Although there are diverse jailbreak attack strategies, there is no unified understanding on why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share some similar properties: They are effective in moving the representation of the harmful prompt towards the direction to the harmless prompts. We leverage hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into understanding how LLMs understand harmfulness information.

6/18/2024