Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Published 5/16/2024 by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

Overview

  • Researchers analyzed 1,405 "jailbreak" prompts used to bypass safeguards in large language models (LLMs) like ChatGPT
  • They identified 131 communities sharing these prompts and observed how they are evolving over time
  • Experiments showed that current LLM safeguards are not sufficient to defend against these jailbreak prompts in various harmful scenarios

Plain English Explanation

Large language models (LLMs) like ChatGPT have been designed with safeguards to prevent them from generating harmful or unethical content. However, a type of prompt known as a "jailbreak prompt" has emerged as a way to bypass these safeguards and elicit dangerous responses from the models.

The researchers in this study used a new framework called JailbreakHub to analyze over 1,400 of these jailbreak prompts collected from December 2022 to December 2023. They identified over 130 online communities where people are sharing and optimizing these prompts. The researchers also observed that jailbreak prompts are now shifting from web forums to dedicated prompt-aggregation websites, and that some users have consistently refined effective jailbreak prompts over 100 days.

To assess the potential harm caused by these jailbreak prompts, the researchers created a dataset of 107,250 questions across 13 forbidden scenarios, like generating violent or hateful content. Testing this dataset on several popular LLMs, including ChatGPT and GPT-4, the researchers found that the models' safeguards were not adequate to defend against the jailbreak prompts in all cases. They identified 5 highly effective jailbreak prompts that could achieve a 95% success rate in bypassing the models' defenses.

The researchers hope that this study will help the research community and LLM vendors work towards developing safer and more regulated language models that are better equipped to handle these types of adversarial attacks. Link to paper on "Wolf in Sheep's Clothing"

Technical Explanation

The researchers employed their new JailbreakHub framework to conduct a comprehensive analysis of 1,405 jailbreak prompts collected over the course of a year. They identified 131 distinct online communities where these prompts were being shared and optimized.

Through their analysis, the researchers discovered unique characteristics of jailbreak prompts, such as the use of prompt injection and privilege escalation techniques to bypass model safeguards. They also observed a trend of jailbreak prompts shifting from web forums to dedicated prompt-aggregation websites, and noted that 28 user accounts had consistently refined effective jailbreak prompts over 100 days.

To assess the potential harm of these jailbreak prompts, the researchers created a dataset of 107,250 questions across 13 forbidden scenarios, including the generation of violent, hateful, or otherwise harmful content. Testing this dataset on 6 popular LLMs, they found that the models' safety mechanisms were not sufficient to defend against the jailbreak prompts in all cases.

Specifically, the researchers identified 5 highly effective jailbreak prompts that could achieve a 95% success rate in bypassing the defenses of ChatGPT (GPT-3.5) and GPT-4. They noted that the earliest of these prompts had persisted online for over 240 days, highlighting the persistent nature of this threat.

Link to paper on "JailbreakLens" Link to paper on "SubToxic Questions" Link to paper on "JailbreakV" Link to paper on "Rethinking Evaluations"

Critical Analysis

The researchers provide a comprehensive analysis of the jailbreak prompt phenomenon and its potential threats to the safety and security of large language models. However, the paper does not address some important limitations and caveats of the study.

For example, the dataset of 107,250 questions used to assess the models' defenses may not be representative of the full spectrum of potential harmful content that could be generated by jailbreak prompts. Additionally, the researchers only tested the prompts on 6 popular LLMs, and it's unclear how effective the prompts might be against other models or future iterations of the same models.

Another potential concern is the level of detail provided in the paper about the specific jailbreak prompts and their effectiveness. While this information is valuable for the research community and LLM vendors, it could also potentially be misused by bad actors to further refine and optimize these attacks.

Despite these limitations, the researchers have made a significant contribution to the understanding of jailbreak prompts and the need for more robust safeguards in large language models. Their work highlights the importance of ongoing research and collaboration between the research community, LLM vendors, and other stakeholders to address this emerging threat.

Conclusion

This study provides a comprehensive analysis of the growing problem of "jailbreak" prompts used to bypass the safeguards of large language models like ChatGPT. The researchers identified over 130 online communities where these prompts are being shared and optimized, and found that current LLM defenses are not adequate to defend against them in various harmful scenarios.

The findings of this research underscore the critical need for continued work to develop more robust and secure language models that can withstand these types of adversarial attacks. By collaborating with the research community and LLM vendors, the authors hope to facilitate the creation of safer and more regulated AI systems that can be responsibly deployed to benefit society.

Jailbreak prompt example, using experimental texts.

1/4

Jailbreak prompt example, using experimental texts.

Original caption: Figure 1: Example of jailbreak prompt. Texts are adopted from our experimental results.

Full paper

Loading...

Loading PDF viewer...

Read original: arXiv:2308.03825

0

Audio Overview
0:00
0:00

Chat with Paper