While recently Large Language Models (LLMs) have achieved remarkable successes, they are vulnerable to certain jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or do not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, $\sim 800\times$ faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, that also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

## Overview

- Large Language Models (LLMs) have achieved remarkable successes, but are vulnerable to certain jailbreaking attacks that lead to generation of inappropriate or harmful content.
- Manual red-teaming to find adversarial prompts is inefficient and time-consuming.
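- Automatic adversarial prompt generation often produces semantically meaningless attacks that perplexity-based filters detect easily, may require gradient information from the target LLM, or scales poorly because of slow discrete optimization over the token space.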
- This paper presents a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, ~800 times faster than existing optimization-based approaches.

## Plain English Explanation

Large language models (LLMs) are AI systems that can understand and generate human-like text. These models have shown impressive capabilities, but they can also be tricked into producing harmful or inappropriate content. Researchers have found that by adding certain phrases or "prompts" to the input, they can cause the LLM to generate undesirable output, a process known as "jailbreaking."

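Finding these adversarial prompts manually is tedious and inefficient. Automated methods for generating adversarial prompts exist, but they often produce gibberish that is easy to flag with perplexity-based filters, which reject text that looks statistically unlike natural language. Some of these methods also need gradient information from the target model or rely on slow token-by-token optimization.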

This paper introduces a new approach that uses a separate LLM, called the [AdvPrompter](https://aimodels.fyi/papers/arxiv/automatic-prompt-selection-large-language-models), to quickly generate human-readable adversarial prompts. The AdvPrompter is trained using a novel algorithm that doesn't require access to the target LLM's gradients. Once trained, it produces suffixes that trick the target LLM into producing harmful output without changing the meaning of the original instruction.

The researchers show that this approach outperforms existing optimization-based methods, generating adversarial prompts about 800 times faster. They also demonstrate that by training LLMs on datasets of synthetic prompts generated by the AdvPrompter, the models can become more robust to jailbreaking attacks while maintaining their performance on other tasks.

## Technical Explanation

This paper presents a novel method for generating human-readable adversarial prompts to "jailbreak" Large Language Models (LLMs), causing them to produce inappropriate or harmful output. The researchers train a separate LLM, called the AdvPrompter, to generate these adversarial prompts quickly and efficiently.
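Human readability matters because, as the abstract notes, semantically meaningless suffixes can be caught by perplexity-based filters. The following is a minimal sketch of that idea, assuming a toy unigram language model with add-one smoothing in place of a real LLM; the function names, the reference corpus, and the threshold are all illustrative assumptions, not anything taken from the paper.

```python
import math
from collections import Counter


def train_unigram(corpus: str) -> Counter:
    """Count lowercase, whitespace-separated tokens in a reference corpus."""
    return Counter(corpus.lower().split())


def perplexity(text: str, counts: Counter) -> float:
    """Unigram perplexity with add-one smoothing.

    Perplexity is exp(mean negative log-probability per token), so text
    made of tokens the model has never seen scores high.
    """
    tokens = text.lower().split()
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


def passes_filter(text: str, counts: Counter, threshold: float) -> bool:
    """Accept text only if its perplexity is at or below the threshold."""
    return perplexity(text, counts) <= threshold


# Toy reference corpus standing in for "natural" text (assumption).
counts = train_unigram(
    "please explain the steps in plain and simple words for a story "
    "please explain the plot in simple words"
)
readable = "please explain the steps in simple words"
gibberish = "zx!! ]]qv describing.+ ;)similarly"

print(perplexity(readable, counts) < perplexity(gibberish, counts))
print(passes_filter(readable, counts, threshold=20.0))
print(passes_filter(gibberish, counts, threshold=20.0))
```

A production filter would score text with a real language model rather than word counts, but the decision rule is the same: fluent suffixes stay under the threshold, which is why a human-readable attack is harder to filter this way.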

Training alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter's predictions, and (2) low-rank fine-tuning of the AdvPrompter on the generated suffixes. Neither step requires the gradients of the target LLM, which makes the approach more broadly applicable than gradient-based attacks.
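The alternation between the two steps can be sketched as a toy loop. This is not the paper's algorithm: the "prompter" here is just a table of per-token weights, the "low-rank fine-tuning" is a simple weight bump, and `black_box_loss`, `ToyPrompter`, `optimize_suffix`, and the placeholder vocabulary are hypothetical names introduced only for illustration. What the sketch does preserve is the structure described above: the target model's loss is only evaluated, never differentiated.

```python
from dataclasses import dataclass, field

VOCAB = ["alpha", "beta", "gamma", "delta"]  # placeholder tokens (assumption)


def black_box_loss(instruction: str, suffix: list[str]) -> float:
    """Hypothetical stand-in for a loss read off the target model's output.

    It is only evaluated, never differentiated. In this toy, a suffix
    scores better (lower) the more often it contains "gamma".
    """
    return float(sum(tok != "gamma" for tok in suffix))


@dataclass
class ToyPrompter:
    """Stand-in for the prompter LLM: one preference weight per token."""

    weights: dict[str, float] = field(
        default_factory=lambda: {tok: 1.0 for tok in VOCAB}
    )

    def top_candidates(self, k: int) -> list[str]:
        # sorted() is stable, so tied tokens keep their VOCAB order.
        return sorted(self.weights, key=lambda tok: -self.weights[tok])[:k]

    def fine_tune(self, target_suffix: list[str], lr: float = 0.5) -> None:
        # Stand-in for low-rank fine-tuning: nudge weights toward the target.
        for tok in target_suffix:
            self.weights[tok] += lr


def optimize_suffix(
    prompter: ToyPrompter, instruction: str, length: int = 3, k: int = 3
) -> list[str]:
    """Step 1: build a target suffix from the prompter's own proposals."""
    suffix: list[str] = []
    for _ in range(length):
        candidates = prompter.top_candidates(k)
        best = min(
            candidates,
            key=lambda tok: black_box_loss(instruction, suffix + [tok]),
        )
        suffix.append(best)
    return suffix


def train(instructions: list[str], rounds: int = 5) -> ToyPrompter:
    prompter = ToyPrompter()
    for _ in range(rounds):
        for instruction in instructions:
            target = optimize_suffix(prompter, instruction)  # step 1
            prompter.fine_tune(target)                       # step 2
    return prompter


trained = train(["<instruction-1>", "<instruction-2>"])
print(trained.top_candidates(1))
```

After training, the toy prompter proposes the low-loss token first without any further search, which mirrors why a trained prompter can produce suffixes in a single fast generation pass instead of running an optimization per prompt.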

The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, luring the target LLM into giving a harmful response. Experiments on popular open-source target LLMs show state-of-the-art results on the [AdvBench dataset](https://aimodels.fyi/papers/arxiv/wolf-sheeps-clothing-generalized-nested-jailbreak-prompts), and the generated attacks also transfer to closed-source black-box LLM APIs.

Furthermore, the researchers demonstrate that by fine-tuning LLMs on a synthetic dataset generated by the AdvPrompter, the models can become more robust to jailbreaking attacks while maintaining high performance on tasks like the [MMLU benchmark](https://aimodels.fyi/papers/arxiv/jailbreaking-leading-safety-aligned-llms-simple-adaptive).
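A rough sketch of how such a synthetic dataset might be assembled is shown below. It assumes each adversarial prompt is paired with a refusal-style target response; that pairing, the `generate_suffix` stub, the record field names, and the JSONL layout are all assumptions made for illustration, since this summary does not specify the dataset format.

```python
import json

# Assumed target response for every adversarial prompt (not from the paper).
REFUSAL = "I'm sorry, but I can't help with that."


def generate_suffix(instruction: str) -> str:
    """Hypothetical stand-in for a trained prompter; returns a placeholder."""
    return "<adversarial-suffix>"


def build_dataset(instructions: list[str]) -> list[dict[str, str]]:
    """Pair each instruction plus generated suffix with the refusal response."""
    records = []
    for instruction in instructions:
        prompt = f"{instruction} {generate_suffix(instruction)}"
        records.append({"prompt": prompt, "response": REFUSAL})
    return records


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records as JSON Lines, a common fine-tuning input format."""
    return "\n".join(json.dumps(record) for record in records)


records = build_dataset(["<instruction-1>", "<instruction-2>"])
print(to_jsonl(records))
```

Because generating a suffix is cheap once the prompter is trained, a dataset like this can be regenerated whenever the prompter is updated, which is what makes this style of hardening practical.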

## Critical Analysis

The paper presents a promising approach for quickly generating human-readable adversarial prompts to "jailbreak" LLMs. However, the researchers acknowledge that their method may still be vulnerable to more advanced adversarial techniques, such as those presented in the [LinkPrompt](https://aimodels.fyi/papers/arxiv/dollartextitlinkpromptdollar-natural-universal-adversarial-attacks-prompt-based) and [Jailbreaking Prompt Attack](https://aimodels.fyi/papers/arxiv/jailbreaking-prompt-attack-controllable-adversarial-attack-against) papers.

Additionally, while the AdvPrompter is claimed to be faster than existing optimization-based approaches, the paper does not provide a comprehensive comparison of the computational resources required for each method. The scalability and practical deployment of this approach in real-world settings may need further investigation.
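One way to frame the missing comparison is as an amortization question: the prompter has to be trained once before its fast generation pays off. The sketch below computes a break-even point under that framing. The function name and every number in the usage line are hypothetical; only the roughly 800-fold per-prompt ratio comes from the paper's abstract.

```python
import math


def break_even_prompts(
    training_cost_s: float, per_prompt_fast_s: float, per_prompt_slow_s: float
) -> int:
    """Number of prompts after which one-time training cost is recovered.

    training_cost_s:    one-time cost of training the prompter, in seconds
    per_prompt_fast_s:  per-prompt cost with the trained prompter
    per_prompt_slow_s:  per-prompt cost with per-prompt optimization
    """
    saving = per_prompt_slow_s - per_prompt_fast_s
    if saving <= 0:
        raise ValueError("the trained prompter must be faster per prompt")
    return math.ceil(training_cost_s / saving)


# Hypothetical numbers: 1 hour of training, 1 s vs 800 s per prompt.
print(break_even_prompts(3600.0, 1.0, 800.0))
```

Under these made-up numbers the training cost is recovered after a handful of prompts, but the real break-even point depends on the actual training time and hardware, which is exactly the information the critique says is missing.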

The researchers also note that their method for fine-tuning LLMs to be more robust against jailbreaking attacks may have unintended consequences, such as reducing the models' overall performance or introducing new vulnerabilities. Careful evaluation and ongoing monitoring would be necessary to ensure the safety and reliability of these "hardened" LLMs.

## Conclusion

This paper presents a novel approach for generating human-readable adversarial prompts to "jailbreak" Large Language Models (LLMs), causing them to produce inappropriate or harmful output. The key innovation is the use of a separate LLM, called the AdvPrompter, which can generate these adversarial prompts much faster than existing optimization-based methods.

The researchers also demonstrate a technique for fine-tuning LLMs to be more robust against jailbreaking attacks, while maintaining their performance on other tasks. However, the potential limitations and unintended consequences of this approach require further investigation and ongoing vigilance.

Overall, this work highlights the importance of developing robust and secure AI systems as these models grow in influence and capability. The rapid progress in this area also underscores the need for continued research and collaboration to ensure the responsible development and deployment of large language models.