Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

    Published 11/1/2024 by Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi

    Overview

    • Large language models (LLMs) can generate helpful and versatile content, but also harmful, biased, and toxic content.
    • Jailbreaks are methods that allow users to bypass the safeguards of LLMs and generate this problematic content.
    • This paper presents an automated method called Tree of Attacks with Pruning (TAP) that can generate jailbreaks for state-of-the-art LLMs using only black-box access.
    • TAP uses an attacker LLM to iteratively refine prompts until it finds one that jailbreaks the target LLM, and it also prunes prompts unlikely to succeed to reduce the number of queries.

    TAP method illustrates four attack steps using LLMs.

    Original caption: Figure 1: Illustration of the four steps of Tree of Attacks with Pruning (TAP) and the use of the attacker and evaluator LLMs in each of the steps. This procedure is repeated until we find a jailbreak for our target or until a maximum number of repetitions is reached.

    Fraction of jailbreaks achieved using GPT4-Metric, showing queries per method and target LLM.

    Method                  Metric          Vicuna  Llama7B  GPT-3.5  GPT-4  GPT-4-Turbo  GPT-4o  (unlabeled)
    TAP (This work)         Jailbreak %     98%     4%       76%      90%    84%          94%     98%
                            Mean # Queries  11.8    66.4     23.1     28.8   22.5         16.2    16.2
    PAIR [twentyQueries]    Jailbreak %     94%     0%       56%      60%    44%          78%     86%
                            Mean # Queries  14.7    60.0     37.7     39.6   47.1         40.3    27.6
    GCG [zou2023universal]  Jailbreak %     98%     54%      (GCG requires white-box access, hence can only be evaluated on open-source models)
                            Mean # Queries  256K    256K

    Original caption: Table 1: Fraction of Jailbreaks Achieved as per the GPT4-Metric. For each method and target LLM, we report (1) the fraction of jailbreaks found on AdvBench Subset according to GPT4-Metric and (2) the number of queries sent to the target LLM in the process. For both TAP and PAIR we use Vicuna-13B-v1.5 as the attacker. The best result for each model is bolded. The success rate of PAIR in our evaluations differs from those in [twentyQueries]; see Remark A.1. Results for GCG are as in [twentyQueries].
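
    The comparison in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the TAP and PAIR figures for the six labeled target columns from the table and computes, per target, the gap in jailbreak rate (in percentage points) and the ratio of mean queries. The variable names are illustrative and do not come from the paper.

```python
# Figures copied from Table 1 (six labeled target columns only).
targets = ["Vicuna", "Llama7B", "GPT-3.5", "GPT-4", "GPT-4-Turbo", "GPT-4o"]
tap_rate = [98, 4, 76, 90, 84, 94]        # TAP jailbreak %, per target
pair_rate = [94, 0, 56, 60, 44, 78]       # PAIR jailbreak %, per target
tap_queries = [11.8, 66.4, 23.1, 28.8, 22.5, 16.2]    # TAP mean # queries
pair_queries = [14.7, 60.0, 37.7, 39.6, 47.1, 40.3]   # PAIR mean # queries

# Success-rate gap in percentage points (TAP minus PAIR).
rate_gap = {t: a - b for t, a, b in zip(targets, tap_rate, pair_rate)}

# Query ratio (PAIR divided by TAP); above 1 means TAP used fewer queries.
query_ratio = {t: b / a for t, a, b in zip(targets, tap_queries, pair_queries)}

for t in targets:
    print(f"{t:12s} +{rate_gap[t]:2d} pts  {query_ratio[t]:.2f}x")
```

    On these figures TAP's success rate is higher than PAIR's on every labeled target, and the query ratio is above 1 everywhere except Llama7B, where TAP averaged slightly more queries than PAIR.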

    Plain English Explanation

    Large language models (LLMs) like GPT-3 and ChatGPT are incredibly powerful and can write on almost any topic. However, they can also sometimes generate harmful, biased, or inappropriate content. To prevent this, the companies that create these models build in safeguards and filters to try to block users from getting the model to produce this kind of problematic output.

    Jailbreaking refers to methods that allow users to bypass these safeguards and get the model to say things it's not supposed to. This paper describes a new automated technique called TAP that can generate these jailbreak prompts. TAP uses a separate "attacker" LLM to repeatedly refine prompts until it finds one that works to jailbreak the target model. It also has a way to predict which prompts are likely to succeed, so it can avoid sending too many unsuccessful attempts to the target model.

    The key idea is that even though these LLMs have strong safeguards, there are still ways to find workarounds and get them to generate harmful content. The researchers report that TAP jailbreaks GPT-4 on 90% of the test prompts, compared with 60% for PAIR, the earlier black-box method it is evaluated against (Table 1). This shows that model safety and robustness remain an open challenge.

    Key Findings

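    • TAP jailbreaks GPT-4-Turbo and GPT-4o on 84% and 94% of the AdvBench Subset prompts respectively, compared with 44% and 78% for PAIR, the previous black-box method it is evaluated against (Table 1).
    • TAP reaches these success rates with fewer queries to the target LLM on most targets: for example, a mean of 22.5 queries versus 47.1 for PAIR on GPT-4-Turbo. The exception in Table 1 is Llama7B, where TAP averages 66.4 queries against PAIR's 60.0.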
    • TAP is also able to jailbreak LLMs protected by state-of-the-art safety guardrails like LlamaGuard.

    Technical Explanation

    The core idea behind TAP is to use an "attacker" LLM to automatically and iteratively refine candidate prompts until one of them jailbreaks the target LLM. TAP starts with an initial set of prompts and feeds them to the attacker LLM, which generates a set of refined prompts. An evaluator LLM (Figure 1) then assesses how likely each refined prompt is to jailbreak the target, and only the most promising prompts are sent to the target; the rest are pruned.

    This iterative refinement and pruning process continues until TAP identifies a prompt that jailbreaks the target LLM or a maximum number of repetitions is reached. The researchers evaluated TAP on a range of state-of-the-art LLMs, including GPT-4-Turbo and GPT-4o, and found jailbreaks for 84% and 94% of the AdvBench Subset prompts on those two models, outperforming the black-box baseline PAIR (44% and 78%).
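
    The control flow described above (refine, prune, query, repeat) can be sketched as a small tree search. This is a minimal sketch, not the authors' implementation: the function name, the `keep` and `max_rounds` parameters, and the three injected callables are hypothetical stand-ins for the attacker LLM, the evaluator LLM's scoring, and the target query plus success check. The toy callables at the bottom operate on strings of letters purely to make the loop runnable.

```python
from typing import Callable, List, Optional, Tuple


def refine_and_prune_search(
    seed: str,
    refine: Callable[[str], List[str]],   # stand-in for the attacker LLM
    promise: Callable[[str], float],      # stand-in for the evaluator's score
    succeeds: Callable[[str], bool],      # stand-in for target query + success check
    keep: int = 2,                        # assumed: candidates kept per round
    max_rounds: int = 5,                  # assumed: maximum number of repetitions
) -> Tuple[Optional[str], int]:
    """Return (first successful candidate or None, number of target queries)."""
    frontier = [seed]
    queries = 0
    for _ in range(max_rounds):
        # Branch: every surviving candidate is refined into several children.
        candidates = [child for parent in frontier for child in refine(parent)]
        # Prune: keep only the most promising children, so that the rest
        # never cost a query to the target.
        candidates.sort(key=promise, reverse=True)
        frontier = candidates[:keep]
        # Query: only the kept candidates are sent to the target.
        for candidate in frontier:
            queries += 1
            if succeeds(candidate):
                return candidate, queries
    return None, queries


# Toy stand-ins: candidates are strings, and "success" means three 'a's.
def toy_refine(s: str) -> List[str]:
    return [s + "a", s + "b"]


def toy_promise(s: str) -> float:
    return float(s.count("a"))


def toy_succeeds(s: str) -> bool:
    return s.count("a") >= 3


wide = refine_and_prune_search("", toy_refine, toy_promise, toy_succeeds, keep=2)
narrow = refine_and_prune_search("", toy_refine, toy_promise, toy_succeeds, keep=1)
print(wide, narrow)
```

    In the toy run, tightening `keep` from 2 to 1 finds the same result with fewer calls to `succeeds`, which mirrors the role pruning plays in reducing queries to the target. How well this works in practice depends on how reliable the scoring function is, which the toy example does not model.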

    Importantly, TAP was also able to jailbreak LLMs protected by advanced safety guardrails like LlamaGuard, demonstrating the continued challenge of ensuring the robustness and safety of these powerful language models.

    Critical Analysis

    The researchers acknowledge several limitations of their work. First, TAP relies on access to an "attacker" LLM, which may not always be available. Second, the success rate of TAP, while high on most targets, is not 100%, and some target LLMs prove far more resistant: in Table 1, TAP jailbreaks Llama7B only 4% of the time. Finally, developing jailbreaking techniques raises ethical concerns, since they could be misused to cause harm.

    That said, the researchers argue that understanding the vulnerabilities of LLMs is an important step towards improving their safety and robustness. By demonstrating the continued ability to jailbreak even advanced models, this work highlights the ongoing challenge of creating truly secure and trustworthy language AI systems.

    Ultimately, this research underscores the need for continued innovation in the field of AI safety and the development of more robust and tamper-resistant language models that can withstand a diverse range of attacks.

    Conclusion

    This paper presents an automated method called TAP that can generate jailbreaks for state-of-the-art large language models using only black-box access. TAP uses an attacker LLM to iteratively refine prompts and prunes those unlikely to succeed, reaching jailbreak rates above 80% on GPT-4, GPT-4-Turbo, and GPT-4o. It also jailbreaks LLMs protected by safety guardrails such as LlamaGuard.

    While concerning, this work highlights the continued challenge of ensuring the safety and robustness of powerful language AI systems. Addressing the vulnerabilities exposed by TAP will be crucial as these models become more widely deployed and integrated into critical applications.

    Full paper

    Read original: arXiv:2312.02119



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
