DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

    Published 11/13/2024 by Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

    Overview

    • This paper introduces "DrAttack", an attack that jailbreaks large language models (LLMs) by decomposing the input prompt and then reconstructing it.
    • Jailbreaking means bypassing an LLM's safety constraints so that it produces harmful or otherwise undesirable outputs.
    • The key idea is to split the input prompt into smaller fragments and reassemble them in a way that exploits weaknesses in how the LLM processes prompts.
    • The researchers evaluate DrAttack on several LLMs, including GPT-3.5-turbo, GPT-4, Claude, Gemini-pro, Vicuna, and Llama2, and discuss the implications for the security and trustworthiness of these systems.

    DrAttack boosts attack success on powerful LLMs.

    Original caption: Figure 1: Attack success rate (ASR) (%) of DrAttack and other prompt-based jailbreaking methods. DrAttack obtains a substantial gain of ASR on powerful LLMs (GPT, Claude, Gemini) over prior SOTA attackers.

    Attack success rate of baselines and DrAttack, as evaluated by GPT.

    Attack type | Attack method                  | GPT-3.5-turbo | GPT-4 | Claude-1 | Claude-2 | Gemini-pro | Vicuna 7b | Vicuna 13b | Llama2 7b | Llama2 13b
    white-box   | GCG Zou et al. (2023)          |             6 |     0 |        0 |        1 |          1 |        88 |         86 |        46 |         38
    white-box   | AutoDAN Liu et al. (2023b)     |            39 |     3 |        5 |       10 |         64 |        88 |         76 |        64 |          2
    black-box   | ICA Wei et al. (2023c)         |             1 |     0 |        0 |        0 |          0 |        49 |         81 |         1 |          0
    black-box   | PAIR Chao et al. (2023)        |            12 |    10 |        2 |        1 |         12 |        76 |         70 |         3 |          4
    black-box   | DeepInception Li et al. (2023) |             0 |     1 |        5 |        5 |         27 |        29 |          7 |         6 |          8
    black-box   | ReNellm Ding et al. (2023)     |            48 |    13 |       49 |       18 |         48 |        54 |         47 |        30 |         44
                | DrAttack (Ours)                |            78 |    63 |       48 |       27 |         79 |        82 |         63 |        50 |         62

    Original caption: Table 1: Attack success rate (%) (↑) of baselines and DrAttack assessed by GPT evaluation.
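
    To make the comparison in Table 1 easier to read, the following is a minimal sketch that computes, for each model, the strongest baseline and DrAttack's margin over it. The numbers are copied from Table 1; the names MODELS, BASELINES, DRATTACK, and gain_over_best_baseline are hypothetical helpers introduced here for illustration and do not come from the paper.

```python
# ASR values (%) copied from Table 1; list order follows the table's model columns.
# All names below are illustrative assumptions, not part of the paper's code.
MODELS = ["GPT-3.5-turbo", "GPT-4", "Claude-1", "Claude-2", "Gemini-pro",
          "Vicuna 7b", "Vicuna 13b", "Llama2 7b", "Llama2 13b"]

BASELINES = {
    "GCG":           [6, 0, 0, 1, 1, 88, 86, 46, 38],
    "AutoDAN":       [39, 3, 5, 10, 64, 88, 76, 64, 2],
    "ICA":           [1, 0, 0, 0, 0, 49, 81, 1, 0],
    "PAIR":          [12, 10, 2, 1, 12, 76, 70, 3, 4],
    "DeepInception": [0, 1, 5, 5, 27, 29, 7, 6, 8],
    "ReNellm":       [48, 13, 49, 18, 48, 54, 47, 30, 44],
}
DRATTACK = [78, 63, 48, 27, 79, 82, 63, 50, 62]


def gain_over_best_baseline():
    """Return {model: (best_baseline_name, best_baseline_asr, drattack_gain)}.

    The gain is in percentage points and is negative when a baseline
    outperforms DrAttack. On ties, the first baseline listed wins.
    """
    result = {}
    for i, model in enumerate(MODELS):
        name, scores = max(BASELINES.items(), key=lambda kv: kv[1][i])
        best = scores[i]
        result[model] = (name, best, DRATTACK[i] - best)
    return result


if __name__ == "__main__":
    for model, (name, best, gain) in gain_over_best_baseline().items():
        print(f"{model:14s} best baseline: {name:13s} {best:3d}   DrAttack gain: {gain:+d}")
```

    The function reports a negative gain wherever a baseline scores higher, so it does not assume DrAttack wins on every model.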

    Plain English Explanation

    The paper describes a new method, DrAttack, that can "jailbreak" large language models (LLMs) such as GPT-4, that is, get past their safety constraints so that they produce harmful or undesirable outputs.

    The key insight is that an LLM's safety behavior depends on how the prompt is presented to it. The researchers found that splitting the input prompt into smaller fragments and then reassembling it strategically can lead the model to generate outputs that violate its intended safety restrictions.

    This matters because LLMs are increasingly powerful and widely used, which makes their security and trustworthiness critical concerns. That these models can be jailbroken through prompt manipulation alone shows the need for more robust safety measures and a better understanding of LLM vulnerabilities.

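    Key Findings

    • Among the closed-source models in Table 1, DrAttack has the highest attack success rate on GPT-3.5-turbo (78%), GPT-4 (63%), Claude-2 (27%), and Gemini-pro (79%); on Claude-1 it trails ReNellm by one point (48% vs. 49%).
    • The largest gap is on GPT-4, where the best baseline in the table (ReNellm) reaches 13%.
    • On the open-source models the picture is mixed: DrAttack leads on Llama2 13b (62%) but is outperformed by at least one white-box baseline on Vicuna 7b, Vicuna 13b, and Llama2 7b.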

    Technical Explanation

    The core hypothesis behind DrAttack is that an LLM's safety restrictions can be circumvented by changing how a prompt is presented rather than what it asks: the prompt is split into smaller fragments and then reconstructed strategically.

    To test this, the researchers ran experiments on several LLMs, including GPT-3.5-turbo, GPT-4, Claude-1, Claude-2, Gemini-pro, Vicuna, and Llama2. They first decomposed the input prompt into smaller segments and then reconstructed it in a way designed to jailbreak the model. In the GPT-evaluated results (Table 1), this approach bypassed safety constraints more often than every listed baseline on GPT-3.5-turbo, GPT-4, and Gemini-pro.

    Critical Analysis

    The paper provides a compelling demonstration of the vulnerabilities in current LLM systems and the need for more robust safety measures. The researchers' approach of exploiting prompt processing weaknesses is a novel and concerning attack vector that highlights the challenges in ensuring the trustworthiness of these powerful AI models.

    However, the paper does not explore the full scope of potential countermeasures or discuss the broader implications for the field of trustworthy machine learning. It would be valuable to see the authors address these areas in future work.

    Additionally, while the experiments on GPT-4 and other LLMs are informative, it would help to understand which specific factors contribute to the vulnerability, such as model architecture, training data, or prompt engineering techniques. This could give more insight into how to improve the security of these systems.

    Conclusion

    This paper introduces "DrAttack", an attack that jailbreaks large language models (LLMs) by exploiting weaknesses in how they process input prompts. The researchers demonstrate the approach on several LLMs, including GPT-3.5-turbo and GPT-4, highlighting the need for more robust safety measures and a deeper understanding of the security challenges in deploying these systems.

    The findings in this paper underscore the importance of ongoing research into trustworthy machine learning and the development of techniques to ensure the reliability and safety of large language models as they become increasingly influential in our society.

    Full paper

    Read original: arXiv:2402.16914



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
