Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- URL: http://arxiv.org/abs/2407.16205v3
- Date: Tue, 13 Aug 2024 13:46:18 GMT
- Title: Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- Authors: Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han,
- Abstract summary: We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs)
ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
- Score: 21.252514293436437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of Large Language Models (LLMs) has brought remarkable generative capabilities across diverse tasks. However, despite the impressive achievements, these LLMs still have numerous inherent vulnerabilities, particularly when faced with jailbreak attacks. By investigating jailbreak attacks, we can uncover hidden weaknesses in LLMs and inform the development of more robust defense mechanisms to fortify their security. In this paper, we further explore the boundary of jailbreak attacks on LLMs and propose Analyzing-based Jailbreak (ABJ). This effective jailbreak attack method takes advantage of LLMs' growing analyzing and reasoning capability and reveals their underlying vulnerabilities when facing analyzing-based tasks. We conduct a detailed evaluation of ABJ across various open-source and closed-source LLMs, which achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency. Our research highlights the importance of prioritizing and enhancing the safety of LLMs to mitigate the risks of misuse. The code is publicly available at hhttps://github.com/theshi-1128/ABJ-Attack. Warning: This paper contains examples of LLMs that might be offensive or harmful.
Related papers
- SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts.
Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security.
We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze.
Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z) - The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models [8.423787598133972]
This paper uncovers a critical vulnerability in the function calling process of large language models (LLMs)
We introduce a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters.
Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs.
arXiv Detail & Related papers (2024-07-25T10:09:21Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner.
LLMs are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - Distract Large Language Models for Automatic Jailbreak Attack [8.364590541640482]
We propose a novel black-box jailbreak framework for automated red teaming of Large language models.
We designed malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs.
arXiv Detail & Related papers (2024-03-13T11:16:43Z) - Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z) - Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [98.18718484152595]
We propose to integrate goal prioritization at both training and inference stages to counteract the intrinsic conflict between the goals of being helpful and ensuring safety.
Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety.
arXiv Detail & Related papers (2023-11-15T16:42:29Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.