Related papers: Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

URL: http://arxiv.org/abs/2312.04127v2
Date: Fri, 23 Feb 2024 07:32:27 GMT
Title: Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
Authors: Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin
Abstract summary: "Jailbreak Attack" is phenomenon where Large Language Models (LLMs) generate harmful responses when faced with malicious instructions. We introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses. Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions.
Score: 26.741029482196534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Extensive work has been devoted to improving the safety mechanism of Large Language Models (LLMs). However, LLMs still tend to generate harmful responses when faced with malicious instructions, a phenomenon referred to as "Jailbreak Attack". In our research, we introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses. The jailbreak idea of our method is "Inherent Response Tendency Analysis" which identifies real-world instructions that can inherently induce LLMs to generate affirmation responses and the corresponding jailbreak strategy is "Real-World Instructions-Driven Jailbreak" which involves strategically splicing real-world instructions identified through the above analysis around the malicious instruction. Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions. We conduct experiments to verify the effectiveness of our jailbreak idea and the rationality of our jailbreak strategy design. Notably, our method designed a semantically coherent attack prompt, highlighting the potential risks of LLMs. Our study provides detailed insights into jailbreak attacks, establishing a foundation for the development of safer LLMs.

Related papers

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations [9.952498288063532]
Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries.<n>We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense.<n>Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack.
arXiv Detail & Related papers (2025-07-08T14:45:21Z)
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models [59.29840790102413]
Existing jailbreak attacks are primarily based on opaque optimization techniques and gradient search methods. We propose LLM-Virus, a jailbreak attack method based on evolutionary algorithm, termed evolutionary jailbreak. Our results show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
arXiv Detail & Related papers (2024-12-28T07:48:57Z)
SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts. Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs) ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs)<n>Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
Distract Large Language Models for Automatic Jailbreak Attack [8.364590541640482]
We propose a novel black-box jailbreak framework for automated red teaming of Large language models. We designed malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs.
arXiv Detail & Related papers (2024-03-13T11:16:43Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.