Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
- URL: http://arxiv.org/abs/2408.04686v1
- Date: Thu, 8 Aug 2024 09:18:47 GMT
- Title: Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
- Authors: Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li,
- Abstract summary: Context Fusion Attack (CFA) is a contextual fusion black-box jailbreak attack method.
We show CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies.
- Score: 2.5167155755957316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observe that existing methods lack specific considerations for the role of multiturn dialogues in attack strategies, leading to semantic deviations during continuous interactions. Therefore, in this paper, we establish a theoretical foundation for multi-turn attacks by considering their support in jailbreak attacks, and based on this, propose a context-based contextual fusion black-box jailbreak attack method, named Context Fusion Attack (CFA). This method approach involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent. Through comparisons on various mainstream LLMs and red team datasets, we have demonstrated CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies, particularly showcasing significant advantages on Llama3 and GPT-4.
Related papers
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [36.44365630876591]
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities.
LLMs have been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks.
We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z) - Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation.
Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.
This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z) - BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models [47.576957746503666]
BlackDAN is an innovative black-box attack framework with multi-objective optimization.
It generates high-quality prompts that effectively facilitate jailbreaking.
It maintains contextual relevance and minimize detectability.
arXiv Detail & Related papers (2024-10-13T11:15:38Z) - Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [13.317364896194903]
We propose a two-stage adversarial tuning framework to enhance Large Language Models' generalized defense capabilities.
In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently generate token-level adversarial prompts.
In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts.
arXiv Detail & Related papers (2024-06-07T15:37:15Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z) - Hijacking Large Language Models via Adversarial In-Context Learning [8.15194326639149]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks.
Existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL.
This work introduces a novel transferable attack against ICL to address these issues.
arXiv Detail & Related papers (2023-11-16T15:01:48Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.