Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
- URL: http://arxiv.org/abs/2502.11054v3
- Date: Wed, 19 Feb 2025 15:36:47 GMT
- Title: Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
- Authors: Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao,
- Abstract summary: Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
- Score: 53.580928907886324
- License:
- Abstract: Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
Related papers
- Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems.
We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics.
We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.
PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.
To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [35.7801861576917]
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities.
LLMs have been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks.
We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z) - Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation.
Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.
This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z) - You Know What I'm Saying: Jailbreak Attack via Implicit Reference [22.520950422702757]
This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR)
AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context.
Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models.
arXiv Detail & Related papers (2024-10-04T18:42:57Z) - Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles [2.5167155755957316]
Context Fusion Attack (CFA) is a contextual fusion black-box jailbreak attack method.
We show CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies.
arXiv Detail & Related papers (2024-08-08T09:18:47Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.