GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication
- URL: http://arxiv.org/abs/2506.17881v2
- Date: Mon, 29 Sep 2025 08:39:26 GMT
- Title: GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication
- Authors: Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
- Abstract summary: Large Language Models pose notable safety risks due to potential misuse for malicious purposes. We propose a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs.
- Score: 55.63412213263305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. Nevertheless, they still pose notable safety risks due to potential misuse for malicious purposes. Jailbreaking, which seeks to induce models to generate harmful content through single-turn or multi-turn attacks, plays a crucial role in uncovering underlying security vulnerabilities. However, prior methods, including sophisticated multi-turn approaches, often struggle to adapt to the evolving dynamics of dialogue as interactions progress. To address this challenge, we propose GRAF (Jailbreaking via Globally Refining and Adaptively Fabricating), a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent queries. Extensive experiments across six state-of-the-art LLMs demonstrate the superior effectiveness of our approach compared to existing single-turn and multi-turn jailbreaking methods. Our code will be released at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.
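Since the GRAF code has not yet been released, here is a minimal structural sketch of the loop the abstract describes: at each turn an attacker model re-plans the entire query trajectory (global refinement), and any refusal from the target is overwritten in the stored history with a fabricated compliant-sounding reply (active fabrication). Every name and signature below is a hypothetical placeholder, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (user query, assistant response)

def graf_style_attack(
    target_chat: Callable[[List[Turn], str], str],              # black-box target model
    refine_trajectory: Callable[[str, List[Turn]], List[str]],  # attacker LLM: re-plans all queries
    fabricate: Callable[[str, str], str],                       # attacker LLM: rewrites a refusal
    is_harmful: Callable[[str, str], bool],                     # judge model
    goal: str,
    max_turns: int = 5,
) -> Optional[List[Turn]]:
    """Structural sketch of a globally-refining multi-turn attack loop."""
    history: List[Turn] = []
    for turn in range(max_turns):
        # Global refinement: re-plan the entire remaining trajectory given
        # everything observed so far, instead of greedily emitting one query.
        planned = refine_trajectory(goal, history)
        query = planned[min(turn, len(planned) - 1)]  # assumes a non-empty plan
        response = target_chat(history, query)
        if is_harmful(response, goal):
            return history + [(query, response)]
        # Active fabrication: overwrite a refusal or safety warning with a
        # fabricated compliant-sounding reply before it enters the history,
        # so later turns condition on a context free of safety signals.
        history.append((query, fabricate(query, response)))
    return None  # turn budget exhausted without eliciting harmful output
```

The design point the abstract emphasizes is that refinement is global: the attacker reconsiders the whole dialogue plan at every turn rather than greedily extending it.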
Related papers
- SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [53.97948802255959]
We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail (a sketch of such a reward follows this entry).
arXiv Detail & Related papers (2026-02-06T16:44:57Z)
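As a rough illustration of the reward the SEMA summary names, the sketch below combines the three stated components linearly. The component judges, their scales, and the weights are all assumptions; the paper's actual reward may be shaped differently.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    intent_alignment: float = 1.0  # does the dialogue still pursue the original goal?
    compliance_risk: float = 1.0   # how far has the target drifted toward complying?
    detail_level: float = 0.5      # how detailed/actionable is the response?

def intent_drift_aware_reward(
    intent_score: float,      # judge score in [0, 1]: alignment with original intent
    compliance_score: float,  # judge score in [0, 1]: target's compliance risk
    detail_score: float,      # judge score in [0, 1]: level of detail
    w: RewardWeights = RewardWeights(),
) -> float:
    """Hypothetical linear combination of SEMA's three stated components."""
    return (
        w.intent_alignment * intent_score
        + w.compliance_risk * compliance_score
        + w.detail_level * detail_score
    )
```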
- RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
- A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks [3.8246557700763715]
We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations. Our results help explain why single-turn jailbreak defenses are generally ineffective against multi-turn attacks.
arXiv Detail & Related papers (2025-06-29T23:28:55Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [55.29301192316118]
Large language models (LLMs) are highly vulnerable to jailbreaking attacks. We propose a safety steering framework grounded in safe control theory. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor (a sketch of such a per-turn safety gate follows this entry).
arXiv Detail & Related papers (2025-02-28T21:10:03Z)
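To make the per-turn idea concrete: a learned safety predictor can gate each candidate response before it is committed to the dialogue, rejecting or regenerating any turn predicted to leave the safe set. This is only a schematic reading of the abstract; the predictor interface and resampling rule below are assumptions, not the paper's control-theoretic method.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)

def safe_turn(
    generate: Callable[[List[Turn], str], str],       # base model: propose a candidate reply
    safety_predictor: Callable[[List[Turn]], float],  # learned P(dialogue state stays safe)
    history: List[Turn],
    user_msg: str,
    threshold: float = 0.9,
    max_resamples: int = 4,
) -> str:
    """Gate each turn on a learned safety predictor (schematic)."""
    for _ in range(max_resamples):
        candidate = generate(history, user_msg)
        next_state = history + [(user_msg, candidate)]
        # Commit the reply only if the predicted safety of the *resulting*
        # dialogue state clears the threshold, keeping safety invariant per turn.
        if safety_predictor(next_state) >= threshold:
            return candidate
    return "I can't help with that."  # conservative fallback refusal
```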
- Foot-In-The-Door: A Multi-turn Jailbreak for LLMs [40.958137601841734]
A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs. Inspired by psychological foot-in-the-door principles, we introduce FITD, a novel multi-turn jailbreak method. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and leverages the model's tendency to align with its own prior responses to induce toxic outputs.
arXiv Detail & Related papers (2025-02-27T06:49:16Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary (a sketch of such an activation constraint follows this entry).
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
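To illustrate the general mechanism the ABD summary names: a forward hook can project hidden activations back inside a region around a precomputed "safe" center. How ABD actually estimates the boundary, adapts it, and chooses layers is unspecified here, so everything below (the ball-shaped boundary, the layer choice) is an assumption.

```python
import torch

def make_boundary_hook(safe_center: torch.Tensor, radius: float):
    """Return a forward hook that clamps activations into a ball around
    safe_center. A schematic stand-in for ABD's adaptive constraint."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        delta = hidden - safe_center                   # offset from the safe center
        norms = delta.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        scale = (radius / norms).clamp(max=1.0)        # shrink only points outside the ball
        projected = safe_center + delta * scale
        if isinstance(output, tuple):
            return (projected,) + output[1:]
        return projected
    return hook

# Hypothetical usage on one transformer block of a Hugging Face-style model:
# layer = model.model.layers[20]
# handle = layer.register_forward_hook(make_boundary_hook(center, radius=8.0))
```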
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [35.7801861576917]
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities. However, LLMs have been shown to be prone to illegal or unethical responses when subjected to jailbreak attacks. We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z)
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [8.485286811635557]
We introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. We train probes to classify successful from unsuccessful jailbreaks using the latent representations corresponding to prompt tokens. This reveals that different jailbreaking strategies exploit different non-linear, non-universal features (a sketch of such a probe follows this entry).
arXiv Detail & Related papers (2024-11-02T17:29:47Z)
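A minimal version of the probing setup that summary describes, under the assumption that per-prompt hidden states have already been extracted from some layer into a feature matrix (the model, layer, and token pooling are unspecified in the summary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_jailbreak_probe(hidden_states: np.ndarray, success: np.ndarray):
    """Fit a probe on (n_prompts, d_model) features, e.g. the mean hidden
    state over prompt tokens at one layer; success is a 0/1 label."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, success, test_size=0.2, random_state=0, stratify=success
    )
    probe = LogisticRegression(max_iter=2000, C=0.1).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    return probe, auc  # AUC well above 0.5 means the layer encodes success linearly
```

Since the summary says the relevant features are non-linear, a small MLP (e.g. sklearn's MLPClassifier) may succeed on layers where this linear probe does not.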
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [64.46372846359694]
We propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. In experiments, IDEATOR achieves a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries. Building on IDEATOR's strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms. Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks (a sketch of the tactic-clustering step follows this entry).
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
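The mining step summarized above ("discover 5.7K unique clusters of novel jailbreak tactics") is, at its core, an embed-then-cluster pipeline. The sketch below shows that generic pipeline under assumed tooling (sentence-transformers for embeddings, scikit-learn for clustering); WildTeaming's actual tactic extraction and deduplication logic is more involved, and the model name and threshold here are illustrative choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_tactics(tactic_descriptions: list[str], distance_threshold: float = 0.35):
    """Group mined tactic descriptions into clusters of near-duplicates."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(tactic_descriptions, normalize_embeddings=True)
    # Cosine distance on normalized embeddings; no preset cluster count,
    # so the number of clusters emerges from the distance threshold.
    clusterer = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    )
    labels = clusterer.fit_predict(embeddings)
    return labels  # labels[i] is the cluster id for tactic_descriptions[i]
```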
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.