SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
- URL: http://arxiv.org/abs/2602.06854v1
- Date: Fri, 06 Feb 2026 16:44:57 GMT
- Title: SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
- Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao,
- Abstract summary: We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data.<n>Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts.<n>We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.
- Score: 53.97948802255959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average $80.1\%$ ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
Related papers
- RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks.<n>This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z) - Multi-Turn Jailbreaks Are Simpler Than They Seem [3.6010884750431438]
Multi-turn jailbreak attacks achieve success rates exceeding 70% against models optimized for single-turn protection.<n>Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems.
arXiv Detail & Related papers (2025-08-11T05:57:41Z) - GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication [55.63412213263305]
Large Language Models pose notable safety risks due to potential misuse for malicious purposes.<n>We propose a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction.<n>In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs.
arXiv Detail & Related papers (2025-06-22T03:15:05Z) - M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs [8.91993614197627]
We introduce a novel framework for consolidating multi-turn adversarial jailbreak'' prompts into single-turn queries.<n>Our multi-turn-to-single-turn (M2S) methods systematically reformat multi-turn dialogues into structured single-turn prompts.<n>Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points.
arXiv Detail & Related papers (2025-03-06T07:34:51Z) - Foot-In-The-Door: A Multi-turn Jailbreak for LLMs [40.958137601841734]
A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs.<n>Inspired by psychological foot-in-the-door principles, we introduce FITD,a novel multi-turn jailbreak method.<n>Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model's response by itself to induce toxic responses.
arXiv Detail & Related papers (2025-02-27T06:49:16Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors [12.550678408719756]
We propose a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors.<n>Experiments demonstrate that Siren achieves an attack success rate (ASR) of 90% with LLaMA-3-8B as the attacker.<n>We hope Siren inspires the development of stronger defenses against advanced multi-turn jailbreak attacks.
arXiv Detail & Related papers (2025-01-24T05:31:27Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks.<n>Existing jailbreaking methods are computationally costly.<n>We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.