Related papers: M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs

M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs

URL: http://arxiv.org/abs/2503.04856v3
Date: Tue, 05 Aug 2025 08:06:33 GMT
Title: M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs
Authors: Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim,
Abstract summary: We introduce a novel framework for consolidating multi-turn adversarial jailbreak'' prompts into single-turn queries.<n>Our multi-turn-to-single-turn (M2S) methods systematically reformat multi-turn dialogues into structured single-turn prompts.<n>Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points.
Score: 8.91993614197627
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce a novel framework for consolidating multi-turn adversarial ``jailbreak'' prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates, they demand considerable human effort and time. Our multi-turn-to-single-turn (M2S) methods -- Hyphenize, Numberize, and Pythonize -- systematically reformat multi-turn dialogues into structured single-turn prompts. Despite removing iterative back-and-forth interactions, these prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods achieve attack success rates from 70.6 percent to 95.9 percent across several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points while cutting token usage by more than half on average. Further analysis shows that embedding malicious requests in enumerated or code-like structures exploits ``contextual blindness'', bypassing both native guardrails and external input-output filters. By converting multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale red teaming and reveals critical weaknesses in contemporary LLM defenses.

Related papers

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [53.97948802255959]
We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data.<n>Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts.<n>We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.
arXiv Detail & Related papers (2026-02-06T16:44:57Z)
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks.<n>This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication [55.63412213263305]
Large Language Models pose notable safety risks due to potential misuse for malicious purposes.<n>We propose a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction.<n>In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs.
arXiv Detail & Related papers (2025-06-22T03:15:05Z)
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.<n>We present AutoAdv, a novel framework that automates adversarial prompt generation.<n>We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs [40.958137601841734]
A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs.<n>Inspired by psychological foot-in-the-door principles, we introduce FITD,a novel multi-turn jailbreak method.<n>Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model's response by itself to induce toxic responses.
arXiv Detail & Related papers (2025-02-27T06:49:16Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors [12.550678408719756]
We propose a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. Experiments demonstrate that Siren achieves an attack success rate (ASR) of 90% with LLaMA-3-8B as the attacker. We hope Siren inspires the development of stronger defenses against advanced multi-turn jailbreak attacks.
arXiv Detail & Related papers (2025-01-24T05:31:27Z)
Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans. LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses. We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against the advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [30.67803190789498]
We propose a new jailbreak approach, RED QUEEN ATTACK, that constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks.
arXiv Detail & Related papers (2024-09-26T01:24:17Z)
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet [11.83818222565186]
We show that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks.
arXiv Detail & Related papers (2024-08-27T17:33:30Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.