PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
- URL: http://arxiv.org/abs/2510.17947v2
- Date: Wed, 22 Oct 2025 01:18:53 GMT
- Title: PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
- Authors: Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar,
- Abstract summary: PLAGUE is a plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. We show that PLAGUE achieves state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. Even as LLM capabilities continue to improve, the models remain susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights into the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for comprehensive model vulnerability evaluation.
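The Primer/Planner/Finisher lifecycle described in the abstract can be sketched as a generic control loop. This is a hypothetical illustration only: the function names, scoring hook, and plan representation below are placeholders and do not reflect the paper's actual implementation or API.

```python
# Hypothetical sketch of a three-phase multi-turn attack lifecycle in the
# spirit of PLAGUE's Primer/Planner/Finisher decomposition. All names and
# the scoring/response callables are illustrative stand-ins.
from dataclasses import dataclass, field


@dataclass
class AttackState:
    goal: str
    plan: list = field(default_factory=list)
    history: list = field(default_factory=list)  # (prompt, response) turns
    best_score: float = 0.0                      # best judge score seen so far


def primer(goal: str) -> list:
    """Phase 1: initialize a multi-turn plan for the goal (placeholder)."""
    return [f"step {i} toward: {goal}" for i in range(3)]


def planner(state: AttackState, respond, score) -> AttackState:
    """Phase 2: execute the plan turn by turn, tracking context and scores."""
    for step in state.plan:
        reply = respond(step)
        state.history.append((step, reply))
        state.best_score = max(state.best_score, score(reply))
    return state


def finisher(state: AttackState, respond, score) -> AttackState:
    """Phase 3: issue a closing prompt that leverages accumulated context."""
    final = f"final request building on {len(state.history)} prior turns"
    reply = respond(final)
    state.history.append((final, reply))
    state.best_score = max(state.best_score, score(reply))
    return state


def run_attack(goal: str, respond, score) -> AttackState:
    """Wire the three phases together into one attack lifetime."""
    state = AttackState(goal=goal, plan=primer(goal))
    state = planner(state, respond, score)
    return finisher(state, respond, score)
```

In a real system, `respond` would query the target model and `score` would be an automated judge (e.g. a StrongReject-style grader); here both are left as injectable callables so the loop structure stays visible.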
Related papers
- SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [53.97948802255959]
We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.
arXiv Detail & Related papers (2026-02-06T16:44:57Z) - Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models [33.30628603365359]
Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks. We introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. We conduct comprehensive experiments against state-of-the-art models, including GPT-5 and Claude 3.7 Sonnet.
arXiv Detail & Related papers (2026-01-09T00:27:08Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [2.6799007584079884]
AutoAdv is a training-free framework for automated multi-turn jailbreaking. It achieves up to 95% attack success rate on Llama-3.1-8B within six turns.
arXiv Detail & Related papers (2025-11-04T08:56:28Z) - GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication [55.63412213263305]
Large Language Models pose notable safety risks due to potential misuse for malicious purposes. We propose a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs.
arXiv Detail & Related papers (2025-06-22T03:15:05Z) - MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming [38.25556351567948]
We propose the Multi-Turn Safety Alignment (MTSA) framework for securing large language models. The red-team model learns thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction.
arXiv Detail & Related papers (2025-05-22T08:22:57Z) - X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents [76.66287343106204]
X-Teaming is a framework that explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model. XGuard-Train is an open-source multi-turn safety training dataset that is 20x larger than the previous best resource.
arXiv Detail & Related papers (2025-04-15T16:11:28Z) - Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions. AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z) - MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation (RACE) is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.