AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
- URL: http://arxiv.org/abs/2511.02376v2
- Date: Sun, 09 Nov 2025 05:14:14 GMT
- Title: AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
- Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban, Kevin Zhu,
- Abstract summary: AutoAdv is a training-free framework for automated multi-turn jailbreaking.<n>It achieves up to 95% attack success rate on Llama-3.1-8B within six turns.
- Score: 2.6799007584079884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
Related papers
- SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [53.97948802255959]
We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data.<n>Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts.<n>We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.
arXiv Detail & Related papers (2026-02-06T16:44:57Z) - PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits [0.12744523252873352]
PLAGUE is a plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents.<n>We show that PLAGUE achieves state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models.
arXiv Detail & Related papers (2025-10-20T17:37:03Z) - Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [63.803415430308114]
Current large language models are vulnerable to adversarial attacks in multi-turn interaction settings.<n>We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search.<n>Our approach achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-02T17:57:05Z) - Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.<n>We present AutoAdv, a novel framework that automates adversarial prompt generation.<n>We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z) - X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents [76.66287343106204]
X-Teaming is a framework that explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios.<n>X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model.<n>XGuard-Train is an open-source multi-turn safety training dataset that is 20x larger than the previous best resource.
arXiv Detail & Related papers (2025-04-15T16:11:28Z) - Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.<n>AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.<n> Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimize complex attack strategies.<n>By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a boarder range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.