Related papers: AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

URL: http://arxiv.org/abs/2511.02376v2
Date: Sun, 09 Nov 2025 05:14:14 GMT
Title: AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban, Kevin Zhu,
Abstract summary: AutoAdv is a training-free framework for automated multi-turn jailbreaking.<n>It achieves up to 95% attack success rate on Llama-3.1-8B within six turns.
Score: 2.6799007584079884
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

Related papers

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [53.97948802255959]
We propose a framework that trains a multi-turn attacker without relying on any existing strategies or external data.<n>Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts.<n>We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.
arXiv Detail & Related papers (2026-02-06T16:44:57Z)
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits [0.12744523252873352]
PLAGUE is a plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents.<n>We show that PLAGUE achieves state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models.
arXiv Detail & Related papers (2025-10-20T17:37:03Z)
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [63.803415430308114]
Current large language models are vulnerable to adversarial attacks in multi-turn interaction settings.<n>We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search.<n>Our approach achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-02T17:57:05Z)
Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.<n>We present AutoAdv, a novel framework that automates adversarial prompt generation.<n>We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z)
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents [76.66287343106204]
X-Teaming is a framework that explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios.<n>X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model.<n>XGuard-Train is an open-source multi-turn safety training dataset that is 20x larger than the previous best resource.
arXiv Detail & Related papers (2025-04-15T16:11:28Z)
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.<n>AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.<n> Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimize complex attack strategies.<n>By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a boarder range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.