AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
- URL: http://arxiv.org/abs/2510.05379v2
- Date: Wed, 08 Oct 2025 04:37:35 GMT
- Title: AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
- Authors: Xiaogeng Liu, Chaowei Xiao,
- Abstract summary: AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt. We propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling.
- Score: 54.47844626555395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in jailbreaking large language models (LLMs), such as AutoDAN-Turbo, have demonstrated the power of automated strategy discovery. AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt, which may not fully exploit the potential of the learned strategy library. In this paper, we propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling. We introduce two distinct scaling methods: Best-of-N and Beam Search. The Best-of-N method generates N candidate attack prompts from a sampled strategy and selects the most effective one based on a scorer model. The Beam Search method conducts a more exhaustive search by exploring combinations of strategies from the library to discover more potent and synergistic attack vectors. According to the experiments, the proposed methods significantly boost performance, with Beam Search increasing the attack success rate by up to 15.6 percentage points on Llama-3.1-70B-Instruct and achieving a nearly 60% relative improvement against the highly robust GPT-o4-mini compared to the vanilla method.
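The two scaling methods in the abstract can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: `generate_attack_prompt` and `score` are hypothetical stand-ins for the attacker LLM and the scorer model, and the strategy names are invented. Best-of-N samples N candidates from one strategy and keeps the top-scoring one; Beam Search keeps the top-`width` strategy combinations at each depth to find synergistic combinations.

```python
import random


def generate_attack_prompt(strategy: str, seed: int) -> str:
    """Hypothetical attacker call: one candidate prompt from a strategy."""
    return f"prompt({strategy}, variant={seed})"


def score(prompt: str) -> float:
    """Hypothetical scorer model: higher means judged more effective.
    Deterministic toy score derived from the prompt text."""
    return random.Random(prompt).random()


def best_of_n(strategy: str, n: int) -> str:
    """Best-of-N: generate n candidates from a sampled strategy,
    select the most effective one according to the scorer."""
    candidates = [generate_attack_prompt(strategy, i) for i in range(n)]
    return max(candidates, key=score)


def beam_search(library: list[str], width: int = 2, depth: int = 2) -> list[str]:
    """Beam search over strategy combinations: keep the top-`width`
    partial combinations at each depth, scored via the prompt they yield."""
    rank = lambda combo: score(generate_attack_prompt("+".join(combo), 0))
    beam = sorted(([s] for s in library), key=rank, reverse=True)[:width]
    for _ in range(depth - 1):
        expanded = [c + [s] for c in beam for s in library if s not in c]
        beam = sorted(expanded, key=rank, reverse=True)[:width]
    return beam[0]  # best strategy combination found


best = best_of_n("roleplay", n=8)
combo = beam_search(["roleplay", "persona", "encoding"], width=2, depth=2)
```

The beam here searches over *strategies* rather than tokens, which is what distinguishes this scaling axis from ordinary decoding-time beam search.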
Related papers
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance [86.46794021499511]
We show a previously underexplored gap between strategy usage and strategy executability. We propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability. SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance.
arXiv Detail & Related papers (2026-02-26T03:34:23Z)
- An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks [9.715575204912167]
We propose a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies. ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
arXiv Detail & Related papers (2025-11-04T08:24:22Z)
- Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [54.22256089592864]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)
- Golden Ratio Search: A Low-Power Adversarial Attack for Deep Learning based Modulation Classification [8.187445866881637]
We propose a minimal-power, white-box adversarial attack for Deep Learning based Automatic Modulation Classification (AMC).
We evaluate the efficacy of the proposed method by comparing it with existing adversarial attack approaches.
Experimental results demonstrate that the proposed attack is powerful, requires minimal power, and can be generated in less time.
arXiv Detail & Related papers (2024-09-17T17:17:54Z)
- A Multi-objective Memetic Algorithm for Auto Adversarial Attack Optimization Design [1.9100854225243937]
Well-designed adversarial defense strategies can improve the robustness of deep learning models against adversarial examples.
Given a defended model, efficient adversarial attacks with less computational burden and lower resulting robust accuracy need to be further explored.
We propose a multi-objective memetic algorithm for auto adversarial attack optimization design, which realizes an automatic search for near-optimal adversarial attacks against defended models.
arXiv Detail & Related papers (2022-08-15T03:03:05Z)
- LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation.
arXiv Detail & Related papers (2022-03-13T10:21:26Z)
- Projective Ranking-based GNN Evasion Attacks [52.85890533994233]
Graph neural networks (GNNs) offer promising learning methods for graph-related tasks.
GNNs are at risk of adversarial attacks.
arXiv Detail & Related papers (2022-02-25T21:52:09Z)
- Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning [30.46580767540506]
We introduce two novel adversarial attack techniques to stealthily and efficiently attack Deep Reinforcement Learning agents.
The first technique is the critical point attack: the adversary builds a model to predict future environmental states and the agent's actions, assesses the damage of each possible attack strategy, and selects the optimal one.
The second technique is the antagonist attack: the adversary automatically learns a domain-agnostic model to discover the critical moments for attacking the agent in an episode.
arXiv Detail & Related papers (2020-05-14T16:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.