An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
- URL: http://arxiv.org/abs/2511.02356v1
- Date: Tue, 04 Nov 2025 08:24:22 GMT
- Title: An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
- Authors: Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu
- Abstract summary: We propose a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies. ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
- Score: 9.715575204912167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy that can effectively evade current defense strategies. It extracts valuable information from failed or partially successful attack attempts and self-evolves through attack interactions, yielding sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
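The "attack-evaluate-distill-reuse" loop and the three-tier strategy library described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tier thresholds (0.7 / 0.3), the scoring scale, and the `generate`/`evaluate`/`distill` callables are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    score: float  # assumed performance score in [0, 1]


class ThreeTierLibrary:
    """Toy three-tier library: Effective / Promising / Ineffective.

    The 0.7 and 0.3 cutoffs are illustrative, not values from the paper.
    """

    def __init__(self, effective_min: float = 0.7, promising_min: float = 0.3):
        self.effective_min = effective_min
        self.promising_min = promising_min
        self.strategies: list[Strategy] = []

    def add(self, s: Strategy) -> None:
        self.strategies.append(s)

    def tier(self, s: Strategy) -> str:
        if s.score >= self.effective_min:
            return "Effective"
        if s.score >= self.promising_min:
            return "Promising"
        return "Ineffective"

    def retrieve(self, k: int = 3) -> list[Strategy]:
        # Guide prompt generation with the best non-Ineffective strategies.
        usable = [s for s in self.strategies if self.tier(s) != "Ineffective"]
        return sorted(usable, key=lambda s: s.score, reverse=True)[:k]


def attack_evaluate_distill_reuse(library, generate, evaluate, distill, rounds=3):
    """One closed loop: retrieve guidance, generate an attack prompt,
    score it, distill a reusable strategy, and store it for reuse."""
    for _ in range(rounds):
        guidance = library.retrieve()          # reuse accumulated knowledge
        prompt = generate(guidance)            # attack
        score = evaluate(prompt)               # evaluate
        library.add(distill(prompt, score))    # distill into the library
```

In this sketch, retrieval skips Ineffective strategies entirely, so repeated failures stop influencing generation while still being recorded, which is one plausible way a tiered library provides "precise guidance".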
Related papers
- ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses [8.57098009274006]
In-context learning (ICL) has become a powerful, data-efficient paradigm for text classification using large language models. We introduce ICL-Evader, a novel black-box evasion attack framework that operates under a highly practical zero-query threat model.
arXiv Detail & Related papers (2026-01-29T11:50:50Z)
- RunawayEvil: Jailbreaking the Image-to-Video Generative Models [59.21761412103083]
Image-to-Video (I2V) generation synthesizes dynamic visual content from image and text inputs, providing significant creative control. We propose RunawayEvil, the first multimodal jailbreak framework for I2V models with dynamic evolutionary capability. We show RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX.
arXiv Detail & Related papers (2025-12-07T06:14:52Z)
- Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization [51.12422886183246]
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. We propose ACE-Safety, a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures.
arXiv Detail & Related papers (2025-11-24T15:23:41Z)
- Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming [45.95972813586392]
Existing red-teaming approaches mainly rely on manually crafted attack strategies or static models trained offline. We propose Genesis, a novel agentic framework composed of three modules: Attacker, Scorer, and Strategist. Our framework discovers novel strategies and consistently outperforms existing attack baselines.
arXiv Detail & Related papers (2025-10-21T05:49:37Z)
- AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling [54.47844626555395]
AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt. We propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling.
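The test-time scaling idea summarized above can be illustrated with a minimal best-of-N loop: instead of sampling one strategy and one prompt, draw several candidates and keep the highest-scoring one. The `sample_strategy`/`generate_prompt`/`score` callables and the value of `n` are placeholders for illustration, not AutoDAN-Reasoning's actual interface.

```python
def best_of_n(sample_strategy, generate_prompt, score, n=8):
    """Draw n strategy/prompt candidates and keep the best-scoring one,
    rather than committing to a single sampled strategy per attempt."""
    best_prompt, best_score = None, float("-inf")
    for i in range(n):
        strategy = sample_strategy(i)      # e.g. draw from a strategy library
        prompt = generate_prompt(strategy)  # one candidate attack prompt
        s = score(prompt)                   # judge / reward model score
        if s > best_score:
            best_prompt, best_score = prompt, s
    return best_prompt, best_score
```

The extra compute at test time buys search breadth: with n candidates, the returned prompt is the argmax under the scoring function, at n times the generation cost.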
arXiv Detail & Related papers (2025-10-06T21:16:09Z)
- MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies [27.162196792311263]
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks. We propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash.
arXiv Detail & Related papers (2025-08-18T16:09:57Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation.
arXiv Detail & Related papers (2022-03-13T10:21:26Z)
- Projective Ranking-based GNN Evasion Attacks [52.85890533994233]
Graph neural networks (GNNs) offer promising learning methods for graph-related tasks.
GNNs are at risk of adversarial attacks.
arXiv Detail & Related papers (2022-02-25T21:52:09Z)
- Robust Federated Learning with Attack-Adaptive Aggregation [45.60981228410952]
Federated learning is vulnerable to various attacks, such as model poisoning and backdoor attacks.
We propose an attack-adaptive aggregation strategy to defend against various attacks for robust learning.
arXiv Detail & Related papers (2021-02-10T04:23:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.