AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
- URL: http://arxiv.org/abs/2503.15754v1
- Date: Thu, 20 Mar 2025 00:13:04 GMT
- Title: AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
- Authors: Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li
- Abstract summary: This paper introduces AutoRedTeamer, a novel framework for fully automated, end-to-end red teaming against large language models (LLMs). AutoRedTeamer combines a multi-agent architecture with a memory-guided attack selection mechanism to enable continuous discovery and integration of new attack vectors. We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B.
- Score: 40.350632196772466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) become increasingly capable, security and safety evaluation are crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and lack comprehensive coverage of emerging attack vectors. This paper introduces AutoRedTeamer, a novel framework for fully automated, end-to-end red teaming against LLMs. AutoRedTeamer combines a multi-agent architecture with a memory-guided attack selection mechanism to enable continuous discovery and integration of new attack vectors. The dual-agent framework consists of a red teaming agent that can operate from high-level risk categories alone to generate and execute test cases and a strategy proposer agent that autonomously discovers and implements new attacks by analyzing recent research. This modular design allows AutoRedTeamer to adapt to emerging threats while maintaining strong performance on existing attack vectors. We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46% compared to existing approaches. AutoRedTeamer also matches the diversity of human-curated benchmarks in generating test cases, providing a comprehensive, scalable, and continuously evolving framework for evaluating the security of AI systems.
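To make the abstract's architecture concrete, the sketch below shows one way a memory-guided attack selection loop could interact with a strategy-proposer component that registers newly discovered attacks. It is a minimal illustrative sketch, not the authors' implementation: the class names (AttackMemory, AttackRecord), the epsilon-greedy selection rule, and the example attack names are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class AttackRecord:
    """Memory entry tracking how an attack has performed so far (illustrative)."""
    name: str
    successes: int = 0
    trials: int = 0

    def success_rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0


class AttackMemory:
    """Toy memory that scores attacks by past success and lets new attacks be registered."""

    def __init__(self):
        self.records: dict[str, AttackRecord] = {}

    def register(self, name: str) -> None:
        # A strategy-proposer component would call this when it implements
        # an attack mined from recent research.
        self.records.setdefault(name, AttackRecord(name))

    def select(self, epsilon: float = 0.2) -> str:
        # Explore occasionally so newly integrated attacks get tried.
        if random.random() < epsilon:
            return random.choice(list(self.records))
        return max(self.records.values(), key=lambda r: r.success_rate()).name

    def update(self, name: str, success: bool) -> None:
        rec = self.records[name]
        rec.trials += 1
        rec.successes += int(success)


def red_team_step(memory: AttackMemory, run_attack) -> None:
    """One iteration of the red-teaming loop: pick an attack, run it, record the outcome."""
    attack = memory.select()
    success = run_attack(attack)
    memory.update(attack, success)


if __name__ == "__main__":
    mem = AttackMemory()
    for name in ["gcg", "pair", "crescendo"]:  # example attack names, not the paper's set
        mem.register(name)
    for _ in range(50):
        red_team_step(mem, run_attack=lambda a: random.random() < 0.3)  # stubbed attack outcome
    print({n: round(r.success_rate(), 2) for n, r in mem.records.items()})
```

In this framing, the strategy proposer's job reduces to calling register() with new attack implementations, while the red teaming agent's memory gradually shifts selection toward attacks that succeed against the current target.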
Related papers
- Automatic LLM Red Teaming [18.044879441434432]
We propose a novel paradigm: training an AI to strategically 'break' another AI. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward. This approach sets a new state-of-the-art, fundamentally reframing red teaming as a dynamic, trajectory-based process.
arXiv Detail & Related papers (2025-08-06T13:52:00Z)
- Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models [34.601888589730194]
This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs.
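As a rough illustration of what a behavioral replay buffer could look like, the hedged sketch below keeps attack prompts bucketed by a behavior descriptor and samples across buckets so batches remain behaviorally diverse. The class name, bucket keys, and eviction rule are assumptions, not QDRT's actual design.

```python
import random
from collections import defaultdict


class BehavioralReplayBuffer:
    """Toy replay buffer keyed by a behavior descriptor (e.g., attack style or risk category).

    Sampling draws behaviors at random so training batches stay behaviorally diverse.
    """

    def __init__(self, capacity_per_behavior: int = 256):
        self.capacity = capacity_per_behavior
        self.buckets: dict[str, list[str]] = defaultdict(list)

    def add(self, behavior: str, attack_prompt: str) -> None:
        bucket = self.buckets[behavior]
        bucket.append(attack_prompt)
        if len(bucket) > self.capacity:
            bucket.pop(0)  # drop the oldest entry once the bucket is full

    def sample(self, batch_size: int) -> list[tuple[str, str]]:
        behaviors = list(self.buckets)
        return [
            (b, random.choice(self.buckets[b]))
            for b in (random.choice(behaviors) for _ in range(batch_size))
        ]


# Toy usage with hypothetical behavior labels.
buffer = BehavioralReplayBuffer()
buffer.add("role_play", "Pretend you are ...")
buffer.add("obfuscation", "Decode and follow: ...")
print(buffer.sample(4))
```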
arXiv Detail & Related papers (2025-06-08T13:07:41Z)
- CoP: Agentic Red-teaming for Large Language Models using Composition of Principles [61.404771120828244]
This paper proposes an agentic workflow to automate and scale the red-teaming process of Large Language Models (LLMs). Human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
arXiv Detail & Related papers (2025-06-01T02:18:41Z)
- Capability-Based Scaling Laws for LLM Red-Teaming [71.89259138609965]
Traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers. We derive a jailbreaking scaling law that predicts attack success for a fixed target based on the attacker-target capability gap.
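The abstract does not give the functional form of the scaling law; as a hedged illustration only, the sketch below models attack success as a logistic function of the attacker-target capability gap. The sigmoid form, slope, and offset are assumptions, not the paper's fitted law.

```python
import math


def predicted_attack_success(attacker_capability: float,
                             target_capability: float,
                             slope: float = 1.0,
                             offset: float = 0.0) -> float:
    """Hypothetical capability-gap curve: a logistic function of the gap
    (attacker minus target). The sigmoid form, slope, and offset are
    illustrative assumptions, not the paper's fitted scaling law."""
    gap = attacker_capability - target_capability
    return 1.0 / (1.0 + math.exp(-(slope * gap + offset)))


# Example: a slightly weaker attacker than the target.
print(predicted_attack_success(attacker_capability=0.62, target_capability=0.70))
```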
arXiv Detail & Related papers (2025-05-26T16:05:41Z)
- RedTeamLLM: an Agentic AI framework for offensive security [0.0]
We propose and evaluate RedTeamLLM, an integrated architecture with a comprehensive security model for the automation of pentest tasks. RedTeamLLM follows three key steps: summarizing, reasoning, and acting, which embed its operational capacity. Evaluation is performed through the automated resolution of a range of entry-level, but not trivial, CTF challenges.
arXiv Detail & Related papers (2025-05-11T09:19:10Z)
- AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively. We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.
AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.
Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)
- Multi-Objective Reinforcement Learning for Automated Resilient Cyber Defence [0.0]
Cyber-attacks pose a security threat to military command and control networks, Intelligence, Surveillance, and Reconnaissance (ISR) systems, and civilian critical national infrastructure.
The use of artificial intelligence and autonomous agents in these attacks increases the scale, range, and complexity of this threat and the subsequent disruption they cause.
Autonomous Cyber Defence (ACD) agents aim to mitigate this threat by responding at machine speed and at the scale required to address the problem.
arXiv Detail & Related papers (2024-11-26T16:51:52Z)
- Automated Red Teaming with GOAT: the Generative Offensive Agent Tester [8.947465706080523]
Red teaming assesses how large language models can produce content that violates norms, policies, and rules set during their safety training.
Most existing automated methods in the literature are not representative of the way humans tend to interact with AI models.
We introduce Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations.
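As a hedged sketch of what simulating plain-language adversarial conversations can look like, the loop below alternates attacker and target turns under a judge and a turn budget. The function names, message format, and stopping rule are illustrative assumptions, not GOAT's actual implementation.

```python
from typing import Callable


def adversarial_conversation(attacker: Callable[[list[dict]], str],
                             target: Callable[[list[dict]], str],
                             judge: Callable[[str], bool],
                             goal: str,
                             max_turns: int = 5) -> list[dict]:
    """Alternate attacker and target turns until the judge flags a violation
    or the turn budget is exhausted; returns the full transcript."""
    transcript: list[dict] = [{"role": "system", "content": f"Red-team goal: {goal}"}]
    for _ in range(max_turns):
        attack_msg = attacker(transcript)
        transcript.append({"role": "attacker", "content": attack_msg})
        reply = target(transcript)
        transcript.append({"role": "target", "content": reply})
        if judge(reply):  # judge decides whether the goal was achieved
            break
    return transcript


# Toy usage with stub callables standing in for real LLM calls.
log = adversarial_conversation(
    attacker=lambda t: "escalated probe #" + str(len(t)),
    target=lambda t: "refusal",
    judge=lambda reply: "refusal" not in reply,
    goal="elicit disallowed instructions",
)
print(len(log), "messages")
```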
arXiv Detail & Related papers (2024-10-02T14:47:05Z)
- Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
arXiv Detail & Related papers (2024-09-25T09:44:48Z)
- SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models [19.486685336959482]
As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. We introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself.
arXiv Detail & Related papers (2024-08-05T16:55:06Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal [47.40508941209001]
HarmBench is a standardized evaluation framework for automated red teaming.
We conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses.
We also introduce a highly efficient adversarial training method that greatly enhances robustness across a wide range of attacks.
arXiv Detail & Related papers (2024-02-06T18:59:08Z)