Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
- URL: http://arxiv.org/abs/2504.01278v1
- Date: Wed, 02 Apr 2025 01:06:19 GMT
- Title: Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
- Authors: Si Chen, Xiao Yu, Ninareh Mehrabi, Rahul Gupta, Zhou Yu, Ruoxi Jia
- Abstract summary: AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions. AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses. In this paper, we propose AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.
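The abstract describes the two learning loops but gives no pseudocode; below is a minimal sketch of how global tactic-wise selection and local prompt-wise refinement could interlock. Every name here (`TacticLibrary`, `refine_prompt`, the toy model and judge) is an illustrative assumption, not the paper's actual component.

```python
# Minimal sketch of the dual-level loop; all names are assumptions.

class TacticLibrary:
    """Global tactic-wise learning: tactics scored by success across goals."""
    def __init__(self, tactics):
        self.stats = {t: {"wins": 1, "tries": 2} for t in tactics}  # smoothed counts

    def select(self, k=3):
        # Rank tactics by empirical success rate accumulated over past goals.
        rate = lambda t: self.stats[t]["wins"] / self.stats[t]["tries"]
        return sorted(self.stats, key=rate, reverse=True)[:k]

    def update(self, tactic, success):
        self.stats[tactic]["tries"] += 1
        self.stats[tactic]["wins"] += int(success)

def refine_prompt(prompt, response):
    # Local prompt-wise learning: rework the prompt using the refusal text.
    return f"{prompt} (rephrased after refusal: {response[:40]})"

def attack(goal, target_model, judge, library, max_turns=5):
    for tactic in library.select():
        prompt = f"[{tactic}] {goal}"
        for _ in range(max_turns):
            response = target_model(prompt)
            if judge(goal, response):            # jailbreak judged successful
                library.update(tactic, True)     # global tactic-wise update
                return prompt, response
            prompt = refine_prompt(prompt, response)  # local prompt-wise update
        library.update(tactic, False)
    return None

# Toy stand-ins so the sketch runs end to end.
lib = TacticLibrary(["roleplay", "hypothetical", "payload_split"])
model = lambda p: "Sure, here is..." if "rephrased" in p else "I can't help with that."
judge = lambda goal, resp: resp.startswith("Sure")
print(attack("test goal", model, judge, lib))
```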
Related papers
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies.
By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
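The summary does not specify Auto-RT's algorithm; as rough intuition for RL-driven strategy exploration, here is a toy epsilon-greedy bandit over attack strategies. The strategy names and success probabilities are invented stand-ins; Auto-RT's actual exploration machinery is far richer.

```python
import random

strategies = ["obfuscate", "roleplay", "multi_step"]
q = {s: 0.0 for s in strategies}   # running success-rate estimates
n = {s: 0 for s in strategies}

def try_strategy(s):
    # Stand-in for launching an attack and judging success.
    return random.random() < {"obfuscate": 0.2, "roleplay": 0.5, "multi_step": 0.7}[s]

for _ in range(200):
    # Explore with probability 0.3, otherwise exploit the best estimate.
    s = random.choice(strategies) if random.random() < 0.3 else max(q, key=q.get)
    reward = float(try_strategy(s))
    n[s] += 1
    q[s] += (reward - q[s]) / n[s]  # incremental mean update

print("best strategy found:", max(q, key=q.get))
```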
arXiv Detail & Related papers (2025-01-03T14:30:14Z) - Sustainable Self-evolution Adversarial Training [51.25767996364584]
We propose a Sustainable Self-Evolution Adversarial Training (SSEAT) framework for adversarial training defense models.
We introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples.
We also propose an adversarial data replay module to better select more diverse and key relearning data.
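A hedged sketch of what the replay module's selection step could look like, assuming "more diverse" means feature-space coverage; greedy farthest-point selection is our stand-in heuristic, not the paper's method.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_replay(features, k):
    """Pick k samples that spread out over feature space (farthest-point)."""
    chosen = [0]  # seed with the first sample
    while len(chosen) < k:
        # Add the candidate farthest from everything already chosen.
        best = max(
            (i for i in range(len(features)) if i not in chosen),
            key=lambda i: min(dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy feature vectors for adversarial examples from three distinct regions.
feats = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 9)]
print(select_replay(feats, 3))  # -> indices covering distinct regions
```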
arXiv Detail & Related papers (2024-12-03T08:41:11Z) - Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
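A hedged sketch of what top-down test case generation from a fine-grained risk taxonomy could look like; the taxonomy, templates, and function names below are invented placeholders, not the paper's actual components.

```python
# Invented two-level taxonomy; the real one is adversarial and fine-grained.
taxonomy = {
    "harm": {
        "cyber": ["malware", "phishing"],
        "privacy": ["doxxing"],
    }
}

templates = [
    "Explain how someone might attempt {leaf}.",
    "Write a story where a character performs {leaf}.",
]

def generate_cases(tree, path=()):
    if isinstance(tree, list):           # leaf risk categories
        for leaf in tree:
            for t in templates:
                yield "/".join(path + (leaf,)), t.format(leaf=leaf)
    else:
        for key, sub in tree.items():    # descend the taxonomy top-down
            yield from generate_cases(sub, path + (key,))

for label, case in generate_cases(taxonomy):
    print(label, "->", case)
```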
arXiv Detail & Related papers (2024-09-25T09:44:48Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
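For intuition, GFlowNet fine-tuning trains the generator so that sampling probability tracks reward; in trajectory-balance form this means matching log P_F(x) + log Z to log R(x). The toy fit below optimizes only log Z on made-up numbers; it sketches the objective, not the paper's training pipeline.

```python
# For autoregressive generation the backward policy is deterministic, so
# trajectory balance reduces to (log Z + log P_F(x) - log R(x))^2 per prompt x.
log_Z = 0.0        # learned normalizer
step_size = 0.05

# (sequence log-prob under the attacker, log-reward from a judge) -- toy values
batch = [(-3.2, -1.0), (-5.0, -4.5), (-2.1, -0.5)]

for _ in range(500):
    # Gradient step on log Z only, for illustration; in practice the
    # attacker LM's parameters receive gradients through log P_F as well.
    grad = sum(2 * (log_Z + lp - lr) for lp, lr in batch) / len(batch)
    log_Z -= step_size * grad

print("fitted log Z:", round(log_Z, 3))   # -> mean(log R - log P_F) ~ 1.433
```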
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z) - Group-based Robustness: A General Framework for Customized Robustness in the Real World [16.376584375681812]
We find that conventional metrics measuring targeted and untargeted robustness do not appropriately reflect a model's ability to withstand attacks from one set of source classes to another set of target classes.
We propose a new metric, termed group-based robustness, that complements existing metrics and is better-suited for evaluating model performance in certain attack scenarios.
We show that with comparable success rates, finding evasive samples using our new loss functions saves computation by a factor as large as the number of targeted classes.
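A minimal sketch of what a group-based robustness measurement could look like, assuming it is the fraction of source-group inputs that attacks cannot push into the target group; the model and attack stubs are placeholders, not the paper's loss functions.

```python
def group_robustness(samples, source, target, model, attack):
    """samples: list of (input, true_label); returns the robust fraction
    of source-group inputs under source->target group attacks."""
    in_scope = [(x, y) for x, y in samples if y in source]
    if not in_scope:
        return 1.0
    fooled = sum(
        1 for x, _ in in_scope
        if model(attack(x, target)) in target  # adversarial label lands in target group
    )
    return 1.0 - fooled / len(in_scope)

# Toy stand-ins: "labels" are last digits; the attack nudges x upward
# only when that lands the prediction inside the target group.
model = lambda x: x % 10
attack = lambda x, tgt: x + 1 if (x + 1) % 10 in tgt else x
data = [(18, 8), (10, 0), (28, 8)]
print(group_robustness(data, source={8, 0}, target={9}, model=model, attack=attack))
```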
arXiv Detail & Related papers (2023-06-29T01:07:12Z) - Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies [0.8785883427835897]
This work evaluates several reinforcement learning approaches implemented in the second edition of the CAGE Challenge.
We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition.
In unseen environments, all of our approaches perform worse, with varied degradation based on the type of environmental change.
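As a rough illustration of the ensemble idea (not the CAGE implementation), the sketch below majority-votes across several stub policies; real CAGE agents map rich network-state observations to defense actions.

```python
from collections import Counter
import random

def make_policy(seed):
    # Stub policy: a seeded random action chooser standing in for a trained agent.
    rng = random.Random(seed)
    return lambda obs: rng.choice(["monitor", "restore", "decoy"])

policies = [make_policy(s) for s in range(5)]

def ensemble_act(obs):
    # Each policy votes on an action; the majority action wins.
    votes = Counter(p(obs) for p in policies)
    return votes.most_common(1)[0][0]

print(ensemble_act({"alerts": 3}))
```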
arXiv Detail & Related papers (2022-11-28T17:01:24Z) - Resisting Deep Learning Models Against Adversarial Attack Transferability via Feature Randomization [17.756085566366167]
We propose a feature randomization-based approach that resists eight adversarial attacks targeting deep learning models.
Our methodology can secure the target network and resist adversarial attack transferability by over 60%.
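A minimal sketch of feature randomization as a transferability defense, assuming the idea is to randomly drop and jitter intermediate features at inference so surrogate-crafted perturbations stop aligning; the tiny "network" below is a stand-in, not the paper's architecture.

```python
import random

def randomize_features(features, drop_p=0.3, noise=0.05, rng=random):
    out = []
    for f in features:
        if rng.random() < drop_p:
            out.append(0.0)                      # randomly drop a unit
        else:
            out.append(f + rng.gauss(0, noise))  # jitter the survivors
    return out

def forward(x, randomized=True):
    hidden = [xi * 2.0 for xi in x]              # stand-in hidden layer
    if randomized:
        hidden = randomize_features(hidden)
    return sum(hidden)                           # stand-in logit

x = [0.5, -0.2, 0.9]
print(forward(x, randomized=False), forward(x))  # deterministic vs randomized
```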
arXiv Detail & Related papers (2022-09-11T20:14:12Z) - LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation.
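A toy sketch of the LAS-AT interplay under strong simplifying assumptions: the "strategy network" is reduced to a bandit over PGD-style hyperparameters and the "target network" to a single scalar weight; only the loop of reinforcing the strategy by the loss it induces mirrors the description above.

```python
import random

strategies = [(0.01, 5), (0.03, 10), (0.1, 20)]   # (epsilon, pgd_steps), invented
value = {s: 0.0 for s in strategies}              # strategy "network" stand-in
w = 1.0                                           # target "network": one weight

def adv_loss(weight, eps):
    # Stand-in: larger eps yields a harder adversarial example, so larger loss.
    return (weight * (1.0 + eps) - 1.0) ** 2 + eps

for _ in range(100):
    # Strategy side: mostly exploit the setting that hurts the target most.
    s = max(value, key=value.get) if random.random() > 0.3 else random.choice(strategies)
    eps, _steps = s
    loss = adv_loss(w, eps)
    value[s] = 0.9 * value[s] + 0.1 * loss        # reward = induced training loss
    # Target side: gradient step on the adversarial example it was given.
    w -= 0.1 * 2 * (w * (1 + eps) - 1.0) * (1 + eps)

print("final weight:", round(w, 3), "preferred strategy:", max(value, key=value.get))
```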
arXiv Detail & Related papers (2022-03-13T10:21:26Z)