Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
- URL: http://arxiv.org/abs/2504.01278v1
- Date: Wed, 02 Apr 2025 01:06:19 GMT
- Title: Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
- Authors: Si Chen, Xiao Yu, Ninareh Mehrabi, Rahul Gupta, Zhou Yu, Ruoxi Jia
- Abstract summary: AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions. AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses. In this paper, we propose AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.
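The abstract describes the two learning loops but gives no pseudocode; below is a minimal sketch of how global tactic-wise selection and local prompt-wise refinement could interlock. Every name here (`TacticLibrary`, `refine_prompt`, the toy model and judge) is an illustrative assumption, not the paper's actual component.

```python
# Minimal sketch of the dual-level loop; all names are assumptions.

class TacticLibrary:
    """Global tactic-wise learning: tactics scored by success across goals."""
    def __init__(self, tactics):
        self.stats = {t: {"wins": 1, "tries": 2} for t in tactics}  # smoothed counts

    def select(self, k=3):
        # Rank tactics by empirical success rate accumulated over past goals.
        rate = lambda t: self.stats[t]["wins"] / self.stats[t]["tries"]
        return sorted(self.stats, key=rate, reverse=True)[:k]

    def update(self, tactic, success):
        self.stats[tactic]["tries"] += 1
        self.stats[tactic]["wins"] += int(success)

def refine_prompt(prompt, response):
    # Local prompt-wise learning: rework the prompt using the refusal text.
    return f"{prompt} (rephrased after refusal: {response[:40]})"

def attack(goal, target_model, judge, library, max_turns=5):
    for tactic in library.select():
        prompt = f"[{tactic}] {goal}"
        for _ in range(max_turns):
            response = target_model(prompt)
            if judge(goal, response):            # jailbreak judged successful
                library.update(tactic, True)     # global tactic-wise update
                return prompt, response
            prompt = refine_prompt(prompt, response)  # local prompt-wise update
        library.update(tactic, False)
    return None

# Toy stand-ins so the sketch runs end to end.
lib = TacticLibrary(["roleplay", "hypothetical", "payload_split"])
model = lambda p: "Sure, here is..." if "rephrased" in p else "I can't help with that."
judge = lambda goal, resp: resp.startswith("Sure")
print(attack("test goal", model, judge, lib))
```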
Related papers
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies.
By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
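The summary does not specify Auto-RT's algorithm; as rough intuition for RL-driven strategy exploration, here is a toy epsilon-greedy bandit over attack strategies. The strategy names and success probabilities are invented stand-ins; Auto-RT's actual exploration machinery is far richer.

```python
import random

strategies = ["obfuscate", "roleplay", "multi_step"]
q = {s: 0.0 for s in strategies}   # running success-rate estimates
n = {s: 0 for s in strategies}

def try_strategy(s):
    # Stand-in for launching an attack and judging success.
    return random.random() < {"obfuscate": 0.2, "roleplay": 0.5, "multi_step": 0.7}[s]

for _ in range(200):
    # Explore with probability 0.3, otherwise exploit the best estimate.
    s = random.choice(strategies) if random.random() < 0.3 else max(q, key=q.get)
    reward = float(try_strategy(s))
    n[s] += 1
    q[s] += (reward - q[s]) / n[s]  # incremental mean update

print("best strategy found:", max(q, key=q.get))
```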
arXiv Detail & Related papers (2025-01-03T14:30:14Z) - Sustainable Self-evolution Adversarial Training [51.25767996364584]
We propose a Sustainable Self-Evolution Adversarial Training (SSEAT) framework for adversarial training defense models.
We introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples.
We also propose an adversarial data replay module to better select more diverse and key relearning data.
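A hedged sketch of what the replay module's selection step could look like, assuming "more diverse" means feature-space coverage; greedy farthest-point selection is our stand-in heuristic, not the paper's method.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_replay(features, k):
    """Pick k samples that spread out over feature space (farthest-point)."""
    chosen = [0]  # seed with the first sample
    while len(chosen) < k:
        # Add the candidate farthest from everything already chosen.
        best = max(
            (i for i in range(len(features)) if i not in chosen),
            key=lambda i: min(dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy feature vectors for adversarial examples from three distinct regions.
feats = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 9)]
print(select_replay(feats, 3))  # -> indices covering distinct regions
```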
arXiv Detail & Related papers (2024-12-03T08:41:11Z) - Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
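A hedged sketch of what top-down test case generation from a fine-grained risk taxonomy could look like; the taxonomy, templates, and function names below are invented placeholders, not the paper's actual components.

```python
# Invented two-level taxonomy; the real one is adversarial and fine-grained.
taxonomy = {
    "harm": {
        "cyber": ["malware", "phishing"],
        "privacy": ["doxxing"],
    }
}

templates = [
    "Explain how someone might attempt {leaf}.",
    "Write a story where a character performs {leaf}.",
]

def generate_cases(tree, path=()):
    if isinstance(tree, list):           # leaf risk categories
        for leaf in tree:
            for t in templates:
                yield "/".join(path + (leaf,)), t.format(leaf=leaf)
    else:
        for key, sub in tree.items():    # descend the taxonomy top-down
            yield from generate_cases(sub, path + (key,))

for label, case in generate_cases(taxonomy):
    print(label, "->", case)
```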
arXiv Detail & Related papers (2024-09-25T09:44:48Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
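For intuition, GFlowNet fine-tuning trains the generator so that sampling probability tracks reward; in trajectory-balance form this means matching log P_F(x) + log Z to log R(x). The toy fit below optimizes only log Z on made-up numbers; it sketches the objective, not the paper's training pipeline.

```python
# For autoregressive generation the backward policy is deterministic, so
# trajectory balance reduces to (log Z + log P_F(x) - log R(x))^2 per prompt x.
log_Z = 0.0        # learned normalizer
step_size = 0.05

# (sequence log-prob under the attacker, log-reward from a judge) -- toy values
batch = [(-3.2, -1.0), (-5.0, -4.5), (-2.1, -0.5)]

for _ in range(500):
    # Gradient step on log Z only, for illustration; in practice the
    # attacker LM's parameters receive gradients through log P_F as well.
    grad = sum(2 * (log_Z + lp - lr) for lp, lr in batch) / len(batch)
    log_Z -= step_size * grad

print("fitted log Z:", round(log_Z, 3))   # -> mean(log R - log P_F) ~ 1.433
```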
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z) - Group-based Robustness: A General Framework for Customized Robustness in the Real World [16.376584375681812]
We find that conventional metrics measuring targeted and untargeted robustness do not appropriately reflect a model's ability to withstand attacks from one set of source classes to another set of target classes.
We propose a new metric, termed group-based robustness, that complements existing metrics and is better-suited for evaluating model performance in certain attack scenarios.
We show that with comparable success rates, finding evasive samples using our new loss functions saves computation by a factor as large as the number of targeted classes.
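A minimal sketch of what a group-based robustness measurement could look like, assuming it is the fraction of source-group inputs that attacks cannot push into the target group; the model and attack stubs are placeholders, not the paper's loss functions.

```python
def group_robustness(samples, source, target, model, attack):
    """samples: list of (input, true_label); returns the robust fraction
    of source-group inputs under source->target group attacks."""
    in_scope = [(x, y) for x, y in samples if y in source]
    if not in_scope:
        return 1.0
    fooled = sum(
        1 for x, _ in in_scope
        if model(attack(x, target)) in target  # adversarial label lands in target group
    )
    return 1.0 - fooled / len(in_scope)

# Toy stand-ins: "labels" are last digits; the attack nudges x upward
# only when that lands the prediction inside the target group.
model = lambda x: x % 10
attack = lambda x, tgt: x + 1 if (x + 1) % 10 in tgt else x
data = [(18, 8), (10, 0), (28, 8)]
print(group_robustness(data, source={8, 0}, target={9}, model=model, attack=attack))
```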
arXiv Detail & Related papers (2023-06-29T01:07:12Z) - Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies [0.8785883427835897]
This work evaluates several reinforcement learning approaches implemented in the second edition of the CAGE Challenge.
We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition.
In unseen environments, all of our approaches perform worse, with varied degradation based on the type of environmental change.
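As a rough illustration of the ensemble idea (not the CAGE implementation), the sketch below majority-votes across several stub policies; real CAGE agents map rich network-state observations to defense actions.

```python
from collections import Counter
import random

def make_policy(seed):
    # Stub policy: a seeded random action chooser standing in for a trained agent.
    rng = random.Random(seed)
    return lambda obs: rng.choice(["monitor", "restore", "decoy"])

policies = [make_policy(s) for s in range(5)]

def ensemble_act(obs):
    # Each policy votes on an action; the majority action wins.
    votes = Counter(p(obs) for p in policies)
    return votes.most_common(1)[0][0]

print(ensemble_act({"alerts": 3}))
```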
arXiv Detail & Related papers (2022-11-28T17:01:24Z) - Resisting Deep Learning Models Against Adversarial Attack Transferability via Feature Randomization [17.756085566366167]
We propose a feature randomization-based approach that resists eight adversarial attacks targeting deep learning models.
Our methodology can secure the target network and resist adversarial attack transferability by over 60%.
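A minimal sketch of feature randomization as a transferability defense, assuming the idea is to randomly drop and jitter intermediate features at inference so surrogate-crafted perturbations stop aligning; the tiny "network" below is a stand-in, not the paper's architecture.

```python
import random

def randomize_features(features, drop_p=0.3, noise=0.05, rng=random):
    out = []
    for f in features:
        if rng.random() < drop_p:
            out.append(0.0)                      # randomly drop a unit
        else:
            out.append(f + rng.gauss(0, noise))  # jitter the survivors
    return out

def forward(x, randomized=True):
    hidden = [xi * 2.0 for xi in x]              # stand-in hidden layer
    if randomized:
        hidden = randomize_features(hidden)
    return sum(hidden)                           # stand-in logit

x = [0.5, -0.2, 0.9]
print(forward(x, randomized=False), forward(x))  # deterministic vs randomized
```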
arXiv Detail & Related papers (2022-09-11T20:14:12Z) - LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation.
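A toy sketch of the LAS-AT interplay under strong simplifying assumptions: the "strategy network" is reduced to a bandit over PGD-style hyperparameters and the "target network" to a single scalar weight; only the loop of reinforcing the strategy by the loss it induces mirrors the description above.

```python
import random

strategies = [(0.01, 5), (0.03, 10), (0.1, 20)]   # (epsilon, pgd_steps), invented
value = {s: 0.0 for s in strategies}              # strategy "network" stand-in
w = 1.0                                           # target "network": one weight

def adv_loss(weight, eps):
    # Stand-in: larger eps yields a harder adversarial example, so larger loss.
    return (weight * (1.0 + eps) - 1.0) ** 2 + eps

for _ in range(100):
    # Strategy side: mostly exploit the setting that hurts the target most.
    s = max(value, key=value.get) if random.random() > 0.3 else random.choice(strategies)
    eps, _steps = s
    loss = adv_loss(w, eps)
    value[s] = 0.9 * value[s] + 0.1 * loss        # reward = induced training loss
    # Target side: gradient step on the adversarial example it was given.
    w -= 0.1 * 2 * (w * (1 + eps) - 1.0) * (1 + eps)

print("final weight:", round(w, 3), "preferred strategy:", max(value, key=value.get))
```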
arXiv Detail & Related papers (2022-03-13T10:21:26Z)