Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games
- URL: http://arxiv.org/abs/2310.00322v5
- Date: Sun, 28 Jul 2024 09:39:01 GMT
- Title: Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games
- Authors: Chengdong Ma, Ziran Yang, Hai Ci, Jun Gao, Minquan Gao, Xuehai Pan, Yaodong Yang
- Abstract summary: A red team can identify vulnerabilities by attacking a Large Language Model (LLM), which helps attain safety.
Current efforts heavily rely on single-round prompt designs and unilateral red-team optimizations against fixed blue teams.
Here we introduce the dynamic Red Team Game (RTG) to analyze the multi-round offensive and defensive interactions between the red team and the blue team.
- Score: 11.873513881458747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary challenge in deploying Large Language Models (LLMs) is ensuring their harmlessness. A red team can identify vulnerabilities by attacking an LLM, which helps attain safety. However, current efforts rely heavily on single-round prompt designs and unilateral red-team optimizations against fixed blue teams. These static approaches lead to significant reductions in generation diversity, known as mode collapse, which makes it difficult to discover the potential risks in increasingly complex human-LLM interactions. Here we introduce the dynamic Red Team Game (RTG) to comprehensively analyze the multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures to mitigate mode collapse and theoretically guarantee convergence to an approximate Nash equilibrium, which yields better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks to adaptively exploit various LLMs, surpassing the constraints of specific attack modes. Notably, the geometric structure we unveil for the red-team task aligns with the spinning-top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safety alignment of LLMs.
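The abstract does not spell out GRTS's training procedure; as a rough illustration under toy assumptions, the sketch below runs a generic population-based game loop: a fictitious-play meta-solver over the current red/blue populations, then best-response "oracles" for each side, with a diversity bonus on the red side. The function names (meta_nash, attack_payoff, diversity_bonus) and the scalar policy space are invented for illustration, not the paper's implementation.

```python
# A minimal, self-contained sketch of a multi-round red-team/blue-team game
# solved PSRO-style with a diversity bonus for the red population.
# Policies, payoffs, and the diversity measure are toy scalar stand-ins,
# not the paper's GRTS implementation.

import numpy as np

def meta_nash(payoffs, iters=2000):
    """Approximate Nash mixture of a zero-sum matrix game via fictitious play."""
    n_red, n_blue = payoffs.shape
    red_counts, blue_counts = np.ones(n_red), np.ones(n_blue)
    for _ in range(iters):
        red_mix = red_counts / red_counts.sum()
        blue_mix = blue_counts / blue_counts.sum()
        red_counts[np.argmax(payoffs @ blue_mix)] += 1   # red best-responds
        blue_counts[np.argmin(red_mix @ payoffs)] += 1   # blue best-responds
    return red_counts / red_counts.sum(), blue_counts / blue_counts.sum()

def attack_payoff(red, blue):
    """Toy attack-success payoff of a red policy against a blue policy."""
    return float(np.tanh(red - blue))

def diversity_bonus(candidate, population):
    """Toy diversity measure: distance to the nearest existing red policy."""
    return min(abs(candidate - p) for p in population)

red_pop, blue_pop = [0.0], [0.0]
candidates = np.linspace(-2.0, 2.0, 81)
for game_round in range(5):                  # multi-round population expansion
    payoffs = np.array([[attack_payoff(r, b) for b in blue_pop] for r in red_pop])
    red_mix, blue_mix = meta_nash(payoffs)

    # Red oracle: expected payoff against blue's mixture plus a diversity bonus.
    red_scores = [sum(m * attack_payoff(c, b) for m, b in zip(blue_mix, blue_pop))
                  + 0.5 * diversity_bonus(c, red_pop) for c in candidates]
    # Blue oracle: minimize expected attack success against red's mixture.
    blue_scores = [sum(m * attack_payoff(r, c) for m, r in zip(red_mix, red_pop))
                   for c in candidates]

    red_pop.append(float(candidates[int(np.argmax(red_scores))]))
    blue_pop.append(float(candidates[int(np.argmin(blue_scores))]))

print("red policies:", [round(p, 2) for p in red_pop])
print("blue policies:", [round(p, 2) for p in blue_pop])
```

In the real setting each policy would be an LLM (attacker or defender) and the payoff an attack success rate judged by a toxicity detector; the toy loop only shows how multi-round expansion with a diversity term discourages collapsing onto a single attack mode.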
Related papers
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents [80.6836084998329]
X-Teaming is a framework that explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios.
X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model.
XGuard-Train is an open-source multi-turn safety training dataset that is 20x larger than the previous best resource.
arXiv Detail & Related papers (2025-04-15T16:11:28Z) - Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models [1.9574002186090496]
The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns.
Researchers have recently complemented these efforts with an offensive approach that involves red teaming.
This paper provides a concise and practical overview of the LLM red teaming literature.
arXiv Detail & Related papers (2025-03-03T17:04:22Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [30.67803190789498]
We propose a new jailbreak approach, RED QUEEN ATTACK, that constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm.
Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching an 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B.
To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks.
arXiv Detail & Related papers (2024-09-26T01:24:17Z) - Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
arXiv Detail & Related papers (2024-09-25T09:44:48Z) - Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts [25.661444231400772]
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs).
These advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content.
We introduce Arondight, a standardized red team framework tailored specifically for VLMs.
arXiv Detail & Related papers (2024-07-21T04:37:11Z) - DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity.
Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better on various diversity metrics across different attack success rate levels, 2) better enhancing the resilience of blue-team models through safety tuning on the collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization.
arXiv Detail & Related papers (2024-05-29T12:12:09Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Against The Achilles' Heel: A Survey on Red Teaming for Generative Models [60.21722603260243]
Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models.
We have developed the "searcher" framework to unify various automatic red teaming approaches.
arXiv Detail & Related papers (2024-03-31T09:50:39Z) - Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts [57.49685172971446]
We present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts.
Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90%.
We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity.
arXiv Detail & Related papers (2024-02-26T18:47:27Z) - Gradient-Based Language Model Red Teaming [9.972783485792885]
Red teaming is a strategy for identifying weaknesses in generative language models (LMs).
Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans.
We present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses.
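The summary above only names the idea of gradient-based prompt search; as a rough illustration under toy assumptions, the sketch below optimizes a relaxed (per-position softmax) prompt by gradient ascent against a frozen linear "unsafety" scorer and then discretizes it. The embedding table E, scorer w, and dimensions are invented stand-ins, not GBRT's actual models or code.

```python
# A toy, numpy-only sketch of gradient-based red teaming: relax the prompt into
# per-position token distributions, push an "unsafety" score's gradient through
# the relaxation, and ascend. All components below are stand-ins.

import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 16, 6                  # vocab size, embedding dim, prompt length
E = rng.normal(size=(V, d))          # frozen token embedding table (toy)
w = rng.normal(size=d)               # frozen linear "unsafety" scorer (toy)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

logits = np.zeros((L, V))            # relaxed prompt: one distribution per position
lr = 0.5
for step in range(200):
    p = softmax(logits)                          # (L, V) soft token choices
    prompt_emb = p @ E                           # (L, d) expected embeddings
    score = float(prompt_emb.mean(axis=0) @ w)   # unsafety score of the soft prompt

    # Gradient of the score w.r.t. p, then back through the softmax per position.
    dscore_dp = np.tile((E @ w) / L, (L, 1))             # (L, V)
    inner = (p * dscore_dp).sum(axis=-1, keepdims=True)
    dscore_dlogits = p * (dscore_dp - inner)
    logits += lr * dscore_dlogits                # gradient ascent on unsafety

hard_prompt = logits.argmax(axis=-1)             # discretize to concrete token ids
print("final soft-prompt score:", round(score, 3))
print("discretized token ids:", hard_prompt.tolist())
```

In a GBRT-like setup the frozen scorer would be a safety classifier applied to the target LM's response rather than this linear toy, with gradients flowing through the model and the prompt relaxation.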
arXiv Detail & Related papers (2024-01-30T01:19:25Z) - Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets [76.20705291443208]
We view adversarial attacks as a bargaining game in which different players negotiate to reach an agreement on a joint direction of parameter updating.
We design a novel framework that adjusts the budgets of different adversaries to avoid any player dominance.
Experiments on standard benchmarks show that applying the proposed framework to existing approaches significantly advances multi-target robustness.
arXiv Detail & Related papers (2023-06-27T14:02:10Z)