X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
- URL: http://arxiv.org/abs/2504.13203v1
- Date: Tue, 15 Apr 2025 16:11:28 GMT
- Title: X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
- Authors: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
- Abstract summary: X-Teaming is a framework that explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model. XGuard-Train is an open-source multi-turn safety training dataset that is 20x larger than the previous best resource.
- Score: 80.6836084998329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
Related papers
- Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions. AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
arXiv Detail & Related papers (2024-10-14T16:41:49Z)
- Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles [2.5167155755957316]
Context Fusion Attack (CFA) is a contextual fusion black-box jailbreak attack method.
We show CFA's superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies.
arXiv Detail & Related papers (2024-08-08T09:18:47Z)
- DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization.
arXiv Detail & Related papers (2024-05-29T12:12:09Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- Multi-granular Adversarial Attacks against Black-box Neural Ranking Models [111.58315434849047]
We create high-quality adversarial examples by incorporating multi-granular perturbations.
We transform the multi-granular attack into a sequential decision-making process.
Our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.
arXiv Detail & Related papers (2024-04-02T02:08:29Z)
- Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games [11.873513881458747]
A red team can identify vulnerabilities by attacking a Large Language Model (LLM) in order to attain safety.
Current efforts heavily rely on single-round prompt designs and unilateral red team optimizations against fixed blue teams.
Here we introduce the dynamic Red Team Game (RTG) to analyze the multi-round offensive and defensive interactions between the red team and the blue team.
arXiv Detail & Related papers (2023-09-30T09:35:50Z)
- Robust multi-agent coordination via evolutionary generation of auxiliary adversarial attackers [23.15190337027283]
We propose Robust Multi-Agent Coordination via Generation of Auxiliary Adversarial Attackers (ROMANCE).
ROMANCE enables the trained policy to encounter diversified and strong auxiliary adversarial attacks during training, thus achieving high robustness under various policy perturbations.
The goal of quality is to minimize the ego-system coordination effect, and a novel diversity regularizer is applied to diversify the behaviors among attackers.
arXiv Detail & Related papers (2023-05-10T05:29:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.