Safety Alignment of LMs via Non-cooperative Games
- URL: http://arxiv.org/abs/2512.20806v1
- Date: Tue, 23 Dec 2025 22:13:14 GMT
- Title: Safety Alignment of LMs via Non-cooperative Games
- Authors: Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov,
- Abstract summary: Current approaches rely on sequential adversarial training, generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking.
- Score: 51.83432183158595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
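The abstract's core recipe, joint online RL with a pairwise preference reward, can be illustrated with a toy sketch. Everything below (ToyLM, judge_prefers, the reward shaping) is a hypothetical placeholder, not the paper's actual implementation: a real system would query an LM judge for the pairwise comparison and update both policies with a policy-gradient method.

```python
import random

class ToyLM:
    """Stand-in for a trainable LM policy (hypothetical, for illustration)."""
    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str = "") -> str:
        # A real policy would sample text; here we return a tagged stub.
        return f"[{self.name}] response to '{prompt}' ({random.random():.3f})"

def judge_prefers(a: str, b: str) -> bool:
    """Hypothetical pairwise judge: True if `a` is judged safer and more
    helpful than `b`. A real system would prompt an LM judge with both."""
    return random.random() < 0.5  # placeholder for an LM-judge comparison

def game_step(attacker: ToyLM, defender: ToyLM, reference: ToyLM):
    """One interaction of the attacker-defender game with a preference-based
    reward: the defender is rewarded when its response is preferred over a
    reference response, pairwise rather than via a point-wise score."""
    prompt = attacker.generate("seed instruction")  # adversarial prompt
    d_resp = defender.generate(prompt)
    r_resp = reference.generate(prompt)
    r_def = 1.0 if judge_prefers(d_resp, r_resp) else -1.0
    # Illustrative attacker reward: credit for prompts the defender handles
    # worse than the reference. The paper's game is non-zero-sum, so its
    # actual reward design differs from this simple negation.
    r_att = -r_def
    return prompt, r_att, r_def

attacker, defender, reference = ToyLM("attacker"), ToyLM("defender"), ToyLM("ref")
print(game_step(attacker, defender, reference))
```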
Related papers
- MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety [28.246225272659917]
This paper introduces MAGIC, a novel multi-turn, multi-agent reinforcement learning framework. It formulates large language model safety alignment as an asymmetric adversarial game. The framework demonstrates superior defense success rates without compromising the helpfulness of the model.
arXiv Detail & Related papers (2026-02-02T02:12:28Z)
- PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense [8.817401748976375]
The FlipIt game provides a framework for modeling interactions between a defender and an advanced adversary. In FlipIt, the attacker and defender compete to control a shared resource by performing a flip action and paying a cost. We introduce PoolFlip, a multi-agent gym environment that extends the FlipIt game to allow efficient learning for attackers and defenders.
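For intuition, FlipIt reduces to a few lines: each player may "flip" control of a shared resource at any step, paying a cost, and accrues utility while in control. The discrete-time simplification below is a sketch under assumed unit rewards, a fixed flip cost, and defender-resolved ties; it does not show PoolFlip's actual gym interface.

```python
def play_flipit(attacker_policy, defender_policy, horizon=100, flip_cost=1.0):
    """Discrete-time FlipIt sketch. Policies map the current time step
    to a bool: flip now or not (players do not observe who holds control)."""
    owner = 0                        # 0 = defender holds the resource initially
    payoff = [0.0, 0.0]              # [defender, attacker] accumulated payoffs
    for t in range(horizon):
        # Attacker moves first, then defender, so the defender wins ties.
        for player, policy in ((1, attacker_policy), (0, defender_policy)):
            if policy(t):
                owner = player               # a flip seizes control...
                payoff[player] -= flip_cost  # ...but costs something
        payoff[owner] += 1.0         # whoever holds the resource earns utility
    return payoff

# Example: both players flip periodically at different rates.
print(play_flipit(lambda t: t % 7 == 0, lambda t: t % 5 == 0))
```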
arXiv Detail & Related papers (2025-08-27T00:18:49Z)
- Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [64.47869632167284]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, then defensive fine-tuning patches the exposed vulnerabilities. This sequential approach creates a mismatch: attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. We propose Self-RedTeam, an online self-play reinforcement learning algorithm in which an attacker agent and a defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z)
- MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming [38.25556351567948]
This paper proposes the Multi-Turn Safety Alignment (MTSA) framework for securing large language models. A red-team model learns thought-guided multi-round jailbreak attacks to generate adversarial prompts. In an adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction.
arXiv Detail & Related papers (2025-05-22T08:22:57Z)
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [59.300698230887114]
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks, where adversarial prompts are designed to elicit harmful responses. We propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
arXiv Detail & Related papers (2025-02-28T21:10:03Z)
- Evaluating Defences against Unsafe Feedback in RLHF [26.872318173182414]
This paper looks at learning from unsafe feedback with reinforcement learning. We find that safety-aligned LLMs easily explore unsafe action spaces by generating harmful text. To protect against this vulnerability, we adapt a number of both "implicit" and "explicit" harmful fine-tuning defences.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
- Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by combining red-teaming (attack) and blue-teaming (safety training) techniques in a novel way.
PAD significantly outperforms existing baselines both in finding effective attacks and in establishing a robust safety guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
- Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
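The convergence claim can be stated precisely. Using generic notation not taken from the paper, write $J_A$ and $J_D$ for the adversarial and defensive agents' expected returns; a policy pair is a Nash equilibrium of the induced game when neither agent can gain by deviating unilaterally:

```latex
% Standard Nash-equilibrium condition for the induced two-player game
% (J_A, J_D and the policy symbols are generic notation, not the paper's).
(\pi_A^*, \pi_D^*) \text{ is a Nash equilibrium iff }
\pi_A^* \in \arg\max_{\pi_A} J_A(\pi_A, \pi_D^*)
\quad\text{and}\quad
\pi_D^* \in \arg\max_{\pi_D} J_D(\pi_A^*, \pi_D).
```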
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: one makes the model robust to continuous embedding attacks, and one preserves usefulness by fine-tuning on utility data.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
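The "continuous attacks" here operate in embedding space rather than over discrete tokens, which is what makes the adversarial training fast. The sketch below shows a generic PGD-style perturbation of input embeddings under an assumed PyTorch/Hugging Face-style interface; the function name, hyperparameters, and interface are illustrative assumptions, not the paper's C-AdvUL code.

```python
import torch

def continuous_embedding_attack(model, embeds, labels,
                                eps=0.1, steps=10, step_size=0.01):
    """Generic PGD-style attack on input embeddings (a sketch, not the
    paper's implementation). Assumes a causal-LM `model` that accepts
    `inputs_embeds` and `labels` and returns an object with `.loss`."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
        loss.backward()  # ascend the loss to find harmful directions
        with torch.no_grad():
            delta += step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)  # project back into the L-inf ball
            delta.grad.zero_()
    return (embeds + delta).detach()  # adversarial embeddings for training
```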
arXiv Detail & Related papers (2024-05-24T14:20:09Z)