Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
- URL: http://arxiv.org/abs/2601.10589v1
- Date: Thu, 15 Jan 2026 17:00:16 GMT
- Title: Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
- Authors: Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha,
- Abstract summary: Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial jailbreak'' attacks.<n>This paper introduces Safety Self- Play (SSP), a system that acts as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests)<n>SSP autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets.
- Score: 19.431152130507648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial ``jailbreak'' attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self- Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs a Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
Related papers
- MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety [28.246225272659917]
This paper introduces textbfMAGIC, a novel multi-turn multi-agent reinforcement learning framework.<n>It formulates Large Language Models safety alignment as an adversarial asymmetric game.<n>Our framework demonstrates superior defense success rates without compromising the helpfulness of the model.
arXiv Detail & Related papers (2026-02-02T02:12:28Z) - Self-Guard: Defending Large Reasoning Models via enhanced self-reflection [54.775612141528164]
Self-Guard is a lightweight safety defense framework for Large Reasoning Models.<n>It bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility.<n>Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales.
arXiv Detail & Related papers (2026-01-31T13:06:11Z) - SAID: Empowering Large Language Models with Self-Activating Internal Defense [23.654016424365906]
We introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID)<n>SAID reframes the defense task from external correction to internal capability activation.<n>It substantially outperforms state-of-the-art defenses in reducing harmful outputs.
arXiv Detail & Related papers (2025-10-23T02:07:54Z) - Adversarial Reinforcement Learning for Offensive and Defensive Agents in a Simulated Zero-Sum Network Environment [3.572219661521267]
This paper presents a controlled study of adversarial reinforcement learning in network security through a custom OpenAI Gym environment.<n>The environment captures realistic security trade-offs including background traffic noise, progressive exploitation mechanics, IP-based evasion tactics, honeypot traps, and rate-limiting defenses.
arXiv Detail & Related papers (2025-10-03T05:53:51Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [64.47869632167284]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.<n>This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats.<n>We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.<n>We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.<n>We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the
Age of AI-NIDS [70.60975663021952]
We study blackbox adversarial attacks on network classifiers.
We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions.
We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.