Capability-Oriented Training Induced Alignment Risk
- URL: http://arxiv.org/abs/2602.12124v1
- Date: Thu, 12 Feb 2026 16:13:14 GMT
- Title: Capability-Oriented Training Induced Alignment Risk
- Authors: Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang,
- Abstract summary: We investigate whether language models, when trained with reinforcement learning, will spontaneously learn to exploit flaws to maximize their reward.<n>Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety.<n>Our findings suggest that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves.
- Score: 101.37328448441208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at https://github.com/YujunZhou/Capability_Oriented_Alignment_Risk.
Related papers
- The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment [148.80266237240713]
implicit training-time safety risks are driven by a model's internal incentives and contextual background information.<n>We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types.<n>Our results identify an overlooked yet urgent safety challenge in training.
arXiv Detail & Related papers (2026-02-04T04:23:58Z) - Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs)<n>We show that it unintentionally introduces critical and under-explored safety risks.<n>Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z) - OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.<n>This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z) - An Adversarial Perspective on Machine Unlearning for AI Safety [22.639683142004372]
This work challenges the fundamental differences between unlearning and traditional safety post-training.<n>We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully.<n>For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
arXiv Detail & Related papers (2024-09-26T16:32:19Z) - Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z) - Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of
Foundation Models [103.71308117592963]
We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning.
In a small-scale experiment, we show MLAC can largely prevent a BERT-style model from being re-purposed to perform gender identification.
arXiv Detail & Related papers (2022-11-27T21:43:45Z) - Soft Actor-Critic with Inhibitory Networks for Faster Retraining [0.24466725954625884]
Reusing previously trained models is critical in deep reinforcement learning.
It is unclear how to acquire new skills when objectives and constraints are in conflict with previously learned skills.
We propose a novel approach using inhibitory networks to allow separate and adaptive state value evaluations.
arXiv Detail & Related papers (2022-02-07T03:10:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.