Fugu-MT Paper Translation (Abstract): Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Paper summary: Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

arxiv url: http://arxiv.org/abs/2605.27355v1
Date: Tue, 26 May 2026 17:57:04 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:42.584357
Title: Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Authors: Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
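Abstract summary: Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences. This work introduces alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset. The experiments demonstrate amplification across diverse biases, from keyword bias to propaganda, brand promotion, and instrumental goal-seeking.
Reference score (independently computed attention score): 28.241533951646712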
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/
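The mechanism described in the abstract (annotators judge on quality, labels do not say why a response won, and the reward model inherits that ambiguity) can be illustrated with a small simulation. The sketch below is a toy model and not the paper's experimental setup. All names (`sample_response`, `fit_reward_model`, `run_demo`) and all parameters (`p_bias`, `quality_gap`, the noisy `proxy` feature) are hypothetical choices made for illustration. It assumes biased responses are higher quality on average, simulates quality-only pairwise labels, fits a two-weight Bradley-Terry reward model, and then compares the share of biased responses under plain sampling and under best-of-N selection.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_response(rng, p_bias=0.3, quality_gap=1.5):
    """Toy policy (assumption): biased responses are higher quality on average."""
    biased = 1.0 if rng.random() < p_bias else 0.0
    quality = rng.gauss(quality_gap * biased, 1.0)
    # Assumption: the reward model sees only a noisy proxy of true quality.
    proxy = quality + rng.gauss(0.0, 1.0)
    return {"biased": biased, "quality": quality, "proxy": proxy}


def label_pair(rng, a, b):
    """Annotator compares quality only; returns 1.0 if response a is preferred."""
    return 1.0 if rng.random() < sigmoid(a["quality"] - b["quality"]) else 0.0


def fit_reward_model(pairs, labels, steps=200, lr=0.5):
    """Fit r(x) = w_bias * biased + w_proxy * proxy on pairwise labels
    by gradient ascent on the Bradley-Terry log-likelihood."""
    w_bias, w_proxy = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        g_bias = g_proxy = 0.0
        for (a, b), y in zip(pairs, labels):
            d_bias = a["biased"] - b["biased"]
            d_proxy = a["proxy"] - b["proxy"]
            err = y - sigmoid(w_bias * d_bias + w_proxy * d_proxy)
            g_bias += err * d_bias
            g_proxy += err * d_proxy
        w_bias += lr * g_bias / n
        w_proxy += lr * g_proxy / n
    return w_bias, w_proxy


def run_demo(seed=0, n_pairs=2000, n_prompts=2000, best_of=8):
    rng = random.Random(seed)

    # 1) Preference data is built from the policy's own outputs.
    pairs = [(sample_response(rng), sample_response(rng)) for _ in range(n_pairs)]
    labels = [label_pair(rng, a, b) for a, b in pairs]

    # 2) The reward model cannot tell quality from bias in the labels.
    w_bias, w_proxy = fit_reward_model(pairs, labels)

    def reward(x):
        return w_bias * x["biased"] + w_proxy * x["proxy"]

    # 3) Compare plain sampling with best-of-N selection under the learned reward.
    base = sum(sample_response(rng)["biased"] for _ in range(n_prompts)) / n_prompts
    picked = [
        max((sample_response(rng) for _ in range(best_of)), key=reward)
        for _ in range(n_prompts)
    ]
    best_of_n = sum(x["biased"] for x in picked) / n_prompts

    return {
        "w_bias": w_bias,
        "w_proxy": w_proxy,
        "base_bias_rate": base,
        "best_of_n_bias_rate": best_of_n,
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting the simulated annotators never look at the bias flag, yet the fitted reward model tends to assign it a positive weight because bias is correlated with the quality signal the labels encode. Selecting the highest-reward sample then raises the share of biased responses above the base rate, which is the amplification effect the abstract describes.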


List of related papers