Fugu-MT Paper Translation (Abstract): Towards Context-Invariant Safety Alignment for Large Language Models

Paper abstract: Towards Context-Invariant Safety Alignment for Large Language Models

arxiv url: http://arxiv.org/abs/2605.20994v1
Date: Wed, 20 May 2026 10:33:11 GMT
Status: Translation complete
System update date: 2026-05-21 19:19:56.621092
Title: Towards Context-Invariant Safety Alignment for Large Language Models
Authors: Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang
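Abstract summary: We introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR boosts in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.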
Reference score (independently computed attention score): 37.23800025875439
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
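The contrast the abstract draws between a symmetric invariance regularizer and an anchored one with a stop-gradient target can be illustrated with a toy sketch. This is not the paper's implementation: the squared-gap penalty, the treatment of "performance" as two directly optimizable scalars, and the names `symmetric_step`, `anchored_step`, and `run` are all assumptions made for illustration. In real training the two quantities would depend on model parameters, and the penalty would be an auxiliary term alongside a preference objective such as GRPO.

```python
# Toy illustration (assumed setup, not the paper's code): two scalar
# "performance" values are updated by gradient descent on the squared-gap
# penalty (anchor - open_ended) ** 2.


def symmetric_step(anchor, open_ended, lr=0.1):
    """One descent step where BOTH values receive gradient from the penalty."""
    gap = anchor - open_ended
    # d/d(anchor) = 2 * gap, d/d(open_ended) = -2 * gap
    return anchor - lr * 2 * gap, open_ended + lr * 2 * gap


def anchored_step(anchor, open_ended, lr=0.1):
    """Same penalty, but the anchor is a constant (stop-gradient) target."""
    gap = anchor - open_ended
    # The anchor gets no gradient; only the open-ended value moves toward it.
    return anchor, open_ended + lr * 2 * gap


def run(step, anchor=0.9, open_ended=0.5, n_steps=50):
    """Apply `step` repeatedly and return the final (anchor, open_ended) pair."""
    for _ in range(n_steps):
        anchor, open_ended = step(anchor, open_ended)
    return anchor, open_ended


if __name__ == "__main__":
    print("symmetric:", run(symmetric_step))
    print("anchored: ", run(anchored_step))
```

In this toy setting the symmetric update closes the gap by meeting in the middle, so the reliable anchor value falls from 0.9 to roughly 0.7. The anchored update leaves the anchor at 0.9 and moves only the open-ended value up toward it, which is the asymmetry the abstract describes.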
