Fugu-MT Paper Translation (Abstract): Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

Paper overview: Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

arxiv url: http://arxiv.org/abs/2605.11712v1
Date: Tue, 12 May 2026 08:02:34 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.689041
Title: Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
Authors: Wenhao Chen, Sirui Sun, Shengyuan Bai, Guojie Song
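Abstract summary: This work proposes the Stable Value Guidance Transformer (SVGT), which aligns large language models with human values. In experiments across multiple backbones and safety benchmarks, SVGT generally reduces harmful scores by over 70%.
Reference score (independently calculated attention score): 13.634463039790239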
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.
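The abstract describes value signals held in a separate value space and turned into latent Bridge Tokens that steer generation without altering the backbone's internal representations. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes the Bridge Tokens are prepended to the input embeddings as a prefix, and the names `make_bridge_tokens`, `guide_inputs`, `value_state`, and `proj`, along with all dimensions, are hypothetical.

```python
import numpy as np


def make_bridge_tokens(value_state, proj, num_tokens):
    """Project a value-space vector into `num_tokens` latent tokens.

    Assumption: a single linear projection stands in for the learnable
    mapping; the abstract does not specify the mapping's form.
    """
    d_model = proj.shape[1] // num_tokens
    return (value_state @ proj).reshape(num_tokens, d_model)


def guide_inputs(token_embeddings, bridge_tokens):
    """Prepend bridge tokens to the sequence (assumed prefix-style guidance).

    The original token embeddings are copied through unchanged.
    """
    return np.concatenate([bridge_tokens, token_embeddings], axis=0)


# Hypothetical sizes chosen only for illustration.
d_value, d_model, num_tokens, seq_len = 8, 16, 4, 5

rng = np.random.default_rng(0)
value_state = rng.normal(size=d_value)                    # stand-in for a value representation
proj = rng.normal(size=(d_value, num_tokens * d_model))   # stand-in for learnable weights
token_embeddings = rng.normal(size=(seq_len, d_model))    # stand-in for backbone input embeddings

bridge = make_bridge_tokens(value_state, proj, num_tokens)
guided = guide_inputs(token_embeddings, bridge)
```

In this sketch the guided sequence has `num_tokens + seq_len` rows, and the rows after the prefix are identical to the original embeddings, which mirrors the abstract's claim that guidance is added alongside the backbone rather than by editing its representations.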


List of related papers