Fugu-MT Paper Translation (Abstract): The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Paper summary: The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

arxiv url: http://arxiv.org/abs/2509.12934v2
Date: Thu, 25 Sep 2025 20:31:28 GMT
Status: Translation complete
System update date: 2025-09-29 18:47:02.694653
Title: The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Title (reference translation): The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Authors: Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
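Abstract summary: The framework trains a lightweight adapter that steers model behavior by modulating interpretable sparse features. The authors show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
Reference score (independently computed attention score): 1.7832672957068079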
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
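Abstract (reference translation): Common alignment methods induce opaque parameter changes, making it difficult to assess what the model actually learns. To address this, the paper introduces FSRL (Feature Steering with Reinforcement Learning). First, it shows that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. The framework is then applied to the task of preference optimization, followed by a causal analysis of the learned policy. The model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts such as honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, substantially reducing preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.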
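
The abstract describes a lightweight adapter that steers model behavior by modulating sparse features. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes a sparse autoencoder (SAE) with a ReLU encoder and a linear decoder, a hypothetical linear adapter that outputs one steering coefficient per feature, and steering by adding the decoded coefficients to the original activation. All names, sizes, and weights here are illustrative assumptions, and the reinforcement learning step that would train the adapter is omitted.

```python
import random

D_MODEL, N_FEATURES = 4, 8  # toy sizes, chosen only for illustration


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sae_encode(activation, w_enc, b_enc):
    """Sparse feature activations: ReLU(W_enc x + b_enc)."""
    return [max(0.0, v) for v in add(matvec(w_enc, activation), b_enc)]


def adapter(activation, w_adapter, b_adapter):
    """Hypothetical linear adapter: one steering coefficient per feature."""
    return add(matvec(w_adapter, activation), b_adapter)


def steer(activation, steering, w_dec):
    """Add the decoded steering vector to the original activation."""
    return add(activation, matvec(w_dec, steering))


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random stand-in weights; a real SAE would be trained."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w_enc = rand_matrix(N_FEATURES, D_MODEL)  # encoder: features x model dims
b_enc = [0.0] * N_FEATURES
w_dec = rand_matrix(D_MODEL, N_FEATURES)  # decoder: model dims x features
x = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

# Feature activations the SAE reads off the unsteered activation.
features = sae_encode(x, w_enc, b_enc)

# An adapter with all-zero weights leaves the activation untouched.
w_zero = [[0.0] * D_MODEL for _ in range(N_FEATURES)]
b_zero = [0.0] * N_FEATURES
unchanged = steer(x, adapter(x, w_zero, b_zero), w_dec)

# Pushing feature 3 by +0.5 moves the activation along that feature's
# decoder direction (column 3 of w_dec).
b_push = list(b_zero)
b_push[3] = 0.5
steered = steer(x, adapter(x, w_zero, b_push), w_dec)
shift = [s - o for s, o in zip(steered, x)]
```

In this sketch the shift in activation space is exactly the steering coefficient times the decoder direction of the chosen feature, which is what makes the intervention readable at the feature level: inspecting the adapter's output shows which features are being amplified or suppressed.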


Related papers