Fugu-MT Paper Translation (Abstract): SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

Paper overview: SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

arxiv url: http://arxiv.org/abs/2509.13450v1
Date: Tue, 16 Sep 2025 18:36:22 GMT
Status: Translation complete
System update date: 2025-09-18 18:41:50.612484
Title: SteeringControl: Holistic Evaluation of Alignment Steering in LLMs
Title (reference translation): Steering Control: Holistic Evaluation of Alignment Steering in LLMs
Authors: Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
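Abstract summary: We introduce SteeringControl, a benchmark for evaluating representation steering methods against core alignment objectives. To evaluate steering effectiveness and behavioral entanglement, we collect a dataset of safety-relevant primary and secondary behaviors. Results on Qwen-2.5-7B and Llama-3.1-8B show that steering performance depends on the specific combination of steering method, model, and targeted behavior.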
Reference score (independently computed attention measure): 42.189660766537536
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives--bias, harmful generation, and hallucination--and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
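Abstract (reference translation): We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives (bias, harmful generation, and hallucination). Prior alignment work often highlights truthfulness or reasoning ability to show the side effects of representation steering, but many unexplored tradeoffs are not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors and evaluate steering effectiveness and behavioral entanglement centered on five popular steering methods. To enable this, we build a modular steering framework based on unique components that serve as the building blocks of many existing methods. Results on Qwen-2.5-7B and Llama-3.1-8B show that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can also result from poor combinations of these three. The code is available at https://github.com/wang-research-lab/SteeringControl.git.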
