Fugu-MT paper translation (abstract): reward-lens: A Mechanistic Interpretability Library for Reward Models

Paper overview: reward-lens: A Mechanistic Interpretability Library for Reward Models

arxiv url: http://arxiv.org/abs/2604.26130v1
Date: Tue, 28 Apr 2026 21:38:10 GMT
Status: translation complete
Last updated in system: 2026-04-30 15:59:36.175881
Title: reward-lens: A Mechanistic Interpretability Library for Reward Models
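Title (reference translation): reward-lens: A Mechanistic Interpretability Library for Reward Models
Authors: Mohammed Suhail B Nadaf
Abstract summary: We present reward-lens, an open-source library that ports the mechanistic interpretability toolkit to reward models. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, and TopK SAE feature attribution. It is validated on two production reward models across ~695 RewardBench pairs.
Reference score (independently computed attention level): 0.0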
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $ρ= -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.
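Abstract (reference translation): Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit (logit lens, direct logit attribution, activation patching, sparse autoencoders) was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, which breaks each tool. The reward head's weight vector $w_r$ is the natural axis for every interpretability question. The reward-lens library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. The library is validated on two production reward models across ~695 RewardBench pairs. Linear attribution does not predict causal patching effects (mean Spearman $ρ= -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug.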
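
The abstract's two core ideas, reading intermediate activations along the reward head direction $w_r$ and comparing linear attribution against causal patching, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the model is a hypothetical random residual network with a scalar head and no final normalisation, and none of the names (`forward`, `spearman`, `lens`, `attribution`, `patch_effect`) come from the reward-lens library or its API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6

# Hypothetical toy reward model: residual layers h <- h + tanh(W_l h),
# followed by a scalar head r = w_r . h + b (no final norm, by assumption).
weights = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
w_r = rng.normal(size=d_model)
b = 0.1


def forward(x, patch_layer=None, patch_output=None):
    """Run the toy model; optionally overwrite one layer's output (activation patching)."""
    h = x.copy()
    streams, outputs = [h.copy()], []
    for i, W in enumerate(weights):
        out = np.tanh(W @ h)
        if i == patch_layer:
            out = patch_output
        h = h + out
        outputs.append(out)
        streams.append(h.copy())
    reward = float(w_r @ h + b)
    return reward, streams, outputs


def spearman(a, b_):
    """Spearman rank correlation (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b_)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


x_chosen, x_rejected = rng.normal(size=d_model), rng.normal(size=d_model)
r_chosen, streams, outs_chosen = forward(x_chosen)
r_rejected, _, outs_rejected = forward(x_rejected)

# Lens-style readout: project each intermediate residual stream onto w_r.
lens = np.array([w_r @ h + b for h in streams])

# Linear (observational) attribution: each layer's direct contribution along w_r,
# taken as the chosen-minus-rejected difference.
attribution = np.array([w_r @ (oc - orj) for oc, orj in zip(outs_chosen, outs_rejected)])

# Causal effect: patch the rejected run's layer output into the chosen run
# and measure how much the reward drops.
patch_effect = np.array([
    r_chosen - forward(x_chosen, patch_layer=i, patch_output=outs_rejected[i])[0]
    for i in range(n_layers)
])

rho = spearman(attribution, patch_effect)
print("lens readout per layer:", np.round(lens, 3))
print("Spearman(attribution, patch effect):", round(rho, 3))
```

In this toy, the two views agree exactly only at the last layer, where nothing downstream can react to the patch. For earlier layers the patched output propagates through later nonlinear layers, so the direct projection onto $w_r$ and the measured reward change can diverge, which is the kind of disagreement the abstract reports on real reward models.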


Related papers