Fugu-MT Paper Translation (Abstract): SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

Paper overview: SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

arxiv url: http://arxiv.org/abs/2511.07896v1
Date: Wed, 12 Nov 2025 01:27:00 GMT
Status: Translation complete
Last updated in system: 2025-11-12 20:17:03.530087
Title: SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
Title (reference translation): SparseRM: Lightweight Preference Modeling Using a Sparse Autoencoder
Authors: Dengcan Liu, Jiahao Li, Zheren Fu, Yi Tu, Jiajun Li, Zhendong Mao, Yongdong Zhang
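Abstract summary: Reward models (RMs) serve as proxies for human preference evaluation and guide model alignment. The paper proposes SparseRM, which uses a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations. SparseRM outperforms most mainstream RMs while using less than 1% of the trainable parameters.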
Reference score (independently computed attention measure): 54.31950189922548
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
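The three steps described in the abstract (decompose representations into directions, project onto them to obtain alignment scores, aggregate with a simple reward head) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the directions are random stand-ins for SAE-derived directions, the reward head is assumed to be a single linear layer, and the names `alignment_scores`, `reward`, `directions`, `w`, and `b` are hypothetical.

```python
import numpy as np


def alignment_scores(h, directions):
    """Project a representation h of shape (d,) onto k unit-normalized directions (k, d).

    Each entry measures how strongly h points along one direction.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ h


def reward(h, directions, w, b=0.0):
    """Assumed form of a 'simple reward head': a linear map over the alignment scores."""
    return float(w @ alignment_scores(h, directions) + b)


# Toy data: all values below are random placeholders, not real model outputs.
rng = np.random.default_rng(0)
d, k = 16, 4                          # representation size, number of directions
directions = rng.normal(size=(k, d))  # stand-in for SAE-derived preference directions
w = rng.normal(size=k)                # the only trainable weights in this sketch

h_chosen = rng.normal(size=d)         # stand-in representation of a preferred response
h_rejected = rng.normal(size=d)       # stand-in representation of a dispreferred response

r_chosen = reward(h_chosen, directions, w)
r_rejected = reward(h_rejected, directions, w)
print("reward margin (chosen - rejected):", r_chosen - r_rejected)
```

In this sketch the directions stay fixed and only `w` and `b` would be trained, which is one way to read the abstract's claim about the small number of trainable parameters; the actual training objective and head architecture are not specified on this page.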
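Abstract (reference translation): Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and for guiding model alignment. However, training reliable RMs under limited resources remains difficult because it depends on large-scale preference annotations and on the high cost of fine-tuning LLMs. To address this, the paper proposes SparseRM, which uses a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first uses the SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM outperforms most mainstream RMs while using less than 1% of the trainable parameters. It also integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.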

Related papers