Fugu-MT paper translation (abstract): Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Paper overview: Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

arxiv url: http://arxiv.org/abs/2603.20212v1
Date: Mon, 02 Mar 2026 15:48:54 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.890998
Title: Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models
Authors: Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu
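Abstract summary: This paper introduces Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. A single model integrates first-token prediction as a scalar score (fast thinking) with CoT-based judgment (slow thinking). F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%.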
Reference score (independently computed attention measure): 16.460841602259787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
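The abstract does not say which two confidence signals make up the dual-confidence activation mechanism, so the following is only a minimal sketch of confidence-gated routing under stated assumptions. It assumes a pairwise setting with hypothetical verdict tokens "A" and "B", and it uses two illustrative signals that are not taken from the paper: the probability mass the first token places on the verdict tokens, and the margin between them. The names `fast_slow_judge` and `slow_judge` and both thresholds are hypothetical.

```python
import math
from typing import Callable, Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def fast_slow_judge(
    first_token_logits: Dict[str, float],
    slow_judge: Callable[[], str],
    mass_threshold: float = 0.9,    # assumed value, not from the paper
    margin_threshold: float = 0.6,  # assumed value, not from the paper
) -> Dict[str, object]:
    """Return a verdict from the fast path, or fall back to the slow path.

    Fast path: read a scalar score off the first-token distribution.
    Slow path: call `slow_judge`, a stand-in for CoT-based judgment.
    """
    probs = softmax(first_token_logits)
    p_a = probs.get("A", 0.0)
    p_b = probs.get("B", 0.0)

    # Assumed signal 1: how much probability lands on verdict tokens at all.
    mass = p_a + p_b
    # Scalar score: P(A) renormalized over the two verdict tokens.
    score = p_a / mass if mass > 0 else 0.5
    # Assumed signal 2: how decisively the score favors one side, in [0, 1].
    margin = abs(2.0 * score - 1.0)

    if mass >= mass_threshold and margin >= margin_threshold:
        verdict = "A" if score >= 0.5 else "B"
        return {"verdict": verdict, "score": score, "path": "fast"}

    # Either signal is weak: spend tokens on slow thinking.
    return {"verdict": slow_judge(), "score": score, "path": "slow"}


if __name__ == "__main__":
    confident = fast_slow_judge({"A": 5.0, "B": 0.0, "other": -2.0}, lambda: "B")
    uncertain = fast_slow_judge({"A": 0.1, "B": 0.0, "other": -5.0}, lambda: "B")
    print(confident["path"], confident["verdict"])
    print(uncertain["path"], uncertain["verdict"])
```

In this sketch the slow path is invoked only when one of the two assumed signals falls below its threshold, which is the kind of selective activation that could reduce token consumption. The actual signals and thresholds used by F/S-RM may differ.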

Related papers