Fugu-MT Paper Translation (Abstract): Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards

Paper overview: Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards

arxiv url: http://arxiv.org/abs/2510.23083v1
Date: Mon, 27 Oct 2025 07:36:41 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.489002
Title: Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Title (reference translation): Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
Authors: Jan Niklas Groeneveld, Xi Qin, Alexander Schaefer, Yaad Oren
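Abstract summary: This paper investigates whether state-of-the-art small language models can be turned into usable reward models. The authors construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. Using the resulting critic, they achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
Reference score (independently computed attention measure): 40.23960862004138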
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
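Abstract (reference translation): Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. A decoder-only transformer model can be turned into a reward model by adding a regression layer and applying supervised fine-tuning. Although reflection capabilities are known to increase generally with model size, the authors investigate whether state-of-the-art small language models such as the Phi-4 family can be turned into usable reward models that blend process rewards and outcome rewards. To this end, they construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. They then train a value-head model to estimate the success probability of intermediate outputs. The evaluation shows that small LLMs can serve as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, they achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.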
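The abstract describes two mechanisms: a regression layer (value head) trained on correctness labels, and the use of the resulting critic to pick the best of several generated candidates. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a toy linear head over made-up fixed-size feature vectors instead of a fine-tuned Phi-4 model, and the names `ValueHead`, `sgd_step`, and `best_of_n` are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ValueHead:
    """Hypothetical linear regression layer over a fixed-size hidden state.

    In the paper the hidden state would come from a decoder-only LLM;
    here it is just a list of floats supplied by the caller (assumption).
    """

    def __init__(self, dim: int) -> None:
        self.w: List[float] = [0.0] * dim
        self.b: float = 0.0

    def predict(self, h: Sequence[float]) -> float:
        # Estimated probability that the (partial) solution ends up correct.
        logit = sum(wi * hi for wi, hi in zip(self.w, h)) + self.b
        return sigmoid(logit)

    def sgd_step(self, h: Sequence[float], label: float, lr: float = 0.5) -> None:
        # For binary cross-entropy, the gradient w.r.t. the logit is (p - label).
        err = self.predict(h) - label
        self.w = [wi - lr * err * hi for wi, hi in zip(self.w, h)]
        self.b -= lr * err


def best_of_n(candidates: Sequence[str],
              hidden_states: Sequence[Sequence[float]],
              head: ValueHead) -> str:
    # Score every candidate with the critic and return the highest-scoring one.
    scores = [head.predict(h) for h in hidden_states]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Toy supervised fine-tuning loop on two labeled examples (made-up data).
head = ValueHead(dim=2)
for _ in range(200):
    head.sgd_step([1.0, 0.0], 1.0)  # features of a "correct" sample
    head.sgd_step([0.0, 1.0], 0.0)  # features of an "incorrect" sample

chosen = best_of_n(["candidate_a", "candidate_b"],
                   [[0.0, 1.0], [1.0, 0.0]],
                   head)
```

Because the head scores hidden states rather than finished programs, the same `predict` call could in principle be applied to intermediate outputs (process reward) as well as to complete solutions (outcome reward), which is the blend the abstract refers to.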


Related papers