Fugu-MT Paper Translation (Summary): Confidence as a Reward: Transforming LLMs into Reward Models

Paper overview: Confidence as a Reward: Transforming LLMs into Reward Models

arxiv url: http://arxiv.org/abs/2510.13501v1
Date: Wed, 15 Oct 2025 12:51:47 GMT
Status: Translation complete
System update date: 2025-10-16 20:13:28.673393
Title: Confidence as a Reward: Transforming LLMs into Reward Models
Authors: He Du, Bowen Li, Chengxing Xie, Chang Gao, Kai Chen, Dacheng Tao
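Abstract summary: Confidence-as-a-Reward (CRew) is a training-free method that uses token-level confidence in the model's final answers as a proxy for reward. CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks. The paper also proposes CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals.
Reference score (independently computed attention metric): 54.98336080630691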
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models can significantly enhance the reasoning capabilities of large language models (LLMs), but they typically require extensive curated data and costly training. To mitigate these challenges, training-free approaches such as LLM-as-a-Judge leverage the intrinsic reasoning abilities of LLMs to evaluate responses, achieving promising results. Recent works have also indicated that model confidence can serve effectively as a reward metric, distinguishing between chain-of-thought (CoT) and non-CoT paths. However, the concept of using confidence as a reward has not been comprehensively studied. In this work, we systematically investigate Confidence-as-a-Reward (CRew), a simple yet powerful training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward, especially suitable for close-ended tasks. Through extensive experiments on mathematical reasoning tasks, we demonstrate that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks, and even surpasses most trained reward models. We further identify a strong correlation between CRew scores and the actual reasoning performance of the model. Additionally, we find that CRew can effectively filter high-quality training data. Building upon these insights, we propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals. Finetuning with CRew-DPO further enhances the model's judging capabilities and consistently outperforms existing self-training methods.
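The abstract describes two mechanisms: scoring a response by token-level confidence in its final answer, and building preference pairs from confidence combined with correctness. The following is a minimal sketch of both, not the authors' implementation. The abstract does not say how token probabilities are aggregated or how pairs are chosen, so the mean-probability aggregation, the pairing rule, and all function names here are assumptions made for illustration. Token log-probabilities for the final-answer span are assumed to have been obtained from the model beforehand.

```python
import math
from typing import List, Optional, Sequence, Tuple


def answer_confidence(answer_token_logprobs: Sequence[float]) -> float:
    """Confidence score for one response.

    ASSUMPTION: the score is the arithmetic mean of the token probabilities
    over the final-answer tokens. The abstract only says "token-level
    confidence"; the exact aggregation is not stated there.
    """
    if not answer_token_logprobs:
        raise ValueError("need at least one final-answer token")
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest confidence score."""
    scores = [answer_confidence(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# One sample: (response text, final-answer token log-probs, is_correct)
Sample = Tuple[str, Sequence[float], bool]


def build_preference_pair(samples: List[Sample]) -> Optional[Tuple[str, str]]:
    """Build one (chosen, rejected) pair from samples for a single problem.

    HYPOTHETICAL pairing rule: chosen is the most confident correct response,
    rejected is the least confident incorrect response. Returns None when the
    samples do not contain both a correct and an incorrect response.
    """
    correct = [s for s in samples if s[2]]
    incorrect = [s for s in samples if not s[2]]
    if not correct or not incorrect:
        return None
    chosen = max(correct, key=lambda s: answer_confidence(s[1]))
    rejected = min(incorrect, key=lambda s: answer_confidence(s[1]))
    return chosen[0], rejected[0]
```

Used as a reward, `best_of_n` selects among sampled responses without any trained reward model. The pairs returned by `build_preference_pair` could then be passed to a standard DPO training loop.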

Related papers