Fugu-MT Paper Translation (Abstract): From Faithfulness to Correctness: Generative Reward Models that Think Critically

Paper abstract: From Faithfulness to Correctness: Generative Reward Models that Think Critically

arxiv url: http://arxiv.org/abs/2509.25409v1
Date: Mon, 29 Sep 2025 19:06:56 GMT
Status: Translation complete
System update date: 2025-10-01 17:09:04.281224
Title: From Faithfulness to Correctness: Generative Reward Models that Think Critically
Authors: Qiyao Ma, Yunsheng Shi, Hongtao Tian, Chao Wang, Weiming Chang, Ting Yao
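Abstract summary: This paper proposes the Thinking-supervised Reward Model (TRM) to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness.
Reference score (independently computed attention score): 40.07140704454647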
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges due to the difficulty of verifying correctness. The nuanced and ambiguous nature of real-world knowledge makes it difficult to reliably evaluate correctness in these settings, necessitating further abilities that extend beyond mere logical consistency to encompass an understanding and assessment of both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge. Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization leads to significant gains in both answer correctness and usefulness.
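The abstract's sentence-level sequence (faithfulness, then reasoning, then correctness) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `SentenceJudgment`, `incorrect_sentences`, and `answer_reward` are hypothetical, the judgments are filled in by hand rather than produced by a model, and averaging correctness into one score is an assumed aggregation that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceJudgment:
    """One answer sentence, judged in the order the abstract describes.

    Hypothetical container: field names are illustrative assumptions.
    """
    sentence: str
    faithful: bool   # step 1: does the sentence align with the supporting documents?
    reasoning: str   # step 2: free-text reasoning (hand-written placeholder here)
    correct: bool    # step 3: sentence-level correctness verdict


def incorrect_sentences(judgments: List[SentenceJudgment]) -> List[str]:
    """Return the sentences whose correctness verdict is False."""
    return [j.sentence for j in judgments if not j.correct]


def answer_reward(judgments: List[SentenceJudgment]) -> float:
    """Assumed aggregation: fraction of sentences judged correct."""
    if not judgments:
        return 0.0
    return sum(j.correct for j in judgments) / len(judgments)


# Toy, hand-labelled example: faithfulness and correctness can disagree.
judgments = [
    SentenceJudgment(
        sentence="Sentence A",
        faithful=True,
        reasoning="Supported by the documents and consistent with known facts.",
        correct=True,
    ),
    SentenceJudgment(
        sentence="Sentence B",
        faithful=True,
        reasoning="Supported by the documents, but the documents are mistaken.",
        correct=False,
    ),
]

flagged = incorrect_sentences(judgments)
reward = answer_reward(judgments)
```

Keeping `faithful` and `correct` as separate fields reflects the distinction the abstract draws: a sentence can match the supporting documents and still be wrong, which is the case a faithfulness-only signal would miss.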

List of related papers