Fugu-MT Paper Translation (Abstract): Think Twice: Branch-and-Rethink Reasoning Reward Model

Paper overview: Think Twice: Branch-and-Rethink Reasoning Reward Model

arxiv url: http://arxiv.org/abs/2510.23596v1
Date: Mon, 27 Oct 2025 17:58:07 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.661984
Title: Think Twice: Branch-and-Rethink Reasoning Reward Model
Title (reference translation): Think Twice: Branch-and-Rethink Reasoning Reward Model
Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
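Abstract summary: This paper introduces branch-and-rethink (BR-RM), a two-turn RM. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors.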
Reference score (independently computed attention score): 32.70732791642558
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
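Abstract (reference translation): Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute. In contrast, most reward models (RMs) compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion. This paper introduces branch-and-rethink (BR-RM), a two-turn RM. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. The model is trained with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, which keeps the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experiments show that the model achieves state-of-the-art performance on three reward modeling benchmarks across diverse domains. The code and the model will be released soon.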
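
The training signal described in the abstract (a binary outcome reward gated by a strict format check, fed into GRPO-style updates) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the tag names `<turn1>`, `<turn2>`, and `<verdict>`, the A/B verdict labels, the choice to give malformed traces a reward of 0, and the helper names are hypothetical and are not taken from the paper. The advantage computation is the usual group-relative normalization associated with GRPO, not necessarily the authors' exact variant.

```python
import re
from statistics import mean, pstdev

# Hypothetical trace layout: the tag names below are illustrative assumptions,
# not the format used by the paper.
TRACE_PATTERN = re.compile(
    r"\A\s*<turn1>(?P<branch>.+?)</turn1>\s*"
    r"<turn2>(?P<rethink>.+?)</turn2>\s*"
    r"<verdict>(?P<verdict>[AB])</verdict>\s*\Z",
    re.DOTALL,
)


def binary_outcome_reward(trace: str, preferred: str) -> float:
    """Return 1.0 only if the trace is well formed and picks the preferred response."""
    match = TRACE_PATTERN.match(trace)
    if match is None:
        # Strict format check: assumed here to zero out the reward.
        return 0.0
    return 1.0 if match.group("verdict") == preferred else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (mean 0, roughly unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<turn1>factuality, safety</turn1>"
        "<turn2>Response A cites a wrong date.</turn2><verdict>B</verdict>",
        "<turn1>style</turn1><turn2>A reads better.</turn2><verdict>A</verdict>",
        "verdict: B",  # malformed: fails the format check
    ]
    rewards = [binary_outcome_reward(t, preferred="B") for t in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a group of sampled traces for one preference pair is scored, and each trace's advantage is its reward relative to the group average, so a correct, well-formed trace is reinforced only when other samples in the same group did worse.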

Related papers