Fugu-MT Paper Translation (Abstract): Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Paper overview: Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

arxiv url: http://arxiv.org/abs/2605.06785v2
Date: Tue, 12 May 2026 15:55:34 GMT
Status: Translation complete
Last updated in system: 2026-05-13 18:21:06.812664
Title: Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Title (reference translation): Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Authors: Rachel Ma, Dylan Hadfield-Menell, Kristjan Greenewald
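Abstract summary: Inference-time scaling methods rely on Process Reward Models (PRMs). This work proposes, to the authors' knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels.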
Reference score (independently computed attention metric): 6.379494871147752
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
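Abstract (reference translation): Inference-time scaling methods rely on Process Reward Models (PRMs). We propose the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over the success probabilities estimated by the PRM. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which are integrated into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, the method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, the method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.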

Related papers