Fugu-MT Paper Translation (Abstract): Calibration-Aware Policy Optimization for Reasoning LLMs

Paper summary: Calibration-Aware Policy Optimization for Reasoning LLMs

arxiv url: http://arxiv.org/abs/2604.12632v1
Date: Tue, 14 Apr 2026 12:03:17 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.424628
Title: Calibration-Aware Policy Optimization for Reasoning LLMs
Title (reference translation): Policy optimization that accounts for LLM calibration
Authors: Ziqi Wang, Xingzhou Lou, Meiqi Wu, Zhengqi Wen, Junge Zhang
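Abstract summary: Group Relative Policy Optimization (GRPO) enhances reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. We then propose Calibration-Aware Policy Optimization (CAPO), which adopts a theoretically consistent logistic AUC surrogate loss and enables uncertainty-aware advantage estimation.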
Reference score (independently computed attention score): 27.83665401246145
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits a regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B significantly improves calibration by up to 15% while achieving accuracy comparable to or better than GRPO, and further boosts accuracy on downstream inference-time scaling tasks by up to 5%. Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.
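Abstract (reference translation): Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence. Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. This degradation in GRPO-style algorithms stems from uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration, so accuracy improves at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits a regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B improves calibration by up to 15% while achieving accuracy comparable to GRPO, and boosts accuracy on downstream inference-time scaling tasks by up to 5%. Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.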
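The quantities named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes confidence is the negative log-perplexity of a response, uses the standard pairwise definition of AUC, and uses the textbook pairwise logistic loss as a stand-in for the "logistic AUC surrogate". The function names, the sample values, and the abstention threshold are all hypothetical.

```python
import math


def _split(conf, correct):
    """Split confidences into those of correct and incorrect responses."""
    pos = [c for c, y in zip(conf, correct) if y]
    neg = [c for c, y in zip(conf, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    return pos, neg


def pairwise_auc(conf, correct):
    """Fraction of (correct, incorrect) pairs in which the correct
    response has the higher confidence; ties count as half a win."""
    pos, neg = _split(conf, correct)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic_auc_surrogate(conf, correct):
    """Mean of log(1 + exp(-(s_pos - s_neg))) over all pairs.
    Assumed textbook form, not necessarily the loss used by CAPO."""
    pos, neg = _split(conf, correct)
    losses = [_softplus(-(p - n)) for p in pos for n in neg]
    return sum(losses) / len(losses)


def precision_coverage(conf, correct, threshold):
    """Answer only when confidence >= threshold, abstain otherwise.
    Returns (precision on answered items, fraction answered)."""
    answered = [y for c, y in zip(conf, correct) if c >= threshold]
    coverage = len(answered) / len(conf)
    precision = sum(answered) / len(answered) if answered else float("nan")
    return precision, coverage


# Hypothetical per-response log-perplexities and correctness labels.
log_ppl = [0.2, 0.9, 0.4, 0.3]
correct = [True, False, True, False]
conf = [-x for x in log_ppl]  # lower perplexity means higher confidence

print(pairwise_auc(conf, correct))  # 0.75
print(logistic_auc_surrogate(conf, correct))
print(precision_coverage(conf, correct, threshold=-0.25))
```

In this toy data one incorrect response (log-perplexity 0.3) is more confident than one correct response (0.4), which is the kind of mis-ranking the abstract calls overconfidence; it is why the AUC falls below 1. Raising the abstention threshold trades coverage for precision, which is the trade-off the last sentence of the abstract refers to.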


Related papers