Process Supervision of Confidence Margin for Calibrated LLM Reasoning
Abstract Overview
This paper introduces Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware reinforcement learning framework for reasoning language models. Rather than rewarding only final-answer correctness, RLCM adds process-level supervision over intermediate reasoning prefixes. It does so with a lightweight confidence probe and a margin-based reward that, within a single trajectory, encourages higher confidence on prefixes that are more likely to lead to a correct answer than on less promising ones. The method is built on GRPO and trained on the GRPO-LEAD dataset using the DeepSeek-R1-distilled Qwen-7B model. Across mathematical, coding, science, and logic benchmarks, the authors report improved calibration (lower expected calibration error, ECE, and lower PCE) while largely maintaining reasoning accuracy compared to outcome-only RL baselines. The paper further demonstrates that the resulting calibrated confidence supports downstream applications, including conformal risk control with reduced token usage and confidence-weighted answer aggregation.
Novelty
The main novelty is a margin-based process reward for calibration during RL: rather than matching confidence to correctness at each step via pointwise score matching, the method ranks intermediate reasoning states by encouraging a wider confidence gap between more-solvable and less-solvable prefixes within the same trajectory. This relative objective is combined with a jointly trained lightweight MLP probe that estimates correctness probability from intermediate hidden states during on-policy training, providing process-level calibration supervision without backpropagating gradients into the policy model.
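To make the contrast with pointwise score matching concrete, the following is a minimal sketch of what a pairwise margin reward over the prefixes of one trajectory could look like. The summary does not give the paper's exact reward, so the function name `margin_reward`, the clipping rule, and the `margin` parameter are illustrative assumptions rather than the authors' formulation.

```python
from itertools import combinations


def margin_reward(confidences, solvability, margin=0.1):
    """Hypothetical pairwise margin reward for one trajectory (an assumption,
    not the paper's exact definition).

    confidences[i]: probe confidence for prefix i, in [0, 1].
    solvability[i]: estimated chance that prefix i leads to a correct answer.

    For every pair of prefixes with different solvability, the score is the
    confidence gap in favour of the more-solvable prefix, clipped to
    [-margin, margin]. The reward is the mean score over such pairs.
    """
    if len(confidences) != len(solvability):
        raise ValueError("confidences and solvability must have equal length")

    scores = []
    for i, j in combinations(range(len(confidences)), 2):
        if solvability[i] == solvability[j]:
            continue  # no ordering to enforce between equally solvable prefixes
        hi, lo = (i, j) if solvability[i] > solvability[j] else (j, i)
        gap = confidences[hi] - confidences[lo]
        scores.append(max(-margin, min(margin, gap)))

    # With no comparable pairs there is nothing to rank, so return zero.
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, only the ordering and size of the confidence gap matter, not the absolute confidence values, which is the sense in which the objective is relative rather than pointwise.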
Results
On in-domain math benchmarks (MATH-500, AMC, OlympiadBench, AIME24/25) and out-of-domain tasks (LiveCodeBench, LogiQA, GPQA), RLCM achieves the lowest overall ECE (0.091) and PCE (0.036) among all compared methods while maintaining competitive accuracy (0.618 overall vs. GRPO's 0.621). Ablation studies show that process-level margin supervision outperforms both final-step-only and Brier-style variants in calibration. Downstream, RLCM's confidence enables more token-efficient conformal risk control and stronger confidence-weighted aggregation (0.748 average accuracy vs. GRPO's 0.723 and RLCR's 0.675).
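For readers unfamiliar with the headline metric, ECE is commonly computed by binning predictions by confidence and averaging the gap between mean confidence and accuracy in each bin, weighted by bin size. The sketch below uses equal-width bins; the number of bins and the bin-edge convention are assumptions, and the paper's evaluation code may differ. PCE is not defined in this summary, so it is not sketched.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |mean conf - accuracy|.

    confidences: floats in [0, 1].
    correct: booleans (or 0/1) marking whether each answer was right.
    n_bins: number of equal-width bins (10 is a common but arbitrary choice).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

For example, four answers each given confidence 0.9, of which three are correct, fall into a single bin with mean confidence 0.9 and accuracy 0.75, so the ECE is 0.15.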
Key Points
- RLCM supervises confidence throughout the reasoning trajectory using intermediate-budget prefixes and a margin-based reward that encourages a confidence gap between more-solvable and less-solvable prefixes, rather than relying only on final-answer rewards or pointwise score matching.
- On both in-domain math and out-of-domain (coding, science, logic) benchmarks, RLCM substantially reduces overconfidence and expected calibration error compared to GRPO, RLCR, and C²GSPG, while preserving competitive reasoning accuracy.
- The calibrated confidence estimates yield practical downstream benefits: more token-efficient conformal risk control for early exiting and stronger confidence-weighted answer aggregation compared to the baselines.
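The confidence-weighted aggregation mentioned above can be illustrated with a simple weighted vote, in which each sampled answer contributes its confidence instead of a single count. This is a generic sketch; the function name `confidence_weighted_vote` is hypothetical, and the paper's exact aggregation rule is not specified in this summary.

```python
from collections import defaultdict


def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest total confidence.

    answers: final answers from several sampled reasoning traces.
    confidences: one confidence in [0, 1] per answer.
    Ties are broken in favour of the answer that appeared first.
    """
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have equal length")
    if not answers:
        raise ValueError("at least one answer is required")

    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


# Two low-confidence votes for "42" are outweighed by one confident vote for "7",
# whereas plain majority voting would pick "42".
print(confidence_weighted_vote(["42", "42", "7"], [0.2, 0.2, 0.9]))
```

A scheme like this only helps when confidence tracks correctness, which is why better calibration can translate into stronger aggregated accuracy.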