FuguReport

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

Authors: Liaoyaqi Wang, Chunsheng Zuo, William Jurayj, Benjamin Van Durme, Anqi Liu
Affiliations: The Johns Hopkins University
Categories: Method / Reinforcement Learning / Calibration-aware RL framework; Application / LLM Reasoning / Improving LLM inference ability; Evaluation / Model Calibration / Evaluating confidence and reliability
License: CC BY 4.0

Abstract Overview

This paper introduces Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware reinforcement learning framework for reasoning language models. Rather than rewarding only final-answer correctness, RLCM adds process-level supervision over intermediate reasoning prefixes. It uses a lightweight confidence probe and a margin-based reward that encourages higher confidence on prefixes that are more likely to lead to correct answers than on less promising prefixes within the same trajectory. The method is built on GRPO and trained on the GRPO-LEAD dataset using the DeepSeek-R1-distilled Qwen-7B model. Across mathematical, coding, science, and logic benchmarks, the authors report improved calibration (lower expected calibration error, ECE, and lower PCE) while largely maintaining reasoning accuracy relative to outcome-only RL baselines. The paper further demonstrates that the resulting calibrated confidence supports downstream applications, including conformal risk control with reduced token usage and confidence-weighted answer aggregation.
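The margin idea above can be illustrated with a small sketch. This is not the paper's reward: the hinge form, the `margin` value, and the per-prefix `solvability` scores (for example, an estimate of how often a prefix leads to a correct answer) are all assumptions introduced here for illustration, since the summary does not give the exact formula.

```python
from itertools import combinations


def margin_reward(confidences, solvability, margin=0.1):
    """Hypothetical hinge-style margin reward over prefixes of one trajectory.

    confidences[i]: probe confidence on prefix i (assumed input).
    solvability[i]: assumed estimate of how promising prefix i is.
    Returns 0.0 when every more-solvable prefix beats every less-solvable
    one by at least `margin`, and a negative value otherwise.
    """
    if len(confidences) != len(solvability):
        raise ValueError("confidences and solvability must have equal length")

    penalties = []
    for i, j in combinations(range(len(confidences)), 2):
        if solvability[i] == solvability[j]:
            continue  # no ordering to enforce between equally solvable prefixes
        hi, lo = (i, j) if solvability[i] > solvability[j] else (j, i)
        gap = confidences[hi] - confidences[lo]
        penalties.append(max(0.0, margin - gap))  # penalize gaps below the margin

    if not penalties:
        return 0.0
    return 0.0 - sum(penalties) / len(penalties)
```

Because the reward depends only on confidence gaps between prefixes, it is a relative (ranking) objective rather than a pointwise one: shifting every confidence by the same constant leaves it unchanged.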

Novelty

The main novelty is a margin-based process reward for calibration during RL: rather than matching confidence to correctness at each step via pointwise score matching, the method ranks intermediate reasoning states by encouraging a wider confidence gap between more-solvable and less-solvable prefixes within the same trajectory. This relative objective is combined with a jointly trained lightweight MLP probe that estimates correctness probability from intermediate hidden states during on-policy training, providing process-level calibration supervision without backpropagating gradients into the policy model.
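As a rough illustration of the probe described above, the following sketch trains a one-hidden-layer MLP to map hidden-state vectors to a correctness probability. The architecture, sizes, loss (binary cross-entropy), and the class name `ConfidenceProbe` are assumptions, not details from the paper. The hidden states are passed in as plain arrays and only the probe's own parameters are updated, which mirrors the statement that no gradients flow back into the policy model.

```python
import numpy as np


class ConfidenceProbe:
    """Hypothetical lightweight MLP probe: hidden state -> P(correct)."""

    def __init__(self, hidden_dim, probe_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (hidden_dim, probe_dim))
        self.b1 = np.zeros(probe_dim)
        self.w2 = rng.normal(0.0, 0.1, probe_dim)
        self.b2 = 0.0

    def _forward(self, h):
        z = np.maximum(0.0, h @ self.W1 + self.b1)  # ReLU hidden layer
        logit = z @ self.w2 + self.b2
        return 1.0 / (1.0 + np.exp(-logit)), z

    def predict(self, h):
        return self._forward(h)[0]

    def train_step(self, h, y, lr=0.1):
        """One gradient step on mean binary cross-entropy.

        `h` is treated as a constant input: only probe parameters change.
        Returns the loss measured before the update.
        """
        p, z = self._forward(h)
        p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
        loss = -np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe))

        d_logit = (p - y) / len(y)                 # dLoss/dlogit for sigmoid + BCE
        d_w2 = z.T @ d_logit
        d_b2 = d_logit.sum()
        d_z = np.outer(d_logit, self.w2) * (z > 0)  # backprop through ReLU
        d_W1 = h.T @ d_z
        d_b1 = d_z.sum(axis=0)

        self.W1 -= lr * d_W1
        self.b1 -= lr * d_b1
        self.w2 -= lr * d_w2
        self.b2 -= lr * d_b2
        return float(loss)
```

In an automatic-differentiation framework, the analogous design choice would be to detach the hidden states before feeding them to the probe, so that the probe's loss cannot alter the policy's representations.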

Results

On in-domain math benchmarks (MATH-500, AMC, OlympiadBench, AIME24/25) and out-of-domain tasks (LiveCodeBench, LogiQA, GPQA), RLCM achieves the lowest overall ECE (0.091) and PCE (0.036) among all compared methods while maintaining competitive accuracy (0.618 overall vs. GRPO's 0.621). Ablation studies show that process-level margin supervision outperforms both final-step-only and Brier-style variants in calibration. Downstream, RLCM's confidence enables more token-efficient conformal risk control and stronger confidence-weighted aggregation (0.748 average accuracy vs. GRPO's 0.723 and RLCR's 0.675).
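For readers unfamiliar with the ECE figures quoted above, a minimal sketch of the standard equal-width-bin estimator follows. The bin count of 10 is an assumption, and the paper's exact binning may differ. PCE is not defined in this summary, so it is not sketched here.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    if conf.shape != corr.shape or conf.size == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to a bin index in 0..n_bins-1 using the interior edges.
    bin_idx = np.digitize(conf, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in bin b
    return float(ece)
```

As a worked case, four answers each given confidence 0.9, of which three are correct, fall into one bin with accuracy 0.75 and mean confidence 0.9, so the estimate is 0.15.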

Key Points

  1. RLCM supervises confidence throughout the reasoning trajectory using intermediate-budget prefixes and a margin-based reward that encourages a confidence gap between more-solvable and less-solvable prefixes, rather than relying only on final-answer rewards or pointwise score matching.
  2. On both in-domain math and out-of-domain (coding, science, logic) benchmarks, RLCM substantially reduces overconfidence and expected calibration error compared to GRPO, RLCR, and C²GSPG, while preserving competitive reasoning accuracy.
  3. The calibrated confidence estimates yield practical downstream benefits: more token-efficient conformal risk control for early exiting and stronger confidence-weighted answer aggregation compared to the baselines.
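The confidence-weighted aggregation mentioned in the last point can be sketched as a weighted vote over sampled answers. The rule shown here (sum the confidences per distinct answer and pick the largest total) is one common choice and an assumption; the summary does not state the paper's exact aggregation rule, and the function name is hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(answers, confidences):
    """Return the answer whose samples have the largest total confidence."""
    if len(answers) != len(confidences) or not answers:
        raise ValueError("answers and confidences must be non-empty and of equal length")

    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    # On ties, the answer that appeared first among the tied ones is returned.
    return max(totals, key=totals.get)
```

With answers `["42", "41", "41"]` and confidences `[0.9, 0.3, 0.4]`, the totals are 0.9 for "42" and 0.7 for "41", so the weighted vote selects "42" even though a plain majority vote would select "41". This only helps when the confidences are well calibrated, which is the connection to the calibration results above.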

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.