Balancing Classification and Calibration Performance in Decision-Making LLMs via Calibration Aware Reinforcement Learning
- URL: http://arxiv.org/abs/2601.13284v1
- Date: Mon, 19 Jan 2026 18:31:31 GMT
- Title: Balancing Classification and Calibration Performance in Decision-Making LLMs via Calibration Aware Reinforcement Learning
- Authors: Duygu Nur Yaldiz, Evangelia Spiliopoulou, Zheng Qi, Siddharth Varia, Srikanth Doss, Nikolaos Pappas,
- Abstract summary: Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. We show that while RLVR improves task performance, it produces extremely overconfident models. We propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities.
- Score: 10.123352394689134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in decision-making tasks, where not only accuracy but also reliable confidence estimates are essential. Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. In this work, we conduct a systematic study of calibration in two widely used fine-tuning paradigms: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We show that while RLVR improves task performance, it produces extremely overconfident models, whereas SFT yields substantially better calibration, even under distribution shift, though with smaller performance gains. Through targeted experiments, we diagnose RLVR's failure, showing that decision tokens act merely as extraction steps for decisions already reached in the reasoning trace and do not carry confidence information, which prevents reinforcement learning from surfacing calibrated alternatives. Based on this insight, we propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities. Our method preserves RLVR's accuracy level while mitigating overconfidence, reducing ECE scores by up to 9 points.
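The expected calibration error (ECE) cited in the abstract can be sketched with a generic binned implementation; this is an illustration of the metric, not code from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average, over equal-width confidence bins, of the
    absolute gap between mean confidence and accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()          # accuracy in this bin
        conf = confidences[mask].mean()     # mean confidence in this bin
        ece += mask.mean() * abs(acc - conf)
    return ece

# A maximally overconfident model: always reports confidence 1.0,
# but is correct only 70% of the time -> ECE = 0.30.
conf = np.ones(10)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(round(expected_calibration_error(conf, corr), 2))  # → 0.3
```

A "9 point" ECE reduction on this scale means shrinking that gap by 0.09.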
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z) - ConfTuner: Training Large Language Models to Express Their Confidence Verbally [58.63318088243125]
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare. LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence".
arXiv Detail & Related papers (2025-08-26T09:25:32Z) - Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty [59.97939500426759]
This paper describes RLCR, an approach to training reasoning models that jointly improves accuracy and confidence estimation. We show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy. We also demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration.
arXiv Detail & Related papers (2025-07-22T17:56:01Z) - Know What You Don't Know: Uncertainty Calibration of Process Reward Models [6.091078936502421]
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms. PRMs tend to overestimate the probability that a partial reasoning step will lead to a correct final answer. We present a calibration approach that adjusts PRM outputs to better align with true success probabilities.
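Post-hoc adjustment of model outputs toward true success probabilities is commonly done with temperature scaling; the sketch below illustrates that generic recipe under a simple grid search, and is not the specific method of the paper above:

```python
import numpy as np

def fit_temperature(logits, labels, temps=np.linspace(0.5, 5.0, 91)):
    """Grid-search the temperature T that minimizes the negative
    log-likelihood of softmax(logits / T) on held-out data."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)

    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return min(temps, key=nll)

# An overconfident binary scorer: a logit margin of 5 for its chosen
# class, yet it is right only 80% of the time. Fitting finds T > 1,
# which softens predicted probabilities toward the true success rate.
logits = np.tile([5.0, 0.0], (10, 1))
labels = np.array([0] * 8 + [1] * 2)
T = fit_temperature(logits, labels)
```

Dividing logits by a fitted T > 1 leaves the argmax (and hence accuracy) unchanged while flattening the probability distribution, which is why temperature scaling is a standard baseline for overconfidence.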
arXiv Detail & Related papers (2025-06-11T02:39:26Z) - Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration [34.52946891778497]
Deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. They often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare. Recent research has started to improve model calibration from the perspective of the classifier.
arXiv Detail & Related papers (2025-04-14T09:09:01Z) - Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models [15.638622371475853]
Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities. We investigate the impact of self-improvement on confidence estimation and calibration.
arXiv Detail & Related papers (2025-04-03T04:39:54Z) - CARIL: Confidence-Aware Regression in Imitation Learning for Autonomous Driving [0.0]
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving. Traditional approaches rely on either regression-based models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. We introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning.
arXiv Detail & Related papers (2025-03-02T08:19:02Z) - Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks. Overconfidence, a misalignment between predicted confidence and true correctness, poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual question-answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - A Close Look into the Calibration of Pre-trained Language Models [56.998539510508515]
Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty.
We study the dynamic change in PLMs' calibration performance in training.
We extend two recently proposed learnable methods that directly collect data to train models toward reasonable confidence estimates.
arXiv Detail & Related papers (2022-10-31T21:31:07Z) - Improving the Performance of Robust Control through Event-Triggered Learning [74.57758188038375]
We propose an event-triggered learning algorithm that decides when to learn in the face of uncertainty in the LQR problem.
We demonstrate improved performance over a robust controller baseline in a numerical example.
arXiv Detail & Related papers (2022-07-28T17:36:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.