Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A
- URL: http://arxiv.org/abs/2402.13213v4
- Date: Thu, 07 Aug 2025 01:33:12 GMT
- Title: Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A
- Authors: Benjamin Plaut, Nguyen X. Khanh, Tu Trinh
- Abstract summary: We study 15 large language models (LLMs) fine-tuned for chat. We find that their maximum softmax probabilities (MSPs) are consistently miscalibrated on multiple-choice Q&A. We show that the hypothesis that wrong answers carry smaller MSPs than correct answers holds for models which perform well on the underlying Q&A task.
- Score: 0.6144680854063939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study 15 large language models (LLMs) fine-tuned for chat and find that their maximum softmax probabilities (MSPs) are consistently miscalibrated on multiple-choice Q&A. However, those MSPs might still encode useful uncertainty information. Specifically, we hypothesized that wrong answers would be associated with smaller MSPs compared to correct answers. Via rigorous statistical testing, we show that this hypothesis holds for models which perform well on the underlying Q&A task. We also find a strong directional correlation between Q&A accuracy and MSP correctness prediction, while finding no correlation between Q&A accuracy and calibration error. This suggests that within the current fine-tuning paradigm, we can expect correctness prediction but not calibration to improve as LLM capabilities progress. To demonstrate the utility of correctness prediction, we show that when models have the option to abstain, performance can be improved by selectively abstaining based on the MSP of the initial model response, using only a small amount of labeled data to choose the MSP threshold.
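The abstract's abstention recipe is concrete enough to sketch. Below is a minimal Python illustration of choosing an MSP threshold on a small labeled set and then abstaining below it; the function names, the candidate-threshold search, and the reward model (+1 for a correct answer, a penalty for a wrong one, 0 for abstaining) are our assumptions for illustration, not necessarily the paper's exact protocol.

```python
import numpy as np

def choose_threshold(msps, correct, penalty=1.0):
    """Pick the MSP threshold that maximizes reward on a small labeled set.

    Reward model (an assumption, not necessarily the paper's exact one):
    +1 for answering correctly, -penalty for answering incorrectly,
    0 for abstaining.
    """
    msps = np.asarray(msps, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Candidate thresholds: 0 (never abstain) plus every observed MSP.
    candidates = np.concatenate(([0.0], np.unique(msps)))
    best_t, best_reward = 0.0, -np.inf
    for t in candidates:
        answered = msps >= t
        reward = np.sum(answered & correct) - penalty * np.sum(answered & ~correct)
        if reward > best_reward:
            best_t, best_reward = t, reward
    return best_t

def answer_or_abstain(msp, threshold):
    """Answer only when the max softmax probability clears the threshold."""
    return "answer" if msp >= threshold else "abstain"

# Toy usage on synthetic calibration data.
rng = np.random.default_rng(0)
msps = rng.uniform(0.3, 1.0, size=200)
correct = rng.random(200) < msps          # higher MSP -> more likely correct
t = choose_threshold(msps, correct, penalty=1.0)
print(f"chosen threshold: {t:.3f}")
print(answer_or_abstain(0.9, t), answer_or_abstain(0.4, t))
```

Because only a single scalar threshold is fit, a small labeled set suffices, which matches the abstract's "small amount of labeled data" framing.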
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation (a pass@$k$ sketch follows this entry).
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
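The entry above mentions pass@$k$ prediction without defining it. As a reference point, here is the standard unbiased pass@$k$ estimator from the code-generation literature, together with one simple way a calibrated per-query confidence could be turned into a pass@$k$ forecast; the independent-attempts assumption in `predict_pass_at_k` is ours, not necessarily the paper's.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts, c of them correct.

    pass@k = 1 - C(n-c, k) / C(n, k), computed in a numerically stable
    product form (as in Chen et al., 2021).
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def predict_pass_at_k(p, k):
    """Forecast pass@k from a calibrated per-query confidence p, assuming
    independent attempts (our simplifying assumption)."""
    return 1.0 - (1.0 - p) ** k

print(pass_at_k(n=20, c=5, k=3))        # empirical estimate from samples
print(predict_pass_at_k(p=0.25, k=3))   # forecast from calibrated confidence
```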
- Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering [15.39457034915546]
We present a formal problem formulation for Reliable Audio-Visual Question Answering ($\mathcal{R}$-AVQA), where we prefer abstention over answering incorrectly. We propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance the performance of $\mathcal{R}$-AVQA.
arXiv Detail & Related papers (2026-02-04T08:35:33Z)
- Statistical Guarantees of Correctness Coverage for Medical Multiple-Choice Question Answering [0.0]
Large language models (LLMs) are increasingly deployed in real-world question-answering (QA) applications.
LLMs have been shown to generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks.
In this work, we adapt the conformal prediction (CP) framework to medical multiple-choice question-answering (MCQA) tasks for the first time (a generic split-conformal sketch follows this entry).
arXiv Detail & Related papers (2025-03-07T15:22:10Z)
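The CP framework referenced above is not spelled out in this summary, so the following is a generic split-conformal sketch for multiple-choice QA under our own choice of nonconformity score (one minus the softmax probability of the true option); it illustrates the coverage mechanism, not the paper's specific method.

```python
import numpy as np

def conformal_quantile(cal_probs_true, alpha=0.1):
    """Split conformal: threshold nonconformity scores on a calibration set.

    cal_probs_true: model probability assigned to the *true* option for each
    calibration question. Score s_i = 1 - p_i. The quantile level uses the
    finite-sample correction ceil((n+1)(1-alpha))/n.
    """
    scores = 1.0 - np.asarray(cal_probs_true, dtype=float)
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(option_probs, qhat):
    """All options whose nonconformity score falls within the threshold."""
    return [i for i, p in enumerate(option_probs) if 1.0 - p <= qhat]

# Toy usage: the true option is covered with probability >= 1 - alpha.
rng = np.random.default_rng(0)
cal = rng.beta(5, 2, size=500)            # probs on the true option
qhat = conformal_quantile(cal, alpha=0.1)
print(prediction_set([0.55, 0.25, 0.15, 0.05], qhat))
```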
- Calibrated Large Language Models for Binary Question Answering [49.1574468325115]
A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct.
We propose a novel approach that utilizes the inductive Venn-Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels (a minimal IVAP sketch follows this entry).
arXiv Detail & Related papers (2024-07-01T09:31:03Z)
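The inductive Venn-Abers predictor from the entry above admits a compact didactic implementation: refit isotonic regression twice, once with the test point provisionally labeled 0 and once labeled 1, and combine the two fitted values. This per-query refitting sketch is ours; efficient IVAP implementations precompute the isotonic fits.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ivap(cal_scores, cal_labels, test_score):
    """Inductive Venn-Abers predictor for one test example.

    Returns the interval [p0, p1] and the standard merged probability
    p1 / (1 - p0 + p1).
    """
    probs = []
    for hypothetical_label in (0, 1):
        # Add the test point with each hypothetical label and refit.
        xs = np.append(cal_scores, test_score)
        ys = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(xs, ys)
        probs.append(float(iso.predict([test_score])[0]))
    p0, p1 = probs
    return p0, p1, p1 / (1.0 - p0 + p1)

# Toy usage with synthetic, deliberately miscalibrated confidence scores.
rng = np.random.default_rng(0)
scores = rng.uniform(size=300)
labels = (rng.random(300) < scores**2).astype(int)
print(ivap(scores, labels, test_score=0.8))
```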
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question using its relevant paragraph and the question-answer pairs that occurred earlier in the multi-turn conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversational Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting miscalibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting miscalibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE); a plain plug-in version is sketched after this entry.
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
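For reference, the plain (biased) binned plug-in estimator of the $\ell_p$-ECE that T-Cal improves upon looks like this; the debiasing step itself is omitted, and the bin count and $p = 2$ default are conventional choices, not prescribed by the summary.

```python
import numpy as np

def plugin_ece(confidences, correct, n_bins=15, p=2):
    """Plain binned plug-in estimator of the l_p Expected Calibration Error.

    This is the naive estimator; T-Cal builds on a *debiased* variant,
    which is not reproduced here.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Gap between accuracy and mean confidence within the bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap**p
    return ece ** (1.0 / p)

# Toy usage: an overconfident predictor shows a large l2-ECE.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)
corr = rng.random(2000) < conf - 0.15    # accuracy runs 15 points below confidence
print(f"l2-ECE ~ {plugin_ece(conf, corr):.3f}")
```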
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness (temperature scaling, a standard method in this family, is sketched after this entry).
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
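The entry above examines post-hoc calibration without naming a specific method here; temperature scaling is the standard baseline in this family, so we sketch it below as an assumed example, fitting a single temperature T by minimizing held-out negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T by minimizing NLL on held-out examples.

    logits: (n, n_classes) raw scores; labels: (n,) integer class indices.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(log_t):
        t = np.exp(log_t)                        # parameterize so T > 0
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)     # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    res = minimize_scalar(nll, bounds=(-3, 3), method="bounded")
    return float(np.exp(res.x))

# Toy usage: artificially overconfident logits yield a fitted T > 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=1000)
logits = rng.normal(size=(1000, 4))
logits[np.arange(1000), labels] += 2.0           # signal on the true class
logits *= 3.0                                    # artificial overconfidence
print(f"fitted T ~ {fit_temperature(logits, labels):.2f}")
```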
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely (a minimal calibrator is sketched after this entry).
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
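The calibrator described in the entry above can be approximated in a few lines: fit a small classifier to predict whether the QA model's answer was correct, then abstain when the predicted correctness probability falls below a target. The feature choice (MSP and question length) and the logistic-regression model are our stand-ins; the original work used a stronger learned calibrator with richer features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibrator(features, was_correct):
    """Fit a small model that predicts whether the QA model's answer is right."""
    clf = LogisticRegression()
    clf.fit(features, was_correct)
    return clf

def selective_answer(clf, feature_row, min_p_correct=0.8):
    """Abstain when the calibrator's predicted correctness is too low."""
    p = clf.predict_proba([feature_row])[0, 1]
    return ("answer", p) if p >= min_p_correct else ("abstain", p)

# Toy usage with synthetic (MSP, question length) features.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(size=500), rng.integers(5, 60, size=500)])
y = (rng.random(500) < X[:, 0]).astype(int)     # correctness tracks MSP
clf = train_calibrator(X, y)
print(selective_answer(clf, [0.9, 12]))
print(selective_answer(clf, [0.3, 45]))
```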