Related papers: Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets

Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets

URL: http://arxiv.org/abs/2512.16030v1
Date: Wed, 17 Dec 2025 23:23:06 GMT
Title: Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets
Authors: Lukas Nel,
Abstract summary: A well-calibrated model should express confidence that matches its actual accuracy -- when it claims 80% confidence, it should be correct 80% of the time.<n>We introduce textbfKalshiBench, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange.<n>We evaluate five frontier models -- Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2 -- and find textbfsystematic overconfidence across all models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A well-calibrated model should express confidence that matches its actual accuracy -- when it claims 80\% confidence, it should be correct 80\% of the time. While large language models (LLMs) have achieved remarkable performance across diverse tasks, their epistemic calibration remains poorly understood. We introduce \textbf{KalshiBench}, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange, with verifiable real-world outcomes occurring after model training cutoffs. Unlike traditional benchmarks measuring accuracy on static knowledge, KalshiBench evaluates whether models can appropriately quantify uncertainty about genuinely unknown future events. We evaluate five frontier models -- Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2 -- and find \textbf{systematic overconfidence across all models}. Even the best-calibrated model (Claude Opus 4.5, ECE=0.120) shows substantial calibration errors, while reasoning-enhanced models like GPT-5.2-XHigh exhibit \emph{worse} calibration (ECE=0.395) despite comparable accuracy. Critically, only one model achieves a positive Brier Skill Score, indicating most models perform worse than simply predicting base rates. Our findings suggest that scaling and enhanced reasoning do not automatically confer calibration benefits, highlighting epistemic calibration as a distinct capability requiring targeted development.

Related papers

Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification.<n>Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z)
Calibrating the Confidence of Large Language Models by Eliciting Fidelity [52.47397325111864]
Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. Post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate. We propose a plug-and-play method to estimate the confidence of language models.
arXiv Detail & Related papers (2024-04-03T11:36:12Z)
Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. Results show that consistency-based calibration methods outperform existing post-hoc approaches. We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models. We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process. Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
Calibration in Deep Learning: A Survey of the State-of-the-Art [3.8305839404070223]
Calibrating deep neural models plays an important role in building reliable, robust AI systems in safety-critical applications.<n>Recent work has shown that modern neural networks that possess high predictive capability are poorly calibrated and produce unreliable model predictions.
arXiv Detail & Related papers (2023-08-02T15:28:10Z)
Proximity-Informed Calibration for Deep Neural Networks [49.330703634912915]
ProCal is a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity. We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings.
arXiv Detail & Related papers (2023-06-07T16:40:51Z)
Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization. We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
Calibration Meets Explanation: A Simple and Effective Approach for Model Confidence Estimates [21.017890579840145]
We propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions. We conduct extensive experiments on six datasets with two popular pre-trained language models. Our findings highlight that model explanations can help calibrate posterior estimates.
arXiv Detail & Related papers (2022-11-06T06:17:21Z)
Revisiting Calibration for Question Answering [16.54743762235555]
We argue that the traditional evaluation of calibration does not reflect usefulness of the model confidence. We propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct predictions.
arXiv Detail & Related papers (2022-05-25T05:49:56Z)
T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. We propose T-Cal, a minimax test for calibration based on a de-biased plug-in estimator of the $ell$-Expected Error (ECE)
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
Calibrated and Sharp Uncertainties in Deep Learning via Density Estimation [10.209143402485406]
This paper argues that calibration is important in practice and is easy to maintain.<n>We introduce a simple training procedure based on recalibration that yields calibrated models without sacrificing overall performance.
arXiv Detail & Related papers (2021-12-14T06:19:05Z)
How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.