Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
- URL: http://arxiv.org/abs/2512.05998v1
- Date: Mon, 01 Dec 2025 19:04:25 GMT
- Title: Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
- Authors: Michael Todasco
- Abstract summary: We generated 100 math and logic questions with verifiable answers. Three Predictor models then forecasted, for each question-baseline pair, whether the baseline would answer correctly. Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy. "Whale" bets of 40,000+ coins were correct ~99% of the time, while small bets (under 1,000 coins) showed only ~74% accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly used to evaluate other models, yet these judgments typically lack any representation of confidence. This pilot study tests whether framing an evaluation task as a betting game (a fictional prediction market with its own LLM currency) improves forecasting accuracy and surfaces calibrated confidence signals. We generated 100 math and logic questions with verifiable answers. Six Baseline models (three current-generation, three prior-generation) answered all items. Three Predictor models then forecasted, for each question-baseline pair, whether the baseline would answer correctly. Each predictor completed matched runs in two conditions: Control (simple correct/incorrect predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin under even odds, starting from a 1,000,000 LLMCoin bankroll). Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p = .089, d = 0.86) and significantly faster learning across rounds (12.0 vs. 2.9 percentage-point improvement from Round 1 to Round 4, p = .011). Most notably, stake size tracked confidence. "Whale" bets of 40,000+ coins were correct ~99% of the time, while small bets (<1,000 coins) showed only ~74% accuracy. The key finding is not that fictional money makes models smarter; accuracy gains were modest and did not reach statistical significance (p = .089) in this pilot. Rather, the betting mechanic created a legible confidence signal absent from binary yes/no outputs. This suggests that simple financial framing may help transform LLMs into risk-aware forecasters, making their internal beliefs visible and usable. The protocol offers a foundation for future work on meta-evaluation systems and on what may become LLM-to-LLM prediction markets.
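The wagering mechanic described in the abstract is concrete enough to sketch in code. The Python below is a minimal illustration, not the authors' implementation: the `Prediction` record, the `predict_with_wager` callable, and the pair identifiers are hypothetical stand-ins, while the 1-100,000 LLMCoin wager range, even odds, the 1,000,000-coin starting bankroll, and the small-bet / whale-bet bins are taken from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    will_be_correct: bool  # the predictor's yes/no forecast for the baseline model
    wager: int             # stake in LLMCoin; clamped below to the paper's 1-100,000 range


def run_incentive_condition(
    pairs: List[Tuple[str, bool]],                  # (question-baseline pair id, baseline answered correctly)
    predict_with_wager: Callable[[str], Prediction],
    starting_bankroll: int = 1_000_000,
) -> dict:
    """Score one Incentive-condition run: even-odds wagers on question-baseline pairs."""
    bankroll = starting_bankroll
    records = []  # (stake, hit) per prediction
    for pair_id, baseline_was_correct in pairs:
        pred = predict_with_wager(pair_id)
        stake = max(1, min(pred.wager, 100_000))     # enforce the 1-100,000 LLMCoin bounds
        hit = pred.will_be_correct == baseline_was_correct
        bankroll += stake if hit else -stake         # even odds: win or lose exactly the stake
        records.append((stake, hit))

    def bin_accuracy(lo: int, hi: int) -> float:
        hits = [h for s, h in records if lo <= s < hi]
        return sum(hits) / len(hits) if hits else float("nan")

    return {
        "overall_accuracy": sum(h for _, h in records) / len(records),
        "small_bet_accuracy": bin_accuracy(1, 1_000),         # paper reports ~74% for bets under 1,000
        "whale_bet_accuracy": bin_accuracy(40_000, 100_001),  # paper reports ~99% for bets of 40,000+
        "final_bankroll": bankroll,
    }
```

The design choice the paper leans on is visible in this sketch: because each wager pays out at even odds, a predictor that wants to grow its bankroll is pushed to size its stake to its confidence, which is what makes the per-bin accuracies readable as a calibration signal rather than a plain yes/no output.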
Related papers
- CATTO: Balancing Preferences and Confidence in Language Models [4.678970068275123]
Large language models (LLMs) often make accurate next-token predictions, but their confidence in these predictions can be poorly calibrated. We introduce a predictive calibration-aware objective that aligns predicted confidence with empirical prediction correctness. We also introduce Confidence@k, a test-time scaling mechanism leveraging calibrated token probabilities for Bayes-optimal selection of output tokens.
arXiv Detail & Related papers (2026-01-30T15:43:38Z)
- The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification [74.64864354503204]
We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring. We evaluate the ability of LLMs to assess time series forecast quality. We present three experiments on both synthetic and real-world forecasting data.
arXiv Detail & Related papers (2025-12-12T21:59:53Z)
- LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning [15.597220136913258]
LYNX is an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. We train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks.
arXiv Detail & Related papers (2025-12-05T00:04:42Z)
- Outcome-based Reinforcement Learning to Predict the Future [1.4313866885019229]
We show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10%.
arXiv Detail & Related papers (2025-05-23T14:56:07Z)
- Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
- Mind the Gap: A Causal Perspective on Bias Amplification in Prediction & Decision-Making [58.06306331390586]
We introduce the notion of a margin complement, which measures how much a prediction score $S$ changes due to a thresholding operation.
We show that under suitable causal assumptions, the influences of $X$ on the prediction score $S$ are equal to the influences of $X$ on the true outcome $Y$.
arXiv Detail & Related papers (2024-05-24T11:22:19Z)
- Best of Many in Both Worlds: Online Resource Allocation with Predictions under Unknown Arrival Model [16.466711636334587]
Online decision-makers often obtain predictions on future variables, such as arrivals, demands, and so on.
Prediction accuracy is unknown to decision-makers a priori, hence blindly following the predictions can be harmful.
We develop algorithms that utilize predictions in a manner that is robust to the unknown prediction accuracy.
arXiv Detail & Related papers (2024-02-21T04:57:32Z)
- Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback [91.22679548111127]
A trustworthy real-world prediction system should produce well-calibrated confidence scores.
We show that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities.
arXiv Detail & Related papers (2023-05-24T10:12:33Z)
- Machine learning for sports betting: should model selection be based on accuracy or calibration? [0.0]
We train models on NBA data over several seasons and run betting experiments on a single season.
We show that using calibration, rather than accuracy, as the basis for model selection leads to greater returns.
arXiv Detail & Related papers (2023-03-10T16:22:38Z)
- Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets.
We observe that trustworthiness predictors trained with prior-art loss functions are prone to viewing both correct and incorrect predictions as trustworthy.
We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z)
- Stock Price Prediction Under Anomalous Circumstances [81.37657557441649]
This paper aims to capture the movement pattern of stock prices under anomalous circumstances.
We train ARIMA and LSTM models at the single-stock level, industry level, and general market level.
Based on 100 companies' stock prices in the period of 2016 to 2020, the models achieve an average prediction accuracy of 98%.
arXiv Detail & Related papers (2021-09-14T18:50:38Z)