LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena
- URL: http://arxiv.org/abs/2510.17638v1
- Date: Mon, 20 Oct 2025 15:20:05 GMT
- Title: LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena
- Authors: Qingchuan Yang, Simon Mahns, Sida Li, Anri Gu, Jibang Wu, Haifeng Xu
- Abstract summary: Large language models (LLMs) are trained on Internet-scale data to forecast future events. This paper systematically investigates such predictive intelligence of LLMs. We uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates this predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence, and promising market returns. However, we also uncover key bottlenecks to achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets as resolution nears.
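The abstract's mention of "small calibration errors" refers to how closely a forecaster's stated probabilities match the empirical frequency of the events it predicts. As a hypothetical illustration (not code from the paper), expected calibration error (ECE) for binary forecasts can be sketched as follows; the function name and binning scheme are assumptions for this example:

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by stated probability, then take the weighted
    average gap between each bin's mean confidence and the empirical
    frequency of the event within that bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probs == 1.0 are included.
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of forecasts
    return ece
```

A perfectly calibrated forecaster (e.g., events it assigns 20% actually occur 20% of the time) scores an ECE of 0; the score grows toward 1 as stated confidence and observed frequency diverge.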
Related papers
- The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification [74.64864354503204]
We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring. We evaluate the ability of LLMs to assess time series forecast quality. We present three experiments on both synthetic and real-world forecasting data.
arXiv Detail & Related papers (2025-12-12T21:59:53Z) - How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes [5.848712585343904]
This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand. Our benchmark combines two datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends.
arXiv Detail & Related papers (2025-10-27T14:08:27Z) - Predicting Language Models' Success at Zero-Shot Probabilistic Prediction [23.802154124780376]
We investigate the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics. We find that LLMs' performance is highly variable, both on tasks within the same dataset and across different datasets. We construct metrics to predict LLMs' performance at the task level, aiming to distinguish between tasks where LLMs may perform well and where they are likely unsuitable.
arXiv Detail & Related papers (2025-09-18T18:57:05Z) - FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [92.7392863957204]
FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools.
arXiv Detail & Related papers (2025-08-16T08:54:08Z) - Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z) - On the Performance of LLMs for Real Estate Appraisal [5.812129569528997]
This study examines how Large Language Models (LLMs) can democratize access to real estate insights by generating competitive and interpretable house price estimates. We evaluate leading LLMs on diverse international housing datasets, comparing zero-shot, few-shot, market report-enhanced, and hybrid prompting techniques. Our results show that LLMs effectively leverage hedonic variables, such as property size and amenities, to produce meaningful estimates.
arXiv Detail & Related papers (2025-06-13T14:14:40Z) - Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [54.38054999271322]
We show that large language models (LLMs) don't update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the normative Bayesian model. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
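The "normative Bayesian model" referenced in this abstract is the textbook posterior update that an ideal reasoner would apply. As a hypothetical sketch (the specific conjugate setting is an assumption for illustration, not taken from the paper), a Beta-binomial update looks like this:

```python
from fractions import Fraction

def beta_binomial_update(alpha, beta, successes, failures):
    """Update a Beta(alpha, beta) prior over an event probability with
    observed successes/failures; return the posterior mean, i.e. the
    normative probability a Bayesian reasoner should now assign."""
    post_alpha = alpha + successes
    post_beta = beta + failures
    return Fraction(post_alpha, post_alpha + post_beta)
```

For example, starting from a uniform Beta(1, 1) prior and observing 7 successes in 10 trials yields a posterior mean of 8/12 ≈ 0.67; a model that systematically deviates from such updates is, in the paper's sense, failing to reason in a Bayesian manner.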
arXiv Detail & Related papers (2025-03-21T20:13:04Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Predictive Prompt Analysis [18.90591503793723]
Large Language Models (LLMs) are machine learning models that have seen widespread adoption due to their capability of handling previously difficult tasks. We argue it would be useful to perform 'predictive prompt analysis', in which an automated technique would perform a quick analysis of a prompt. We present Syntactic Prevalence Analyzer (SPA), a predictive prompt analysis approach based on sparse autoencoders (SAEs).
arXiv Detail & Related papers (2025-01-31T04:34:43Z) - Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing [2.936331223824117]
The use of Large Language Models (LLMs) for automated text annotation in social media posts has garnered significant interest.
We analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts.
A significant finding of our study is that the explicitness of the text expressing a stance plays a critical role in how faithfully LLMs' stance judgments match humans'.
arXiv Detail & Related papers (2024-06-11T17:26:07Z) - Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models [51.3422222472898]
We document the capability of large language models (LLMs) like ChatGPT to predict stock price movements using news headlines.
We develop a theoretical model incorporating information capacity constraints, underreaction, limits-to-arbitrage, and LLMs.
arXiv Detail & Related papers (2023-04-15T19:22:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.