How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
- URL: http://arxiv.org/abs/2510.23358v1
- Date: Mon, 27 Oct 2025 14:08:27 GMT
- Title: How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
- Authors: Sheri Osborn, Rohit Valecha, H. Raghav Rao, Dan Sass, Anthony Rios
- Abstract summary: This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand. Our benchmark combines two datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends.
- Score: 5.848712585343904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
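The pipeline the abstract outlines has three recoverable ingredients: a temporal split that keeps post-cutoff data out of the prompt, alternative prompting styles (task-scaffolded vs. persona-driven), and a quantitative accuracy score. The sketch below illustrates one plausible shape for that pipeline; it is not the authors' released code, and the dataset fields, prompt wording, and choice of mean absolute error are all assumptions.

```python
# Minimal sketch of a leakage-safe LLM forecasting task, under the
# assumptions stated above (field names and prompt text are hypothetical).
from dataclasses import dataclass


@dataclass
class ForecastTask:
    sector: str
    history: list[float]  # job-postings index values, oldest to newest
    horizon: int          # number of future months to predict
    target: list[float]   # held-out ground truth, never placed in any prompt


def make_task(series: list[float], sector: str,
              cutoff: int, horizon: int) -> ForecastTask:
    """Split strictly at `cutoff`: only pre-cutoff values can reach the model."""
    return ForecastTask(
        sector=sector,
        history=series[:cutoff],
        horizon=horizon,
        target=series[cutoff:cutoff + horizon],
    )


def task_scaffolded_prompt(task: ForecastTask) -> str:
    """Structured, step-by-step instructions (one reading of 'task-scaffolded')."""
    hist = ", ".join(f"{v:.1f}" for v in task.history[-24:])  # last 24 months
    return (
        f"Monthly job-postings index for the {task.sector} sector: {hist}.\n"
        "Step 1: Describe the recent trend. Step 2: Note any seasonality.\n"
        f"Step 3: Output exactly {task.horizon} comma-separated forecasts."
    )


def persona_prompt(task: ForecastTask) -> str:
    """Role-based framing (one reading of 'persona-driven')."""
    hist = ", ".join(f"{v:.1f}" for v in task.history[-24:])
    return (
        "You are a labor economist tracking AI-exposed occupations.\n"
        f"Given this {task.sector} job-postings index: {hist}, forecast the "
        f"next {task.horizon} values as comma-separated numbers."
    )


def mae(pred: list[float], truth: list[float]) -> float:
    """Mean absolute error, one plausible quantitative-accuracy metric."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


# Usage: build a task, render a prompt, send it to any LLM, parse the
# numeric reply, and score it against task.target with mae().
series = [100.0, 101.2, 99.8, 103.5, 104.1, 102.9, 105.0, 106.3,
          104.8, 107.1, 108.0, 106.5]
task = make_task(series, sector="information", cutoff=9, horizon=3)
print(task_scaffolded_prompt(task))
print(mae([107.0, 107.5, 106.8], task.target))  # dummy forecast vs. truth
```

Because the split is an index into a single ordered series, anything the model sees is strictly earlier than anything it is scored on, which is the leakage guarantee the abstract emphasizes; a hybrid prompt in this scheme would simply prepend the persona line to the scaffolded steps.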
Related papers
- Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs [21.656551146954587]
Large Language Models (LLMs) offer a path to automation. We introduce a novel, structured dataset from 190 corporate reports. Our results reveal a clear performance gap between qualitative and quantitative tasks.
arXiv Detail & Related papers (2025-12-30T15:28:03Z) - Benchmarking LLM Agents for Wealth-Management Workflows [0.0]
This dissertation extends TheAgentCompany with a finance-focused environment. It investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically.
arXiv Detail & Related papers (2025-12-01T21:56:21Z) - Can Online GenAI Discussion Serve as Bellwether for Labor Market Shifts? [62.386835769570006]
This paper examines whether online discussions about Large Language Models can function as early indicators of labor market shifts. We employ four distinct analytical approaches to identify the domains and timeframes in which public discourse serves as a leading signal for employment changes. Our findings reveal that discussion intensity predicts employment changes 1-7 months in advance across multiple indicators, including job postings, net hiring rates, tenure patterns, and unemployment duration.
arXiv Detail & Related papers (2025-11-20T04:18:25Z) - LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena [25.304644327116975]
Large language models (LLMs) are trained on Internet-scale data and are increasingly used to forecast future events. This paper systematically investigates this predictive intelligence and uncovers key bottlenecks to achieving superior forecasting performance via LLM-as-a-Prophet.
arXiv Detail & Related papers (2025-10-20T15:20:05Z) - FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [92.7392863957204]
FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools.
arXiv Detail & Related papers (2025-08-16T08:54:08Z) - Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z) - Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth [21.672923905771576]
The use of large language models (LLMs) by crowdsourcing workers poses a challenge to datasets intended to reflect human input. We propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion.
arXiv Detail & Related papers (2025-06-08T04:38:39Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)