The Strategic Foresight of LLMs: Evidence from a Fully Prospective Venture Tournament
- URL: http://arxiv.org/abs/2602.01684v1
- Date: Mon, 02 Feb 2026 05:52:16 GMT
- Title: The Strategic Foresight of LLMs: Evidence from a Fully Prospective Venture Tournament
- Authors: Felipe A. Csaszar, Aticus Peterson, Daniel Wilde
- Abstract summary: We benchmarked forecasts against 346 experienced managers recruited via Prolific and three MBA-trained investors working under monitored conditions. The results are striking: human evaluators achieved rank correlations with actual outcomes between 0.04 and 0.45, while several frontier LLMs exceeded 0.60, with the best (Gemini 2.5 Pro) reaching 0.74. Neither wisdom-of-the-crowd ensembles nor human-AI hybrid teams outperformed the best standalone model.
- Score: 0.19116784879310025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can artificial intelligence outperform humans at strategic foresight -- the capacity to form accurate judgments about uncertain, high-stakes outcomes before they unfold? We address this question through a fully prospective prediction tournament using live Kickstarter crowdfunding projects. Thirty U.S.-based technology ventures, launched after the training cutoffs of all models studied, were evaluated while fundraising remained in progress and outcomes were unknown. A diverse suite of frontier and open-weight large language models (LLMs) completed 870 pairwise comparisons, producing complete rankings of predicted fundraising success. We benchmarked these forecasts against 346 experienced managers recruited via Prolific and three MBA-trained investors working under monitored conditions. The results are striking: human evaluators achieved rank correlations with actual outcomes between 0.04 and 0.45, while several frontier LLMs exceeded 0.60, with the best (Gemini 2.5 Pro) reaching 0.74 -- correctly ordering nearly four of every five venture pairs. These differences persist across multiple performance metrics and robustness checks. Neither wisdom-of-the-crowd ensembles nor human-AI hybrid teams outperformed the best standalone model.
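To make the tournament mechanics concrete, below is a minimal Python sketch (not the authors' code) of how pairwise venture comparisons can be aggregated into a full ranking and scored with a rank correlation against realized fundraising outcomes. The venture IDs, dollar amounts, and the `llm_prefers` stand-in are invented for illustration; the paper's actual prompting and aggregation procedure may differ.

```python
# Minimal sketch: turning pairwise venture comparisons into a full ranking
# and scoring it against realized outcomes. All data below are placeholders.
from itertools import combinations
from scipy.stats import spearmanr

ventures = ["A", "B", "C", "D", "E"]            # placeholder venture IDs

def llm_prefers(v1, v2):
    """Stand-in for an LLM judgment: return the venture predicted to raise more.
    In the real tournament this would be a model call comparing two live projects."""
    return max(v1, v2)                          # dummy rule for illustration only

# 1. Run every pairwise comparison and count wins (a simple win-count score).
wins = {v: 0 for v in ventures}
for v1, v2 in combinations(ventures, 2):
    wins[llm_prefers(v1, v2)] += 1

# 2. Predicted ranking: ventures ordered by number of pairwise wins.
predicted_order = sorted(ventures, key=wins.get, reverse=True)

# 3. Score against realized fundraising outcomes (placeholder USD amounts).
actual_raised = {"A": 12_000, "B": 85_000, "C": 4_500, "D": 230_000, "E": 51_000}
actual_order = sorted(ventures, key=actual_raised.get, reverse=True)

predicted_rank = [predicted_order.index(v) for v in ventures]
actual_rank = [actual_order.index(v) for v in ventures]

rho, _ = spearmanr(predicted_rank, actual_rank)  # rank correlation, the paper's headline metric
print(f"Spearman rho = {rho:.2f}")
```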
Related papers
- EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge [8.50639201265868]
We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points.
arXiv Detail & Related papers (2026-01-14T04:26:43Z) - Scheming Ability in LLM-to-LLM Strategic Interactions [4.873362301533824]
Large language model (LLM) agents are deployed autonomously in diverse contexts. We investigate the ability and propensity of frontier LLM agents to scheme through two game-theoretic frameworks. The study tests four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b).
arXiv Detail & Related papers (2025-10-11T04:42:29Z) - The AI Productivity Index (APEX) [4.122962658725304]
We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%).
arXiv Detail & Related papers (2025-09-30T03:26:17Z) - Creativity Benchmark: A benchmark for marketing creativity for large language models [0.509780930114934]
Creativity Benchmark is an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas).
arXiv Detail & Related papers (2025-09-05T04:44:29Z) - Holistic Evaluation of Multimodal LLMs on Spatial Intelligence [81.2547965083228]
We propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. We conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) it still falls short of human performance significantly across a broad spectrum of SI tasks.
arXiv Detail & Related papers (2025-08-18T17:55:17Z) - Evaluating LLMs on Real-World Forecasting Against Expert Forecasters [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters.
arXiv Detail & Related papers (2025-07-06T22:26:59Z) - SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat [76.48873047003943]
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. In each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking based reputation system (a minimal Elo-update sketch appears after this list). The peer-evaluated combat results then become preference pairs in which the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration.
arXiv Detail & Related papers (2025-06-05T07:51:23Z) - Predicting Empirical AI Research Outcomes with Language Models [27.148683265085012]
Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. We build the first benchmark for this task and compare LMs with human experts. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing. We develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% vs. 48.
arXiv Detail & Related papers (2025-06-01T02:46:31Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
GAMA($\gamma$)-Bench is a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance. Our results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z) - Retrospective on the 2021 BASALT Competition on Learning from Human Feedback [92.37243979045817]
The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks.
Rather than mandating the use of LfHF techniques, we described four tasks in natural language to be accomplished in the video game Minecraft.
Teams developed a diverse range of LfHF algorithms across a variety of possible human feedback types.
arXiv Detail & Related papers (2022-04-14T17:24:54Z) - Imitation Learning by Estimating Expertise of Demonstrators [92.20185160311036]
We show that unsupervised learning over demonstrator expertise can lead to a consistent boost in the performance of imitation learning algorithms.
We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators.
We illustrate our findings on real-robotic continuous control tasks from Robomimic and discrete environments such as MiniGrid and chess.
arXiv Detail & Related papers (2022-02-02T21:23:19Z) - Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [134.15174177472807]
We introduce adversarial training into self-supervision, to provide general-purpose robust pre-trained models for the first time.
We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins.
arXiv Detail & Related papers (2020-03-28T18:28:33Z)
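As noted in the SPARTA ALIGNMENT entry above, that method aggregates peer evaluations through an adapted Elo-ranking based reputation system. The sketch below shows only the standard Elo update such a system builds on; the K-factor, starting ratings, and names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a standard Elo update, the kind of reputation mechanism
# the SPARTA ALIGNMENT summary refers to. Constants here are illustrative.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one duel."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins a peer-evaluated duel.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
print(ratings)  # model_a gains exactly the points model_b loses
```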