The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?
- URL: http://arxiv.org/abs/2512.22625v1
- Date: Sat, 27 Dec 2025 15:45:21 GMT
- Title: The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?
- Authors: Paul Schneider, Amalie Schramm
- Abstract summary: This study investigates whether allowing LLMs to review each other's forecasts before updating can improve accuracy in large language models. Using 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament, accuracy was assessed across four scenarios.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Structured deliberation has been found to improve the performance of human forecasters. This study investigates whether a similar intervention, i.e. allowing LLMs to review each other's forecasts before updating, can improve accuracy in large language models (GPT-5, Claude Sonnet 4.5, Gemini Pro 2.5). Using 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament, accuracy was assessed across four scenarios: (1) diverse models with distributed information, (2) diverse models with shared information, (3) homogeneous models with distributed information, and (4) homogeneous models with shared information. Results show that the intervention significantly improves accuracy in scenario (2), reducing Log Loss by 0.020 or about 4 percent in relative terms (p = 0.017). However, when homogeneous groups (three instances of the same model) engaged in the same process, no benefit was observed. Unexpectedly, providing LLMs with additional contextual information did not improve forecast accuracy, limiting our ability to study information pooling as a mechanism. Our findings suggest that deliberation may be a viable strategy for improving LLM forecasting.
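The abstract scores forecasts with Log Loss, a 0.020 reduction of which corresponds to the reported ~4 percent relative improvement. As a minimal sketch (function name and clipping constant are illustrative, not from the paper), mean binary log loss over resolved questions can be computed as:

```python
import math

def log_loss(probs, outcomes):
    """Mean binary log loss over forecasts (lower is better).

    probs: forecast probabilities of "yes" in [0, 1]
    outcomes: resolved outcomes, 1 for "yes", 0 for "no"
    """
    assert len(probs) == len(outcomes)
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Clip to avoid log(0) on maximally overconfident forecasts.
        p = min(max(p, 1e-6), 1 - 1e-6)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# A sharper correct forecast scores strictly better than 50/50:
print(log_loss([0.8], [1]) < log_loss([0.5], [1]))  # True
```

Because the metric is an average of per-question penalties, a 0.020 absolute reduction is directly comparable across the four scenarios as long as they share the same question set.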
Related papers
- A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies [24.732452865928053]
Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions. This study utilizes a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to evaluate the performance of 11 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-02-05T18:53:17Z) - Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification. Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z) - The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification [74.64864354503204]
We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring. We evaluate the ability of LLMs to assess time series forecast quality. We present three experiments, spanning both synthetic and real-world forecasting data.
arXiv Detail & Related papers (2025-12-12T21:59:53Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - Analyzing the Role of Context in Forecasting with Large Language Models [17.021220773165016]
We first introduce a novel dataset of over 600 binary forecasting questions, augmented with related news articles and their concise question-related summaries. We then explore the impact of input prompts with varying levels of context on forecasting performance. The results indicate that incorporating news articles significantly improves performance, while using few-shot examples leads to a decline in accuracy.
arXiv Detail & Related papers (2025-01-11T10:11:19Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM generates plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Large Language Models are Zero-Shot Next Location Predictors [4.315451628809687]
Large Language Models (LLMs) have shown good generalization and reasoning capabilities.
LLMs can obtain accuracies up to 36.2%, a significant improvement of almost 640% when compared to other models specifically designed for human mobility.
arXiv Detail & Related papers (2024-05-31T16:07:33Z) - Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs).
We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of human forecasters from a three-month forecasting tournament.
We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
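Several of the papers above pool forecasts from a crowd of models or humans (e.g. "Wisdom of the Silicon Crowd" aggregates twelve LLMs and references the median human prediction). As an illustration only, and not necessarily any of these papers' exact aggregation method, one simple and robust per-question aggregation is the median of the individual probabilities:

```python
import statistics

def aggregate_forecasts(model_probs):
    """Combine one question's probability forecasts from several models.

    The median is robust to a single outlier or overconfident model,
    unlike the arithmetic mean.
    """
    return statistics.median(model_probs)

# One model is an outlier at 0.95; the aggregate stays near the others.
print(aggregate_forecasts([0.62, 0.70, 0.95]))  # 0.7
```

The aggregated probability can then be scored against resolved outcomes with the same Log Loss metric used for individual models.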
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.