Related papers: Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

URL: http://arxiv.org/abs/2402.19379v6
Date: Mon, 22 Jul 2024 13:50:27 GMT
Title: Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Authors: Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, Philip E. Tetlock,
Abstract summary: We use an ensemble approach consisting of a crowd of twelve large language models (LLMs) We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of human forecasters from a three-month forecasting tournament. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
Score: 1.999925939110439
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.

Related papers

Evaluating LLMs on Real-World Forecasting Against Expert Forecasters [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied.<n>I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters.
arXiv Detail & Related papers (2025-07-06T22:26:59Z)
Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
Hybrid Forecasting of Geopolitical Events [71.73737011120103]
SAGE is a hybrid forecasting system that combines human and machine generated forecasts. The system aggregates human and machine forecasts weighting both for propinquity and based on assessed skill. We show that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data.
arXiv Detail & Related papers (2024-12-14T22:09:45Z)
Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation. We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources. We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
Bayesian Statistical Modeling with Predictors from LLMs [5.5711773076846365]
State of the art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks. This raises questions about the human-likeness of LLM-derived information.
arXiv Detail & Related papers (2024-06-13T11:33:30Z)
Can Language Models Use Forecasting Strategies? [14.332379032371612]
We describe experiments using a novel dataset of real world events and associated human predictions. We find that models still struggle to make accurate predictions about the future.
arXiv Detail & Related papers (2024-06-06T19:01:42Z)
Approaching Human-Level Forecasting with Language Models [34.202996056121]
We study whether language models (LMs) can forecast at the level of competitive human forecasters. We develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions.
arXiv Detail & Related papers (2024-02-28T18:54:18Z)
AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy [3.7865171120254355]
Large language models (LLMs) match and sometimes exceed human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task.
arXiv Detail & Related papers (2024-02-12T18:14:43Z)
ExtremeCast: Boosting Extreme Value Prediction for Global Weather Forecast [57.6987191099507]
We introduce Exloss, a novel loss function that performs asymmetric optimization and highlights extreme values to obtain accurate extreme weather forecast. We also introduce ExBooster, which captures the uncertainty in prediction outcomes by employing multiple random samples. Our solution can achieve state-of-the-art performance in extreme weather prediction, while maintaining the overall forecast accuracy comparable to the top medium-range forecast models.
arXiv Detail & Related papers (2024-02-02T10:34:13Z)
Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI [0.0]
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector. Our analysis centered on the effect of the following factors on forecasters performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact.
arXiv Detail & Related papers (2023-12-12T02:28:12Z)
Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters. We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)
Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks. First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU. Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions. We make robust and efficient counterfactual predictions for both individual and average treatment effects. The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.