Related papers: Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

URL: http://arxiv.org/abs/2310.13014v1
Date: Tue, 17 Oct 2023 17:58:17 GMT
Title: Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament
Authors: Philipp Schoenegger and Peter S. Park
Abstract summary: We enroll OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. We show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction.
Score: 2.900810893770134
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of large language models to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4's forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question. We explore a potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, we find that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction; unlike in other benchmark tasks like professional exams or time series forecasting, where strong performance may at least partly be due to the answers being memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.
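The comparison described in the abstract (a forecaster against the median crowd forecast and against a constant 50% forecast) can be illustrated with a small scoring sketch. It assumes the Brier score as the accuracy measure, which the abstract itself does not name, and every number below is hypothetical and not taken from the paper. The function name `brier_score` is likewise an illustrative choice.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes.

    Lower is better; a constant 0.5 forecast always scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return mean([(p - o) ** 2 for p, o in zip(forecasts, outcomes)])


# Hypothetical resolved binary questions (1 = happened, 0 = did not).
outcomes = [1, 0, 0, 1, 0]

# Hypothetical forecasts; none of these values come from the paper.
model_forecasts = [0.6, 0.5, 0.4, 0.5, 0.6]
crowd_forecasts = [0.9, 0.1, 0.2, 0.7, 0.1]
baseline_forecasts = [0.5] * len(outcomes)  # no-information strategy

scores = {
    "model": brier_score(model_forecasts, outcomes),
    "crowd median": brier_score(crowd_forecasts, outcomes),
    "50% baseline": brier_score(baseline_forecasts, outcomes),
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this toy data the model's score sits close to the 0.25 of the constant-50% strategy while the crowd's score is much lower, which is the qualitative pattern the abstract reports. Whether such a gap is statistically significant would need a separate test that this sketch does not attempt.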

Related papers

Wisdom of the Crowds in Forecasting: Forecast Summarization for Supporting Future Event Prediction [17.021220773165016]
Future Event Prediction (FEP) is an essential activity whose demand and application range across multiple domains. One approach to forecasting is to gather and aggregate collective opinions about the future, since cumulative perspectives can help estimate the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts.
arXiv Detail & Related papers (2025-02-12T08:35:10Z)
Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
Hybrid Forecasting of Geopolitical Events [71.73737011120103]
SAGE is a hybrid forecasting system that combines human and machine-generated forecasts. The system aggregates human and machine forecasts, weighting both by propinquity and by assessed skill. We show that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data.
arXiv Detail & Related papers (2024-12-14T22:09:45Z)
Can Language Models Use Forecasting Strategies? [14.332379032371612]
We describe experiments using a novel dataset of real-world events and associated human predictions. We find that models still struggle to make accurate predictions about the future.
arXiv Detail & Related papers (2024-06-06T19:01:42Z)
Can Base ChatGPT be Used for Forecasting without Additional Optimization? [0.0]
This study investigates whether OpenAI's ChatGPT-3.5 and ChatGPT-4 can forecast future events. We employ two prompting strategies: direct prediction and what we call future narratives. After analyzing 100 trials, we find that future narrative prompts significantly enhanced ChatGPT-4's forecasting accuracy.
arXiv Detail & Related papers (2024-04-11T00:03:03Z)
ExtremeCast: Boosting Extreme Value Prediction for Global Weather Forecast [57.6987191099507]
We introduce Exloss, a novel loss function that performs asymmetric optimization and highlights extreme values to obtain accurate extreme weather forecasts. We also introduce ExBooster, which captures the uncertainty in prediction outcomes by employing multiple random samples. Our solution can achieve state-of-the-art performance in extreme weather prediction, while maintaining the overall forecast accuracy comparable to the top medium-range forecast models.
arXiv Detail & Related papers (2024-02-02T10:34:13Z)
Algorithmic Information Forecastability [0.0]
The degree of forecastability is a function of only the data. The paper distinguishes oracle forecastability for predictions that are always exact, precise forecastability for errors up to a bound, and probabilistic forecastability for any other predictions.
arXiv Detail & Related papers (2023-04-21T05:45:04Z)
FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead [93.67314652898547]
We present FengWu, an advanced data-driven global medium-range weather forecast system based on Artificial Intelligence (AI). FengWu is able to accurately reproduce the atmospheric dynamics and predict the future land and atmosphere states at 37 vertical levels on a 0.25° latitude-longitude resolution. The results suggest that FengWu can significantly improve the forecast skill and extend the skillful global medium-range weather forecast out to 10.75 days lead.
arXiv Detail & Related papers (2023-04-06T09:16:39Z)
Forecasting Future World Events with Neural Networks [68.43460909545063]
Autocast is a dataset containing thousands of forecasting questions and an accompanying news corpus. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts. We test language models on our forecasting task and find that performance is far below a human expert baseline.
arXiv Detail & Related papers (2022-06-30T17:59:14Z)
What Should I Know? Using Meta-gradient Descent for Predictive Feature Discovery in a Single Stream of Experience [63.75363908696257]
Computational reinforcement learning seeks to construct an agent's perception of the world through predictions of future sensations. An open challenge in this line of work is determining, from the infinitely many predictions that the agent could possibly make, which predictions might best support decision-making. We introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward.
arXiv Detail & Related papers (2022-06-13T21:31:06Z)
Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets. We observe that the trustworthiness predictors trained with prior-art loss functions are prone to view both correct predictions and incorrect predictions to be trustworthy. We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z)
Deep Probabilistic Koopman: Long-term time-series forecasting under periodic uncertainties [7.305019142196582]
We introduce a surprisingly simple approach that characterizes time-varying distributions and enables reasonably accurate predictions thousands of timesteps into the future. This technique, which we call Deep Probabilistic Koopman (DPK), is based on recent advances in linear Koopman operator theory. We demonstrate the long-term forecasting performance of these models on a diversity of domains, including electricity demand forecasting, atmospheric chemistry, and neuroscience.
arXiv Detail & Related papers (2021-06-10T20:22:41Z)
A generative adversarial network approach to (ensemble) weather prediction [91.3755431537592]
We use a conditional deep convolutional generative adversarial network to predict the geopotential height of the 500 hPa pressure level, the two-meter temperature and the total precipitation for the next 24 hours over Europe. The proposed models are trained on 4 years of ERA5 reanalysis data from 2015-2018 with the goal of predicting the associated meteorological fields in 2019.
arXiv Detail & Related papers (2020-06-13T20:53:17Z)
Measuring Forecasting Skill from Text [15.795144936579627]
We explore connections between the language people use to describe their predictions and their forecasting skill. We present a number of linguistic metrics which are computed over text associated with people's predictions about the future. We demonstrate that it is possible to accurately predict forecasting skill using a model that is based solely on language.
arXiv Detail & Related papers (2020-06-12T19:04:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.