ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
- URL: http://arxiv.org/abs/2409.19839v4
- Date: Mon, 06 Jan 2025 16:33:58 GMT
- Title: ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
- Authors: Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
- Abstract summary: ForecastBench is a dynamic benchmark for evaluating the forecasting accuracy of machine learning systems.
ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission.
We display system and human scores in a public leaderboard at www.forecastbench.org.
- Score: 5.029476863820779
- License:
- Abstract: Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-value $<0.01$). We display system and human scores in a public leaderboard at www.forecastbench.org.
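The abstract above does not spell out the scoring rule used on the leaderboard. As an illustration only, the sketch below assumes binary questions scored with the Brier score and compares a hypothetical LLM against a hypothetical human-expert aggregate on the same resolved questions; all names and numbers are placeholders, not ForecastBench data.

```python
# Illustrative only: Brier scoring of binary forecasts, with made-up numbers
# (ForecastBench's actual scoring and data live at www.forecastbench.org).

def brier_score(prob: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (prob - outcome) ** 2

# Hypothetical resolved questions: forecast probabilities and realized outcomes.
llm_forecasts    = [0.80, 0.35, 0.60, 0.10]   # placeholder values
expert_forecasts = [0.70, 0.20, 0.75, 0.05]   # placeholder values
outcomes         = [1, 0, 1, 0]               # placeholder resolutions

def mean_brier(forecasts, outcomes):
    return sum(brier_score(p, y) for p, y in zip(forecasts, outcomes)) / len(outcomes)

print("LLM mean Brier:   ", round(mean_brier(llm_forecasts, outcomes), 4))
print("Expert mean Brier:", round(mean_brier(expert_forecasts, outcomes), 4))
# Lower is better; 0.25 is the score of an uninformative 0.5 forecast on every question.
```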
Related papers
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.
This poses a significant challenge to ensuring their safe deployment.
We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions.
We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
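The pipeline is only described at a high level here; as a sketch of one kind of check such a system might instantiate, the code below tests whether a forecaster's probabilities for an event and its negation sum to roughly one. The `forecaster` function is a toy stand-in, not the paper's interface.

```python
# Sketch of a single consistency check: P(A) and P(not A) should sum to ~1.
# The `forecaster` below is a toy stand-in for an LLM forecaster.

def forecaster(question: str) -> float:
    """Toy forecaster: returns a probability in [0, 1] for a yes/no question."""
    toy_beliefs = {
        "Will candidate X win the election?": 0.62,
        "Will candidate X fail to win the election?": 0.45,  # inconsistent on purpose
    }
    return toy_beliefs.get(question, 0.5)

def negation_check(question: str, negated_question: str, tol: float = 0.05) -> bool:
    p = forecaster(question)
    p_neg = forecaster(negated_question)
    violation = abs((p + p_neg) - 1.0)
    print(f"P(A)={p:.2f}, P(not A)={p_neg:.2f}, violation={violation:.2f}")
    return violation <= tol

negation_check("Will candidate X win the election?",
               "Will candidate X fail to win the election?")
```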
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
- Hybrid Forecasting of Geopolitical Events [71.73737011120103]
SAGE is a hybrid forecasting system that combines human and machine generated forecasts.
The system aggregates human and machine forecasts, weighting both for propinquity and for assessed skill.
We show that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data.
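The abstract does not give SAGE's aggregation formula; the snippet below is only a generic sketch of skill- and recency-weighted pooling of human and machine probabilities, with invented weights and timestamps.

```python
# Generic sketch of weighting forecasts by assessed skill and recency
# (not SAGE's actual aggregation rule).
import math

def weighted_pool(forecasts, now=10.0, half_life=5.0):
    """forecasts: list of (probability, skill_weight, time_submitted)."""
    num, den = 0.0, 0.0
    for prob, skill, t in forecasts:
        recency = math.exp(-(now - t) / half_life)  # newer forecasts count more
        w = skill * recency
        num += w * prob
        den += w
    return num / den

human_and_machine = [
    (0.70, 1.5, 9.0),  # skilled human, recent forecast
    (0.55, 1.0, 4.0),  # machine model, older forecast
    (0.40, 0.5, 8.0),  # less-skilled human, recent forecast
]
print(round(weighted_pool(human_and_machine), 3))
```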
arXiv Detail & Related papers (2024-12-14T22:09:45Z)
- Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
- Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle [13.192628306219248]
We propose using future event prediction as a continuous evaluation method to assess Large Language Models' temporal generalization abilities.
Our benchmark, Daily Oracle, automatically generates question-answer pairs from daily news, challenging LLMs to predict "future" event outcomes.
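As a rough sketch of the continuous-evaluation idea (not Daily Oracle's actual pipeline), the code below turns dated news items into yes/no questions and grades a toy model's answers only once the resolution date has passed; the schema and data are invented.

```python
# Rough sketch of continuous evaluation against dated news outcomes
# (schema, data, and the toy model are invented for illustration).
from datetime import date

news = [
    {"date": date(2024, 11, 12), "event": "Team A won the championship", "happened": True},
    {"date": date(2024, 11, 12), "event": "Company B announced a merger", "happened": False},
]

def make_question(item):
    return f"Will the following be true by {item['date']}? {item['event']}"

def toy_model_answer(question: str) -> bool:
    return "won" in question  # placeholder "prediction"

today = date(2024, 11, 13)
resolved = [n for n in news if n["date"] <= today]
correct = sum(toy_model_answer(make_question(n)) == n["happened"] for n in resolved)
print(f"accuracy on resolved questions: {correct}/{len(resolved)}")
```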
arXiv Detail & Related papers (2024-11-13T04:20:20Z)
- Can We Predict Performance of Large Models across Vision-Language Tasks? [34.27319941609499]
We propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks.
We use a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset.
We demonstrate the accuracy of probabilistic matrix factorization (PMF) in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
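The paper's method is probabilistic matrix factorization with uncertainty estimates; the snippet below is only a bare-bones gradient-descent factorization of a small sparse score matrix to show the basic idea of filling in missing model-by-task scores, without the probabilistic machinery.

```python
# Bare-bones matrix factorization of a sparse model-by-task score matrix
# (illustrative only; the paper uses probabilistic matrix factorization).
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[0.8, 0.6, np.nan],
              [0.7, np.nan, 0.5],
              [np.nan, 0.4, 0.3]])          # rows: models, cols: tasks
mask = ~np.isnan(R)

k, lr, reg = 2, 0.05, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # model embeddings
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # task embeddings

for _ in range(2000):
    pred = U @ V.T
    err = np.where(mask, R - pred, 0.0)           # ignore unobserved entries
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 2))   # missing entries are now filled with predictions
```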
arXiv Detail & Related papers (2024-10-14T03:00:12Z)
- Can Language Models Use Forecasting Strategies? [14.332379032371612]
We describe experiments using a novel dataset of real world events and associated human predictions.
We find that models still struggle to make accurate predictions about the future.
arXiv Detail & Related papers (2024-06-06T19:01:42Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
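The abstract describes measuring uncertainty over the distribution of generated explanations; the sketch below shows the simpler, related idea of sampling several explanation-plus-answer generations and using answer agreement as a confidence proxy. The sampling function is a placeholder, and this is a self-consistency-style stand-in, not the paper's stable-explanation measure.

```python
# Simplified sketch: confidence from agreement across sampled explanation+answer pairs
# (a self-consistency-style proxy, not the paper's exact method).
from collections import Counter

def sample_explanation_and_answer(question: str, seed: int):
    """Placeholder for an LLM call that returns (explanation, answer)."""
    canned = [("reasoning v1", "B"), ("reasoning v2", "B"),
              ("reasoning v3", "A"), ("reasoning v4", "B")]
    return canned[seed % len(canned)]

def answer_confidence(question: str, n_samples: int = 4):
    answers = [sample_explanation_and_answer(question, s)[1] for s in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples

print(answer_confidence("Which option is correct, A or B?"))  # ('B', 0.75)
```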
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Forecasting Future World Events with Neural Networks [68.43460909545063]
Autocast is a dataset containing thousands of forecasting questions and an accompanying news corpus.
The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts.
We test language models on our forecasting task and find that performance is far below a human expert baseline.
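A key detail is that the news corpus is indexed by date, so a model only sees information available before each forecast date. The snippet below sketches that leakage guard with an invented corpus; it is not Autocast's retrieval code.

```python
# Sketch of leakage-free retrieval: only surface news published before the forecast date
# (corpus contents are invented for illustration).
from datetime import date

corpus = [
    {"published": date(2021, 5, 1), "headline": "Talks between countries X and Y resume"},
    {"published": date(2021, 9, 1), "headline": "Countries X and Y sign agreement"},
]

def retrieve(as_of: date):
    """Return only articles a forecaster could have read on `as_of`."""
    return [a["headline"] for a in corpus if a["published"] < as_of]

print(retrieve(date(2021, 6, 15)))   # the September article is correctly excluded
```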
arXiv Detail & Related papers (2022-06-30T17:59:14Z)
- MLDemon: Deployment Monitoring for Machine Learning Systems [10.074466859579571]
We propose a novel approach, MLDemon, for ML DEployment MONitoring.
MLDemon integrates both unlabeled features and a small number of on-demand labeled examples over time to produce a real-time estimate of the deployed model's performance.
On temporal datasets with diverse distribution drifts and models, MLDemon substantially outperforms existing monitoring approaches.
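The abstract describes combining unlabeled features with a small budget of on-demand labels; the code below is a toy monitor in that spirit (a crude drift statistic triggers label requests, and the labels refresh the accuracy estimate). It is not MLDemon's algorithm, and every threshold and budget here is invented.

```python
# Toy deployment monitor: watch a drift statistic on unlabeled inputs and
# spend a small labeling budget when drift looks large (not MLDemon's algorithm).
import random

random.seed(0)

def drift_statistic(reference, window):
    """Crude drift proxy: absolute difference in feature means."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(reference) - mean(window))

def monitor(stream, model, label_fn, budget=20, threshold=0.3, window_size=50):
    reference, window, acc_estimate = stream[:window_size], [], None
    for x in stream[window_size:]:
        window.append(x)
        if len(window) == window_size:
            if drift_statistic(reference, window) > threshold and budget > 0:
                sample = window[:min(10, budget)]   # request a few labels on demand
                budget -= len(sample)
                acc_estimate = sum(model(x) == label_fn(x) for x in sample) / len(sample)
            window = []
    return acc_estimate, budget

stream = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(1, 1) for _ in range(200)]
model = lambda x: x > 0.5            # deployed classifier
label_fn = lambda x: x > 1.0         # "ground truth" oracle, queried on demand
print(monitor(stream, model, label_fn))
```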
arXiv Detail & Related papers (2021-04-28T07:59:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.