Consistency Checks for Language Model Forecasters
- URL: http://arxiv.org/abs/2412.18544v2
- Date: Fri, 10 Jan 2025 01:06:06 GMT
- Title: Consistency Checks for Language Model Forecasters
- Authors: Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr
- Abstract summary: We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions.
We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
- Score: 54.62507816753479
- License:
- Abstract: Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance raises the question: how can we benchmark and evaluate these forecasters instantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based on arbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against the forecaster's predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.
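To make the arbitrage idea concrete, here is a minimal sketch of one such check, assuming two mutually exclusive outcomes and $1-payout contracts priced at the forecaster's probabilities; the paper's metric is more general, and the function below is illustrative only.

```python
def arbitrage_profit_mutually_exclusive(probs):
    """Guaranteed profit from selling one $1-payout contract on each of a set
    of mutually exclusive events, priced at the forecaster's probabilities.

    The seller collects sum(probs) up front and pays out at most $1, since at
    most one event can occur, so the worst-case profit is sum(probs) - 1.
    A consistent forecaster never offers a positive guarantee.
    """
    return max(0.0, sum(probs) - 1.0)


# The abstract's example: 60% for the Democratic and 60% for the Republican
# candidate winning the 2024 US presidential election.
print(arbitrage_profit_mutually_exclusive([0.6, 0.6]))  # 0.2 guaranteed profit
```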
Related papers
- Hybrid Forecasting of Geopolitical Events [71.73737011120103]
SAGE is a hybrid forecasting system that combines human- and machine-generated forecasts.
The system aggregates human and machine forecasts, weighting both by propinquity and by assessed skill.
We show that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data.
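As a rough illustration of skill- and recency-weighted aggregation (not SAGE's actual rule; the half-life decay below is an assumed stand-in for "propinquity"):

```python
def aggregate_forecasts(probs, skills, ages_days, half_life_days=30.0):
    """Weighted mean of probabilities; weight = assessed skill * recency decay."""
    weights = [
        skill * 0.5 ** (age / half_life_days)      # older forecasts count less
        for skill, age in zip(skills, ages_days)
    ]
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


# Two human forecasts and one machine forecast for the same binary question.
print(aggregate_forecasts(probs=[0.70, 0.55, 0.62],
                          skills=[1.2, 0.8, 1.0],
                          ages_days=[1, 10, 0]))
```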
arXiv Detail & Related papers (2024-12-14T22:09:45Z) - ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities [5.029476863820779]
ForecastBench is a dynamic benchmark for evaluating the forecasting accuracy of machine learning systems.
ForecastBench consists solely of questions about future events that have no known answer at the time of submission.
We display system and human scores in a public leaderboard at www.forecastbench.org.
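A minimal sketch of how such forecasts can be scored once questions resolve, using the Brier score (the leaderboard's exact scoring rule is not assumed here):

```python
def brier_score(prob, outcome):
    """Squared error between the predicted probability and the 0/1 outcome."""
    return (prob - outcome) ** 2


resolved = [(0.8, 1), (0.3, 0), (0.6, 0)]  # (forecast probability, outcome)
print(sum(brier_score(p, y) for p, y in resolved) / len(resolved))
```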
arXiv Detail & Related papers (2024-09-30T00:41:51Z) - Performative Prediction on Games and Mechanism Design [69.7933059664256]
We study a collective risk dilemma where agents decide whether to trust predictions based on past accuracy.
As predictions shape collective outcomes, social welfare arises naturally as a metric of concern.
We show how to achieve better trade-offs and use them for mechanism design.
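A toy simulation of this kind of setting is sketched below; the agent thresholds, costs, losses, and forecaster noise are all assumed values chosen only for illustration, not taken from the paper.

```python
import random

random.seed(0)
N_AGENTS, ROUNDS, ACTION_COST, COLLECTIVE_LOSS = 10, 50, 1.0, 5.0
trust_thresholds = [random.uniform(0.5, 0.9) for _ in range(N_AGENTS)]
correct, seen, welfare = 0, 0, 0.0

for _ in range(ROUNDS):
    risk = random.random() < 0.3                            # does the event occur?
    warning = risk if random.random() < 0.8 else not risk   # imperfect forecaster
    accuracy = correct / seen if seen else 0.5
    # Agents act on a warning only if the forecaster's track record clears their threshold.
    acting = sum(accuracy >= t for t in trust_thresholds) if warning else 0
    averted = acting >= N_AGENTS // 2                       # enough agents mitigate the risk
    welfare -= acting * ACTION_COST
    if risk and not averted:
        welfare -= N_AGENTS * COLLECTIVE_LOSS
    correct += (warning == risk)
    seen += 1

print(welfare)                                              # social welfare over the horizon
```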
arXiv Detail & Related papers (2024-08-09T16:03:44Z) - Approaching Human-Level Forecasting with Language Models [34.202996056121]
We study whether language models (LMs) can forecast at the level of competitive human forecasters.
We develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions.
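A hedged sketch of such a retrieval-augmented forecasting pipeline follows; every component is a stub standing in for the paper's actual retrieval, summarization, and LM calls.

```python
from statistics import median


def search_news(question):
    return ["(retrieved article text would go here)"]        # stub retriever


def summarize(articles, question):
    return " ".join(articles)[:2000]                          # stub summarizer


def ask_llm_for_probability(question, context):
    return 0.5                                                # stub LM call


def forecast(question, n_samples=5):
    articles = search_news(question)                          # gather relevant information
    context = summarize(articles, question)                   # condense it for the prompt
    samples = [ask_llm_for_probability(question, context) for _ in range(n_samples)]
    return median(samples)                                    # aggregate the sampled forecasts


print(forecast("Will <some event> happen before 2026?"))
```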
arXiv Detail & Related papers (2024-02-28T18:54:18Z) - Streaming Motion Forecasting for Autonomous Driving [71.7468645504988]
We introduce a benchmark that queries future trajectories on streaming data, which we refer to as "streaming forecasting."
Our benchmark inherently captures the disappearance and re-appearance of agents, which is a safety-critical problem yet overlooked by snapshot-based benchmarks.
We propose a plug-and-play meta-algorithm called "Predictive Streamer" that can adapt any snapshot-based forecaster into a streaming forecaster.
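A loose sketch of how a snapshot forecaster might be wrapped for streaming use, with occlusion gaps filled by the wrapper's own last prediction; this is an assumption-laden illustration, not the paper's Predictive Streamer.

```python
class StreamingWrapper:
    """Adapts a snapshot-based trajectory forecaster to streaming queries."""

    def __init__(self, snapshot_forecaster, history_len=8):
        self.forecaster = snapshot_forecaster
        self.history_len = history_len
        self.history = {}     # agent_id -> recent (x, y) observations
        self.last_pred = {}   # agent_id -> most recent prediction

    def step(self, observations):
        """observations: agent_id -> (x, y); occluded agents are simply absent."""
        for agent_id in set(self.history) | set(observations):
            pos = observations.get(agent_id) or self.last_pred.get(agent_id)
            if pos is None:
                continue
            buf = self.history.setdefault(agent_id, [])
            buf.append(pos)
            del buf[:-self.history_len]                 # keep a bounded history
        self.last_pred = {aid: self.forecaster(buf) for aid, buf in self.history.items()}
        return self.last_pred


def constant_velocity(buf):
    """Trivial stand-in for a snapshot forecaster: extrapolate the last step."""
    if len(buf) < 2:
        return buf[-1]
    (x0, y0), (x1, y1) = buf[-2], buf[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


w = StreamingWrapper(constant_velocity)
print(w.step({1: (0.0, 0.0)}))
print(w.step({1: (1.0, 0.5)}))
print(w.step({}))   # agent 1 occluded: its gap is filled with the last prediction
```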
arXiv Detail & Related papers (2023-10-02T17:13:16Z) - Regions of Reliability in the Evaluation of Multivariate Probabilistic
Forecasts [73.33395097728128]
We provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation.
We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions.
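One widely used proper scoring rule for multivariate probabilistic forecasts is the energy score; a standard sample-based estimate is sketched below (a textbook formula, not code from the paper).

```python
import numpy as np


def energy_score(samples, obs):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||.

    samples: (m, d) array of draws from the forecast distribution.
    obs: (d,) observed outcome.
    """
    term1 = np.linalg.norm(samples - obs, axis=1).mean()
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    return term1 - 0.5 * pairwise.mean()


rng = np.random.default_rng(0)
forecast_draws = rng.normal(size=(500, 3))        # forecast: standard normal in 3 dimensions
print(energy_score(forecast_draws, np.zeros(3)))  # score against an outcome at the mean
```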
arXiv Detail & Related papers (2023-04-19T17:38:42Z) - Post-selection Inference for Conformal Prediction: Trading off Coverage
for Precision [0.0]
Traditionally, conformal prediction inference requires a data-independent specification of the miscoverage level.
We develop simultaneous conformal inference to account for data-dependent miscoverage levels.
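For reference, here is a sketch of classical split conformal prediction with a fixed, data-independent miscoverage level alpha, the baseline that the post-selection procedure relaxes.

```python
import numpy as np


def split_conformal_interval(cal_pred, cal_y, test_pred, alpha=0.1):
    """Prediction interval from absolute residuals on a held-out calibration set."""
    scores = np.abs(cal_y - cal_pred)
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)     # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q


rng = np.random.default_rng(1)
cal_pred = rng.normal(size=200)                              # calibration predictions
cal_y = cal_pred + rng.normal(scale=0.5, size=200)           # calibration outcomes
print(split_conformal_interval(cal_pred, cal_y, test_pred=0.0, alpha=0.1))
```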
arXiv Detail & Related papers (2023-04-12T20:56:43Z) - What Should I Know? Using Meta-gradient Descent for Predictive Feature
Discovery in a Single Stream of Experience [63.75363908696257]
Computational reinforcement learning seeks to construct an agent's perception of the world through predictions of future sensations.
An open challenge in this line of work is determining which of the infinitely many predictions the agent could possibly make might best support decision-making.
We introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward.
arXiv Detail & Related papers (2022-06-13T21:31:06Z) - Comparing Sequential Forecasters [35.38264087676121]
Consider two forecasters, each making a single prediction for a sequence of events over time.
How might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated?
We present novel sequential inference procedures for estimating the time-varying difference in forecast scores.
We empirically validate our approaches by comparing real-world baseball and weather forecasters.
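A minimal sketch of tracking the running Brier-score difference between two forecasters; the paper's confidence-sequence machinery for this time-varying difference is not reproduced here.

```python
def running_brier_difference(probs_a, probs_b, outcomes):
    """Average Brier-score difference (A minus B) over the first t events."""
    diffs, cum = [], 0.0
    for t, (pa, pb, y) in enumerate(zip(probs_a, probs_b, outcomes), start=1):
        cum += (pa - y) ** 2 - (pb - y) ** 2    # positive means B has scored better so far
        diffs.append(cum / t)
    return diffs


print(running_brier_difference([0.7, 0.6, 0.2], [0.9, 0.4, 0.1], [1, 0, 0]))
```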
arXiv Detail & Related papers (2021-09-30T22:54:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.