Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
- URL: http://arxiv.org/abs/2506.21558v1
- Date: Wed, 11 Jun 2025 16:18:40 GMT
- Title: Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
- Authors: FutureSearch: Jack Wildman, Nikos I. Bosse, Daniel Hnyk, Peter Mühlbacher, Finn Hambly, Jon Evans, Dan Schwarz, Lawrence Phillips
- Abstract summary: Bench To the Future is a "pastcasting" benchmark with hundreds of high-quality questions for which the resolution is already known. Results suggest that our pastcasting environment can produce results comparable to those based on forecasts using the internet on at-the-time unresolved questions. We intend this to be a living benchmark, with new questions added continually to account for increasing training data cutoff dates.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Forecasting is a challenging task that offers a clearly measurable way to study AI systems. Forecasting requires a large amount of research on the internet, and evaluations require time for events to happen, making the development of forecasting benchmarks challenging. To date, no forecasting benchmark provides a realistic, hermetic, and repeatable environment for LLM forecasters. We introduce Bench To the Future (BTF), a "pastcasting" benchmark with hundreds of high-quality questions for which the resolution is already known. Each question is accompanied by a large offline corpus of tens of thousands of relevant web pages, enabling a way to elicit realistic "forecasts" on past events from LLMs. Results suggest that our pastcasting environment can produce results comparable to those based on forecasts using the internet on at-the-time unresolved questions. We show results benchmarking agent and chain-of-thought forecasting approaches using several LLMs, including the recently-released Claude 4 models, and demonstrate BTF's ability to track steady forecasting capability progress over time. We intend this to be a living benchmark, with new questions added continually to account for increasing training data cutoff dates. We invite researchers to contact us at hello@futuresearch.ai to utilize our benchmark or tooling for their own research.
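The core idea of pastcasting is to elicit a probability forecast from an LLM that only sees an offline corpus frozen before a question resolved, then score that forecast against the now-known resolution. A minimal sketch of such an evaluation loop, using the standard Brier score; all names (`Question`, `brier_score`, the example questions and forecasts) are illustrative assumptions, not the BTF API:

```python
# Sketch: score probability forecasts against known (past) resolutions.
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    resolution: bool  # known outcome, hidden from the forecaster


def brier_score(forecast: float, outcome: bool) -> float:
    """Squared error between a probability forecast and the 0/1 outcome."""
    return (forecast - float(outcome)) ** 2


# Hypothetical forecasts elicited from a model restricted to each
# question's offline corpus (pages crawled before resolution).
questions = [
    Question("Will event A happen by 2024?", True),
    Question("Will event B happen by 2024?", False),
]
forecasts = [0.8, 0.3]

mean_brier = sum(
    brier_score(f, q.resolution) for f, q in zip(forecasts, questions)
) / len(questions)
print(round(mean_brier, 3))  # 0.065 (lower is better; 0.25 = always 0.5)
```

Because resolutions are already known, the whole loop runs hermetically and repeatably, with no waiting for events to occur.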
Related papers
- PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation [46.3251656496956]
Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events. Several benchmarks have been established to evaluate the forecasting capabilities by formalizing the event prediction as a retrieval-augmented generation (RAG) and reasoning task. We introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval.
arXiv Detail & Related papers (2025-04-02T08:57:42Z)
- Wisdom of the Crowds in Forecasting: Forecast Summarization for Supporting Future Event Prediction [17.021220773165016]
Future Event Prediction (FEP) is an essential activity whose demand and applications range across multiple domains. One way to forecast is to gather and aggregate collective opinions about the future, since cumulative perspectives can help estimate the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts.
arXiv Detail & Related papers (2025-02-12T08:35:10Z)
- Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
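One of the simplest checks in this spirit: a coherent forecaster's probabilities for a question and its negation should sum to 1. A hedged sketch, where the function name and tolerance are illustrative and not the paper's actual evaluation system:

```python
# Sketch: measure one logical-consistency violation for a forecaster.
def negation_inconsistency(p_event: float, p_not_event: float) -> float:
    """Absolute deviation of P(A) + P(not A) from 1 (0 = fully consistent)."""
    return abs(p_event + p_not_event - 1.0)


# A forecaster asked both "Will X happen?" and "Will X not happen?":
print(round(negation_inconsistency(0.7, 0.3), 2))   # 0.0  -> consistent
print(round(negation_inconsistency(0.7, 0.45), 2))  # 0.15 -> inconsistent
```

Checks like this are attractive because they require no ground truth: inconsistency is measurable immediately, without waiting for the events to resolve.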
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities [5.029476863820779]
ForecastBench is a benchmark for evaluating the accuracy of machine learning systems. It consists solely of questions about future events that have no known answer at the time of submission. We display system and human scores on a public leaderboard at www.forecastbench.org.
arXiv Detail & Related papers (2024-09-30T00:41:51Z)
- F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data [65.6499834212641]
We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm.
By considering domain similarities through task-specific metadata, our model improves generalization, with the excess risk decreasing as the number of training tasks increases.
Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
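The percentage figures above are relative reductions in Mean Absolute Error (MAE). A minimal sketch of the metric and the reduction computation, with made-up numbers purely for illustration (not the paper's data):

```python
# Sketch: MAE and relative MAE reduction between two forecasters.
def mae(y_true, y_pred):
    """Mean Absolute Error over paired actuals and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [10, 12, 9]
baseline = mae(actual, [13, 10, 10])  # (3 + 2 + 1) / 3 = 2.0
improved = mae(actual, [11, 11, 9])   # (1 + 1 + 0) / 3 ~= 0.667

# Relative reduction, as reported (e.g. "reduces MAE by 26.24%"):
reduction_pct = 100 * (baseline - improved) / baseline
print(round(reduction_pct, 1))  # 66.7
```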
arXiv Detail & Related papers (2024-06-23T21:28:50Z)
- Streaming Motion Forecasting for Autonomous Driving [71.7468645504988]
We introduce a benchmark that queries future trajectories on streaming data, which we refer to as "streaming forecasting".
Our benchmark inherently captures the disappearance and re-appearance of agents, which is a safety-critical problem yet overlooked by snapshot-based benchmarks.
We propose a plug-and-play meta-algorithm called "Predictive Streamer" that can adapt any snapshot-based forecaster into a streaming forecaster.
arXiv Detail & Related papers (2023-10-02T17:13:16Z)
- Forecasting Future World Events with Neural Networks [68.43460909545063]
Autocast is a dataset containing thousands of forecasting questions and an accompanying news corpus.
The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts.
We test language models on our forecasting task and find that performance is far below a human expert baseline.
arXiv Detail & Related papers (2022-06-30T17:59:14Z)
- Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task.
The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature.
We formulate a novel problem and neural framework Back2Future that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z)
- ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data [43.400630267599084]
Event forecasting is a challenging, yet important task, as humans seek to constantly plan for the future.
We formulate a task, construct a dataset, and provide benchmarks for developing methods for event forecasting with large volumes of unstructured text data.
We present our experiments on ForecastQA using BERT-based models and find that our best model achieves 60.1% accuracy on the dataset, which still lags behind human performance by about 19%.
arXiv Detail & Related papers (2020-05-02T11:03:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.