FOReCAst: The Future Outcome Reasoning and Confidence Assessment Benchmark
- URL: http://arxiv.org/abs/2502.19676v3
- Date: Tue, 22 Apr 2025 04:15:56 GMT
- Title: FOReCAst: The Future Outcome Reasoning and Confidence Assessment Benchmark
- Authors: Zhangdie Yuan, Zifeng Ding, Andreas Vlachos
- Abstract summary: FOReCAst is a benchmark that evaluates models' ability to make predictions and their confidence in them. It spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation. It provides a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.
- Score: 11.149409619312827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Forecasting is an important task in many domains, such as technology and economics. However, existing forecasting benchmarks largely lack comprehensive confidence assessment, focus on limited question types, and often consist of artificial questions that do not align with real-world human forecasting needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models' ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.
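As context for how such a benchmark can score both answers and stated confidence, below is a minimal sketch of two standard metrics for Boolean forecasting questions: the Brier score and a binned expected calibration error. This is illustrative only, not FOReCAst's actual evaluation code; the function names and example data are made up.

```python
# Sketch: scoring Boolean forecasts with stated confidences via the Brier
# score and a simple binned expected calibration error (ECE).
# Illustrative only; not FOReCAst's evaluation code.
from typing import List

def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs: List[float], outcomes: List[int], bins: int = 10) -> float:
    """Average |confidence - accuracy| over equal-width probability bins."""
    total = len(probs)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        members = [(p, o) for p, o in zip(probs, outcomes)
                   if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_acc = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(avg_conf - avg_acc)
    return ece

# Example: three Boolean questions answered with probabilities for "yes",
# then resolved against ground truth.
probs = [0.9, 0.2, 0.7]
outcomes = [1, 0, 0]
print(brier_score(probs, outcomes), expected_calibration_error(probs, outcomes))
```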
Related papers
- PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation [46.3251656496956]
Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events.
Several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG) and reasoning task.
We introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval.
arXiv Detail & Related papers (2025-04-02T08:57:42Z)
- Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically related questions.
We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
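To make the idea of a consistency check concrete, here is a minimal sketch of one such check: probabilities elicited for a question and its logical negation should sum to roughly one. The forecaster probabilities and the tolerance are placeholder assumptions, not the paper's protocol.

```python
# Sketch: a negation-consistency check for a probabilistic forecaster.
# The elicited probabilities and tolerance below are illustrative assumptions.

def negation_violation(p_event: float, p_negation: float) -> float:
    """How far P(A) + P(not A) strays from 1 for logically complementary questions."""
    return abs(p_event + p_negation - 1.0)

# Elicit both phrasings of the same underlying question.
p_yes = 0.72  # e.g. "Will X happen by 2026?"
p_no = 0.19   # e.g. "Will X NOT happen by 2026?"

violation = negation_violation(p_yes, p_no)
print(f"violation = {violation:.2f}")  # 0.09; flag if above a chosen tolerance
assert violation < 0.15, "forecaster is inconsistent on this question pair"
```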
- The Certainty Ratio $C_\rho$: a novel metric for assessing the reliability of classifier predictions [0.0]
This paper introduces the Certainty Ratio ($C_\rho$), a novel metric designed to quantify the contribution of confident (certain) versus uncertain predictions to any classification performance measure.
Experimental results across 21 datasets and multiple classifiers, including Decision Trees, Naive Bayes, 3-Nearest Neighbors, and Random Forests, demonstrate that $C_\rho$ reveals critical insights that conventional metrics often overlook.
arXiv Detail & Related papers (2024-11-04T10:50:03Z)
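The paper's precise definition of $C_\rho$ is not reproduced above, so the sketch below only illustrates the underlying idea: separating a performance measure into contributions from confident versus uncertain predictions. The confidence threshold and the decomposition are assumptions for illustration, not the paper's metric.

```python
# Sketch of the idea behind a certainty-ratio-style metric: split predictions
# by confidence and compare the accuracy contribution of each group.
# Threshold and decomposition are illustrative; not the paper's definition of C_rho.
import numpy as np

def certainty_split_accuracy(probs: np.ndarray, labels: np.ndarray, threshold: float = 0.8):
    """Accuracy among confident (max prob >= threshold) vs. uncertain predictions."""
    confident = probs.max(axis=1) >= threshold
    preds = probs.argmax(axis=1)
    correct = preds == labels
    acc_certain = correct[confident].mean() if confident.any() else float("nan")
    acc_uncertain = correct[~confident].mean() if (~confident).any() else float("nan")
    return acc_certain, acc_uncertain, confident.mean()

# Example: four 2-class predictions, two confident and two uncertain.
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.90, 0.10], [0.60, 0.40]])
labels = np.array([0, 1, 1, 0])
print(certainty_split_accuracy(probs, labels))
```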
- Forecasting Company Fundamentals [19.363166648866066]
We evaluate 22 deterministic and probabilistic company fundamentals forecasting models on real company data.
We find that deep learning models provide superior forecasting performance compared to classical models.
We show how these high-quality forecasts can benefit automated stock allocation.
arXiv Detail & Related papers (2024-10-21T14:21:43Z)
- Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts [73.33395097728128]
We provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation.
We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions.
arXiv Detail & Related papers (2023-04-19T17:38:42Z)
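For readers unfamiliar with proper scoring rules of the kind studied above, below is a minimal sketch of the standard sample-based estimator of the Continuous Ranked Probability Score (CRPS). It is textbook material shown for illustration, not the paper's benchmark code.

```python
# Sketch: sample-based estimator of the Continuous Ranked Probability Score
# (CRPS), a proper scoring rule for probabilistic forecasts.
# Standard estimator; not the benchmark code from the paper above.
import numpy as np

def crps_samples(samples: np.ndarray, observation: float) -> float:
    """CRPS(F, y) ~= E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F i.i.d."""
    term1 = np.mean(np.abs(samples - observation))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
forecast = rng.normal(loc=1.0, scale=2.0, size=500)  # samples from the forecast distribution
print(crps_samples(forecast, observation=0.5))       # lower is better
```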
- Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting that includes multiple models and supports several datasets.
We model two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z)
- What Should I Know? Using Meta-gradient Descent for Predictive Feature Discovery in a Single Stream of Experience [63.75363908696257]
Computational reinforcement learning seeks to construct an agent's perception of the world through predictions of future sensations.
An open challenge in this line of work is determining, from the infinitely many predictions the agent could possibly make, which predictions might best support decision-making.
We introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward.
arXiv Detail & Related papers (2022-06-13T21:31:06Z)
- Evaluation of Machine Learning Techniques for Forecast Uncertainty Quantification [0.13999481573773068]
Ensemble forecasting is, so far, the most successful approach to producing relevant forecasts along with an estimate of their uncertainty.
Its main limitations are the high computational cost and the difficulty of capturing and quantifying different sources of uncertainty.
In this work, proof-of-concept experiments examine the performance of ANNs trained to predict a corrected state of the system and the state uncertainty, using only a single deterministic forecast as input.
arXiv Detail & Related papers (2021-11-29T16:52:17Z)
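A minimal sketch of the kind of setup described above: a small network maps a single deterministic forecast to a corrected state plus a per-variable uncertainty and is trained with a Gaussian negative log-likelihood. The architecture, sizes, and synthetic data are placeholders, not the paper's configuration.

```python
# Sketch: train an ANN to output a corrected state and its uncertainty from a
# single deterministic forecast, using a Gaussian negative log-likelihood.
# Architecture, sizes, and synthetic data are illustrative placeholders.
import torch
import torch.nn as nn

state_dim = 8
net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * state_dim))

def gaussian_nll(out: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """NLL of target under a diagonal Gaussian with predicted mean and log-variance."""
    mean, log_var = out.chunk(2, dim=-1)
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    forecast = torch.randn(32, state_dim)                 # deterministic model forecast
    truth = forecast + 0.3 * torch.randn(32, state_dim)   # synthetic "true" state
    loss = gaussian_nll(net(forecast), truth)
    opt.zero_grad()
    loss.backward()
    opt.step()
```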
- Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based.
We analyze UQ methods from both the Bayesian and the frequentist points of view, casting them in a unified framework via statistical decision theory.
Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical and computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z)
- Demand Forecasting of Individual Probability Density Functions with Machine Learning [0.0]
This work proposes new techniques for assessing the accuracy of predicted distributions.
Using the supervised machine learning method "Cyclic Boosting", complete individual probability density functions can be predicted such that each prediction is fully explainable.
arXiv Detail & Related papers (2020-09-15T13:05:05Z)
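The paper above proposes its own techniques for assessing predicted distributions; as a generic point of comparison, below is a sketch of one standard check, the probability integral transform (PIT), whose values should be uniform when predictive distributions are calibrated. The Gaussian predictive form and the data are illustrative assumptions.

```python
# Sketch: the probability integral transform (PIT) as a standard calibration
# check for predicted probability distributions. If predictions are calibrated,
# the predicted CDF evaluated at the observed outcomes is uniform on [0, 1].
# Illustrative only; not the assessment techniques proposed in the paper above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Each prediction is a full distribution; here, Gaussians with per-item means.
means = rng.uniform(10, 20, size=1000)
observed = rng.normal(means, 2.0)                       # outcomes from the true process
pit = stats.norm.cdf(observed, loc=means, scale=2.0)    # predicted CDF at each outcome
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
print(hist)  # roughly flat counts indicate calibrated predictive distributions
```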