Related papers: The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification

URL: http://arxiv.org/abs/2512.12059v1
Date: Fri, 12 Dec 2025 21:59:53 GMT
Title: The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification
Authors: Luke Bhan, Hanyu Zhang, Andrew Gordon Wilson, Michael W. Mahoney, Chuck Arvin,
Abstract summary: We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring. We evaluate the ability of LLMs to assess time series forecast quality. We present three experiments, including on both synthetic and real-world forecasting data.
Score: 74.64864354503204
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong "reasoning" capabilities. As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions. (1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? (2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? (3) How does performance vary across model sizes and reasoning capabilities, measured across state-of-the-art LLMs? We present three experiments, including on both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. The best-performing model we evaluated achieves an F1 score of 0.88, somewhat below human-level performance (F1 score: 0.97). We also demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: 0.84). Lastly, we demonstrate that these techniques succeed in identifying inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10% higher than that of reasonable forecasts. These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.
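Note on the F1 figures quoted in the abstract: the following is a minimal sketch of how an F1 score is computed when a critic labels each forecast as unreasonable (positive class, 1) or reasonable (0). The function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 = forecast flagged as unreasonable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but reasonable
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed bad forecasts
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it penalizes a critic that either flags too many reasonable forecasts or misses too many unreasonable ones.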

Related papers

Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z)
Revisiting LLMs as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models [32.30528039193554]
Large Language Models (LLMs) have shown remarkable performance across diverse tasks without domain-specific training. Recent studies suggest that LLMs lack inherent effectiveness in forecasting. Our experiments show that LLM-based zero-shot forecasters often struggle to achieve high accuracy due to their sensitivity to noise.
arXiv Detail & Related papers (2025-05-31T08:24:01Z)
Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
Can Language Models Use Forecasting Strategies? [14.332379032371612]
We describe experiments using a novel dataset of real world events and associated human predictions. We find that models still struggle to make accurate predictions about the future.
arXiv Detail & Related papers (2024-06-06T19:01:42Z)
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs). We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of human forecasters from a three-month forecasting tournament. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z)
ExtremeCast: Boosting Extreme Value Prediction for Global Weather Forecast [57.6987191099507]
We introduce Exloss, a novel loss function that performs asymmetric optimization and highlights extreme values to obtain accurate extreme weather forecast. We also introduce ExBooster, which captures the uncertainty in prediction outcomes by employing multiple random samples. Our solution can achieve state-of-the-art performance in extreme weather prediction, while maintaining the overall forecast accuracy comparable to the top medium-range forecast models.
arXiv Detail & Related papers (2024-02-02T10:34:13Z)
Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI [0.0]
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector. Our analysis centered on the effect of the following factors on forecasters' performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact.
arXiv Detail & Related papers (2023-12-12T02:28:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.