Related papers: Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

URL: http://arxiv.org/abs/2509.23074v1
Date: Sat, 27 Sep 2025 02:56:06 GMT
Title: Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting
Authors: Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li
Abstract summary: We introduce a predictability-aligned diagnostic framework grounded in spectral coherence. We provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks.
Score: 18.018179328110048
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.
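Illustration: the abstract does not define SCP or LUR, so the sketch below is not the paper's method. It only shows the standard quantity the framework is said to be grounded in, magnitude-squared coherence between two series, estimated with Welch-style segment averaging over an FFT. The function names `fft` and `msc`, the segment length, and the toy signals are all assumptions made for this example.

```python
import cmath
import math
import random


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def msc(x, y, seg_len=64):
    """Magnitude-squared coherence per frequency bin (hypothetical helper).

    Averages auto- and cross-spectra over non-overlapping, Hann-windowed
    segments, then returns |Sxy|^2 / (Sxx * Syy) for bins 0..seg_len/2.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / seg_len)
              for i in range(seg_len)]
    half = seg_len // 2 + 1
    sxx = [0.0] * half
    syy = [0.0] * half
    sxy = [0j] * half
    usable = min(len(x), len(y))
    for start in range(0, usable - seg_len + 1, seg_len):
        fx = fft([x[start + i] * window[i] for i in range(seg_len)])
        fy = fft([y[start + i] * window[i] for i in range(seg_len)])
        for k in range(half):
            sxx[k] += abs(fx[k]) ** 2
            syy[k] += abs(fy[k]) ** 2
            sxy[k] += fx[k] * fy[k].conjugate()
    eps = 1e-12  # guards against division by zero in empty bins
    return [abs(sxy[k]) ** 2 / (sxx[k] * syy[k] + eps) for k in range(half)]


if __name__ == "__main__":
    random.seed(0)
    n = 1024
    # Toy data: x and y share a sinusoid (8 cycles per 64 samples) with
    # independent noise; z is pure noise unrelated to x.
    x = [math.sin(2 * math.pi * 8 * t / 64) + random.gauss(0, 0.5)
         for t in range(n)]
    y = [math.sin(2 * math.pi * 8 * (t - 3) / 64) + random.gauss(0, 0.5)
         for t in range(n)]
    z = [random.gauss(0, 1) for _ in range(n)]
    c_xy = msc(x, y)
    c_xz = msc(x, z)
    print("coherence of x and y at the shared frequency bin:", c_xy[8])
    print("mean coherence of x and unrelated noise z:", sum(c_xz) / len(c_xz))
```

In this toy setup the coherence is high only at the frequency the two series share and low for unrelated noise, which is the kind of frequency-resolved, linearly predictable structure the abstract refers to. How the paper turns coherence into a single SCP score, or into LUR, is not stated in the abstract and is not reproduced here.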

Related papers

STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z)
SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series [11.314952720053464]
We propose a synthetic data-driven evaluation paradigm, SynTSBench, for time series forecasting models. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions. Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal features.
arXiv Detail & Related papers (2025-10-23T06:59:38Z)
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
Revisiting Multivariate Time Series Forecasting with Missing Values [74.56971641937771]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
Learning Temporal Saliency for Time Series Forecasting with Cross-Scale Attention [5.992220383989106]
We present CrossScaleNet, an innovative architecture that combines a patch-based cross-attention mechanism with multi-scale processing. Our evaluations demonstrate superior performance in both temporal saliency detection and forecasting accuracy.
arXiv Detail & Related papers (2025-09-26T18:43:51Z)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection [78.18946529195254]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments [5.5855749614100825]
This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
arXiv Detail & Related papers (2025-05-25T23:17:47Z)
ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables [30.679739751673655]
This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks. In evaluations on both synthetic and real datasets, our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.
arXiv Detail & Related papers (2025-03-15T12:34:19Z)
Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Incremental Outlier Detection Modelling Using Streaming Analytics in Finance & Health Care [0.0]
In the era of real-time data, traditional methods often struggle to keep pace with the dynamic nature of streaming environments. In this paper, we proposed a hybrid framework where the model is built once and evaluated in a real-time environment. We employed 8 distinct state-of-the-art outlier detection models, including one-class support vector machine (OCSVM), isolation forest adaptive sliding window approach (IForest ASD), exact storm (ES), angle-based outlier detection (ABOD), local outlier factor (LOF), Kitsunes online algorithm (KitNet), and K-nearest neighbour
arXiv Detail & Related papers (2023-05-17T02:30:28Z)
Mlinear: Rethink the Linear Model for Time-series Forecasting [9.841293660201261]
Mlinear is a simple yet effective method based mainly on linear layers. We introduce a new loss function that significantly outperforms the widely used mean squared error (MSE) on multiple datasets. Our method significantly outperforms PatchTST with a ratio of 21:3 at 336 sequence length input and 29:10 at 512 sequence length input.
arXiv Detail & Related papers (2023-05-08T15:54:18Z)
Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task. The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature. We formulate a novel problem and a neural framework, Back2Future, that aims to refine a given model's predictions in real time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z)
Learning Prediction Intervals for Model Performance [1.433758865948252]
We propose a method to compute prediction intervals for model performance. We evaluate our approach across a wide range of drift conditions and show substantial improvement over competitive baselines.
arXiv Detail & Related papers (2020-12-15T21:32:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.