Why Do Transformers Fail to Forecast Time Series In-Context?
- URL: http://arxiv.org/abs/2510.09776v1
- Date: Fri, 10 Oct 2025 18:34:19 GMT
- Title: Why Do Transformers Fail to Forecast Time Series In-Context?
- Authors: Yufa Zhou, Yixiao Wang, Surbhi Goel, Anru R. Zhang
- Abstract summary: Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning. Empirical evidence consistently shows that even powerful Transformers often fail to outperform simpler models. We provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory.
- Score: 21.43699354236011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR($p$) data, we establish that: (1) Linear Self-Attention (LSA) models $\textit{cannot}$ achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse to the mean exponentially. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.
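The three findings above admit a quick numerical illustration. The sketch below is a minimal stand-in, not the paper's LSA construction: it simulates an AR(2) process, fits the classical linear (OLS) predictor from the context (the baseline that findings (1) and (2) say LSA cannot beat and asymptotically matches), and then runs a CoT-style iterated rollout whose noise-free predictions contract toward the process mean geometrically, mirroring finding (3). All coefficients and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulate a stationary AR(2) process: x_t = a1*x_{t-1} + a2*x_{t-2} + eps_t ---
p = 2
a = np.array([0.6, 0.3])   # both roots of z^2 - 0.6 z - 0.3 lie inside the unit circle
T = 10_000
x = np.zeros(T)
for t in range(p, T):
    x[t] = a @ x[t - p:t][::-1] + 0.1 * rng.standard_normal()

# --- Classical linear predictor: OLS on lag features taken from the context ---
# Row t of the design matrix is (x_{t-1}, ..., x_{t-p}).
X = np.column_stack([x[p - 1 - i : T - 1 - i] for i in range(p)])
y = x[p:]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("recovered AR coefficients:", w)   # close to [0.6, 0.3] for long contexts

# --- CoT-style inference: feed each prediction back in, with no fresh noise ---
history = list(x[-p:])
preds = []
for _ in range(50):
    nxt = float(w @ np.array(history[-p:][::-1]))
    preds.append(nxt)
    history.append(nxt)

# With all AR roots inside the unit circle, this noise-free rollout contracts
# toward the process mean (0 here) at a geometric rate.
print("|prediction| at horizons 1, 11, 21, 31, 41:", np.abs(preds)[::10])
```

Since every root of the AR polynomial lies inside the unit circle, the rollout magnitude at horizon $h$ scales like $\rho^h$ for spectral radius $\rho < 1$, which is exactly an exponential collapse to the mean.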
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
- Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization [8.58740389510812]
We develop a theoretical analysis of gradient flow dynamics in two-layer ReLU networks trained with logistic loss. An excessively large feature learning strength (FLS) induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.
arXiv Detail & Related papers (2026-01-31T17:43:02Z)
- How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for the search factor, effectively reducing the search complexity from $O(n^3)$ to O(n*C_D*C_) via predictive modeling. We extend the principles of $\mu$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z)
- ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters [48.79331759671512]
We introduce a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that combines the advantages of both Auto-Regressive (AR) and Direct Output (DO) strategies.
arXiv Detail & Related papers (2025-09-30T08:05:59Z)
- How LLMs Learn to Reason: A Complex Network Perspective [14.638878448692493]
Training large language models with Reinforcement Learning from Verifiable Rewards exhibits a set of puzzling behaviors. We propose that these seemingly disparate phenomena can be explained using a single unifying theory. Our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
arXiv Detail & Related papers (2025-09-28T04:10:37Z)
- Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training [36.69514399442043]
This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks.
arXiv Detail & Related papers (2025-07-07T18:17:06Z)
- Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than to the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding (a minimal collapse demo appears after this list).
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond [17.002793355495136]
We propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to $\textbf{Asymmetric Learning}$ in training attention networks.
arXiv Detail & Related papers (2024-12-08T20:29:06Z)
- Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities. This work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z)
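As a generic illustration of the self-consuming failure mode described in the STL entry above (a textbook collapse demo, not that paper's analysis), the sketch below repeatedly fits a Gaussian and retrains on its own samples; the fitted standard deviation drifts toward zero over generations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generation 0 trains on real data: n samples from N(0, 1).
n = 100
data = rng.standard_normal(n)

# Self-consuming loop: fit a Gaussian, then retrain on samples drawn from it.
for gen in range(501):
    mu, sigma = data.mean(), data.std()
    if gen % 100 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")
    data = rng.normal(mu, sigma, n)   # the next generation sees only synthetic data

# Each refit loses a little variance on average, so the fitted distribution
# drifts toward a point mass: a minimal example of model collapse in an STL.
```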