Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond
- URL: http://arxiv.org/abs/2412.06061v2
- Date: Fri, 28 Feb 2025 20:36:37 GMT
- Title: Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond
- Authors: Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang
- Abstract summary: We propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to **Asymmetric Learning** in training attention networks.
- Score: 17.002793355495136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of transformer-based models to time series forecasting (TSF) tasks has long been a popular subject of study. However, many of these works fail to beat a simple linear residual model, and the theoretical understanding of this issue is still limited. In this work, we propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the underlying mechanism to **Asymmetric Learning** in training attention networks. When the sign of the previous step is inconsistent with the sign of the current step in next-step-prediction time series, attention fails to learn the residual features. This makes it difficult to generalize to out-of-distribution (OOD) data with the same representation pattern, especially sign-inconsistent next-step-prediction data, whereas a linear residual network can accomplish this easily. We hope our theoretical insights provide important necessary conditions for practitioners designing expressive and efficient transformer-based architectures.
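To make the sign-inconsistency argument concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper): the toy next-step target simply flips the sign of the previous step, and a linear residual predictor of the form y = x + Wx recovers it exactly by least squares, which is the behaviour the abstract contrasts with attention.

```python
import numpy as np

# Toy "sign-inconsistent next-step prediction": the target flips the sign of
# the previous step, so the residual feature (y - x) carries all the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))     # representations of the previous step
Y = -X                             # next step has the opposite sign

# Linear residual predictor: y_hat = x + x @ W (identity skip plus learned map).
# Least squares on the residual target Y - X gives W close to -2 * I, so the
# skip connection plus W reproduces the sign flip exactly.
W, *_ = np.linalg.lstsq(X, Y - X, rcond=None)
print(np.allclose(X + X @ W, Y))   # True
```

The sketch only illustrates the data pattern described in the abstract; the paper's contribution is the analysis of why attention training fails to learn such residual features.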
Related papers
- HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens [1.534667887016089]
We introduce history tokens, a novel concept that facilitates the accumulation of historical information during prediction pretraining.
Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks.
arXiv Detail & Related papers (2025-08-02T19:50:58Z) - Born a Transformer -- Always a Transformer? [57.37263095476691]
We study a family of *retrieval* and *copying* tasks inspired by Liu et al.
We observe an *induction-versus-anti-induction* asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than to the left (anti-induction) of a query token.
Mechanistic analysis reveals that this asymmetry is connected to differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z) - TransDF: Time-Series Forecasting Needs Transformed Label Alignment [53.33409515800757]
We propose Transform-enhanced Direct Forecast (TransDF), which transforms the label sequence into decorrelated components with discriminated significance.
Models are trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing task amount.
arXiv Detail & Related papers (2025-05-23T13:00:35Z) - Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-Thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps.
We show that in-context learning without intermediate steps (ICL) fails to provide accurate generalization in settings where CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (see the brief sketch after this list).
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - GBT: Two-stage transformer framework for non-stationary time series forecasting [3.830797055092574]
We propose GBT, a novel two-stage Transformer framework with Good Beginning.
It decouples the prediction process of time series forecasting Transformers (TSFT) into two stages: an Auto-Regression stage and a Self-Regression stage.
Experiments on seven benchmark datasets demonstrate that GBT outperforms SOTA TSFTs with only canonical attention and convolution.
arXiv Detail & Related papers (2023-07-17T07:55:21Z) - CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dynamical dependence among multiple variables over time.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, albeit at high computational cost.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus of our experiments is on oil and gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z) - Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting [86.33543833145457]
We propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention.
Our framework consistently boosts mainstream Transformers by a large margin, which reduces MSE by 49.43% on Transformer, 47.34% on Informer, and 46.89% on Reformer.
arXiv Detail & Related papers (2022-05-28T12:27:27Z) - Are Transformers Effective for Time Series Forecasting? [13.268196448051308]
Recently, there has been a surge of Transformer-based solutions for the time series forecasting (TSF) task.
This study investigates whether Transformer-based techniques are the right solutions for long-term time series forecasting.
We find that the relatively higher long-term forecasting accuracy of Transformer-based solutions has little to do with the temporal relation extraction capabilities of the Transformer architecture.
arXiv Detail & Related papers (2022-05-26T17:17:08Z) - NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting [24.510978166050293]
This work is the first attempt to propose a Non-Autoregressive Transformer architecture for time series forecasting.
We present a novel spatial-temporal attention mechanism, building a bridge by a learned temporal influence map to fill the gaps between the spatial and temporal attention.
arXiv Detail & Related papers (2021-02-10T18:36:11Z) - Spatio-Temporal Graph Scattering Transform [54.52797775999124]
Graph neural networks may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data.
We put forth a novel, mathematically designed framework to analyze spatio-temporal data.
arXiv Detail & Related papers (2020-12-06T19:49:55Z)
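As referenced in the iTransformer entry above, the following minimal PyTorch sketch (our own illustration with made-up shapes, not the official implementation) shows the "inverted dimensions" idea: each variate's entire lookback window is embedded as a single token, so attention mixes variates rather than time steps.

```python
import torch

B, T, V, d = 32, 96, 7, 64                          # batch, lookback length, variates, model dim
x = torch.randn(B, T, V)                            # a batch of multivariate series
tokens = torch.nn.Linear(T, d)(x.transpose(1, 2))   # (B, V, d): one token per variate
attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)               # attention over the variate axis
print(out.shape)                                    # torch.Size([32, 7, 64])
```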
This list is automatically generated from the titles and abstracts of the papers on this site.