Exploring the Role of Token in Transformer-based Time Series Forecasting
- URL: http://arxiv.org/abs/2404.10337v3
- Date: Wed, 30 Oct 2024 01:49:45 GMT
- Title: Exploring the Role of Token in Transformer-based Time Series Forecasting
- Authors: Jianqi Zhang, Jingyao Wang, Chuxiong Sun, Xingchen Shen, Fanjiang Xu, Changwen Zheng, Wenwen Qiang
- Abstract summary: Transformer-based methods are a mainstream approach for solving time series forecasting (TSF).
Most focus on optimizing the model structure, with few studies paying attention to the role of tokens for predictions.
We find that the gradients mainly depend on tokens that contribute to the predicted series, called positive tokens.
To utilize T-PE and V-PE, we propose T2B-PE, a Transformer-based dual-branch framework.
- Score: 10.081240480138487
- License:
- Abstract: Transformer-based methods are a mainstream approach for solving time series forecasting (TSF). These methods use temporal or variable tokens from observable data to make predictions. However, most focus on optimizing the model structure, with few studies paying attention to the role of tokens for predictions. The role is crucial since a model that distinguishes useful tokens from useless ones will predict more effectively. In this paper, we explore this issue. Through theoretical analyses, we find that the gradients mainly depend on tokens that contribute to the predicted series, called positive tokens. Based on this finding, we explore what helps models select these positive tokens. Through a series of experiments, we obtain three observations: i) positional encoding (PE) helps the model identify positive tokens; ii) as the network depth increases, the PE information gradually weakens, affecting the model's ability to identify positive tokens in deeper layers; iii) both enhancing PE in the deeper layers and using semantic-based PE can improve the model's ability to identify positive tokens, thus boosting performance. Inspired by these findings, we design temporal positional encoding (T-PE) for temporal tokens and variable positional encoding (V-PE) for variable tokens. To utilize T-PE and V-PE, we propose T2B-PE, a Transformer-based dual-branch framework. Extensive experiments demonstrate that T2B-PE has superior robustness and effectiveness.
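The abstract's central recipe, a dual branch over temporal tokens and variable tokens with positional encoding kept visible in deeper layers, can be illustrated with a short PyTorch sketch. The module names, the learned-PE parameterization, the per-layer PE re-injection, and the concatenation-based fusion head below are illustrative assumptions, not the authors' T2B-PE, T-PE, or V-PE implementation:

```python
import torch
import torch.nn as nn


class PEReinjectedEncoder(nn.Module):
    """Encoder stack that adds its positional encoding before every layer,
    so PE information is not washed out at depth (assumed reading of finding iii)."""

    def __init__(self, n_tokens: int, d_model: int, n_layers: int = 3, n_heads: int = 4):
        super().__init__()
        self.pe = nn.Parameter(torch.randn(1, n_tokens, d_model) * 0.02)  # learned PE (assumption)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_tokens, d_model)
        for layer in self.layers:
            x = layer(x + self.pe)  # re-inject PE at every depth
        return x


class DualBranchForecaster(nn.Module):
    """Temporal-token branch (one token per time step) plus variable-token branch
    (one token per series), fused into the forecast; fusion by concatenation is an assumption."""

    def __init__(self, n_vars: int, lookback: int, horizon: int, d_model: int = 64):
        super().__init__()
        self.temporal_embed = nn.Linear(n_vars, d_model)    # embed each time step
        self.temporal_enc = PEReinjectedEncoder(lookback, d_model)
        self.variable_embed = nn.Linear(lookback, d_model)  # embed each variable's history
        self.variable_enc = PEReinjectedEncoder(n_vars, d_model)
        self.head = nn.Linear((lookback + n_vars) * d_model, horizon * n_vars)
        self.horizon, self.n_vars = horizon, n_vars

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, lookback, n_vars)
        t = self.temporal_enc(self.temporal_embed(x))                   # (B, lookback, d)
        v = self.variable_enc(self.variable_embed(x.transpose(1, 2)))   # (B, n_vars, d)
        z = torch.cat([t.flatten(1), v.flatten(1)], dim=-1)
        return self.head(z).view(-1, self.horizon, self.n_vars)


if __name__ == "__main__":
    model = DualBranchForecaster(n_vars=7, lookback=96, horizon=24)
    print(model(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 24, 7])
```

The toy run at the bottom only checks tensor shapes; the actual T-PE and V-PE designs may differ in how positional information is constructed and injected.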
Related papers
- Improving Next Tokens via Second-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture to predict the second-last token of a sequence of tokens.
Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing the rank-1 (independent per-position) factorization to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
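As a hedged illustration of what a rank-$r$ canonical (CP) decomposition of the joint distribution over the next $k$ tokens could look like (the paper's exact parameterization and conditioning may differ):

```latex
% Joint distribution over the next k tokens as a mixture of r rank-1 terms;
% r = 1 recovers k independent per-position prediction heads.
p(y_1, \dots, y_k \mid x) \;\approx\; \sum_{a=1}^{r} w_a(x) \prod_{j=1}^{k} p_{j,a}(y_j \mid x),
\qquad w_a(x) \ge 0, \quad \sum_{a=1}^{r} w_a(x) = 1 .
```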
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
- Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation [59.18151483767509]
We introduce dual-path token lifting for domain shift correction in test-time adaptation.
We then perform dual-path lifting, interleaving token prediction and update between the domain-shift-token path and the class-token path.
Experimental results on the benchmark datasets demonstrate that our proposed method significantly improves the online fully test-time domain adaptation performance.
arXiv Detail & Related papers (2024-08-26T02:33:47Z)
- Efficiently improving key weather variables forecasting by performing the guided iterative prediction in latent space [0.8885727065823155]
We propose an 'encoding-prediction-decoding' prediction network.
It can adaptively extract key-variable-related low-dimensional latent features from a larger set of input atmospheric variables.
We improve the HTA algorithm of [bi2023accurate] by inputting more time steps to enhance the temporal correlation between the prediction results and the input variables.
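A generic sketch of the "encoding-prediction-decoding" idea, not the paper's architecture: many input variables are compressed into a low-dimensional latent state, the prediction is iterated in latent space, and only the key variables are decoded. All module names and dimensions below are assumptions:

```python
import torch
import torch.nn as nn


class LatentIterativeForecaster(nn.Module):
    def __init__(self, n_in_vars: int, n_key_vars: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(n_in_vars, latent_dim)   # many variables -> low-dim latent
        self.predictor = nn.Sequential(                    # one latent-space time step
            nn.Linear(latent_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )
        self.decoder = nn.Linear(latent_dim, n_key_vars)   # decode only the key variables

    def forward(self, x: torch.Tensor, steps: int = 4) -> torch.Tensor:  # x: (batch, n_in_vars)
        z = self.encoder(x)
        outputs = []
        for _ in range(steps):              # iterate the prediction in latent space
            z = z + self.predictor(z)       # residual latent update
            outputs.append(self.decoder(z))
        return torch.stack(outputs, dim=1)  # (batch, steps, n_key_vars)


if __name__ == "__main__":
    model = LatentIterativeForecaster(n_in_vars=69, n_key_vars=5)
    print(model(torch.randn(4, 69)).shape)  # torch.Size([4, 4, 5])
```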
arXiv Detail & Related papers (2024-07-27T05:56:37Z)
- Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation [24.294049653744185]
In transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation.
We propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics.
Our proposed bi-level visual conditioning token learning method improves test-time adaptation performance by up to 1.9%.
arXiv Detail & Related papers (2024-06-27T17:16:23Z)
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token-selective attention approach that identifies tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
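An illustrative sketch, not ToSA's implementation, of token-selective attention: a lightweight scorer decides which tokens are routed through the attention layer while the rest skip it unchanged. The linear scorer, the fixed keep ratio, and top-k selection are assumptions:

```python
import torch
import torch.nn as nn


class TokenSelectiveAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # predicts per-token importance
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_tokens, d_model)
        scores = self.scorer(x).squeeze(-1)               # (B, N) importance scores
        k = max(1, int(x.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices               # tokens selected for attention
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx_exp)            # (B, k, d)
        attended, _ = self.attn(selected, selected, selected)  # attention on the subset only
        out = x.clone()                                   # unselected tokens skip the layer
        out.scatter_(1, idx_exp, attended)
        return out


if __name__ == "__main__":
    layer = TokenSelectiveAttention()
    print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```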
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dependencies across channels.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
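One speculative reading of the token blend idea mentioned above, in which coarser tokens are formed by pooling adjacent fine-resolution tokens and both scales are kept; the average-pooling choice and concatenation are assumptions, not CARD's actual design:

```python
import torch
import torch.nn as nn


class TokenBlend(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)  # merge adjacent token pairs
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (batch, n_tokens, d_model)
        coarse = self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, n_tokens // 2, d)
        return torch.cat([tokens, self.proj(coarse)], dim=1)        # keep fine + coarse tokens


if __name__ == "__main__":
    print(TokenBlend()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 24, 64])
```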
arXiv Detail & Related papers (2023-05-20T05:16:31Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.