Understanding Token-level Topological Structures in Transformer-based Time Series Forecasting
- URL: http://arxiv.org/abs/2404.10337v4
- Date: Fri, 24 Oct 2025 08:14:46 GMT
- Title: Understanding Token-level Topological Structures in Transformer-based Time Series Forecasting
- Authors: Jianqi Zhang, Wenwen Qiang, Jingyao Wang, Jiahuan Zhou, Changwen Zheng, Hui Xiong
- Abstract summary: Transformer-based methods have achieved state-of-the-art performance in time series forecasting (TSF). It remains unclear whether existing Transformers fully leverage the intrinsic topological structure among tokens throughout intermediate layers. We propose the Topology Enhancement Method (TEM), a novel Transformer-based TSF method that explicitly and adaptively preserves token-level topology.
- Score: 52.364260925700485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based methods have achieved state-of-the-art performance in time series forecasting (TSF) by capturing positional and semantic topological relationships among input tokens. However, it remains unclear whether existing Transformers fully leverage the intrinsic topological structure among tokens throughout intermediate layers. Through empirical and theoretical analyses, we identify that current Transformer architectures progressively degrade the original positional and semantic topology of input tokens as the network deepens, thus limiting forecasting accuracy. Furthermore, our theoretical results demonstrate that explicitly enforcing preservation of these topological structures within intermediate layers can tighten generalization bounds, leading to improved forecasting performance. Motivated by these insights, we propose the Topology Enhancement Method (TEM), a novel Transformer-based TSF method that explicitly and adaptively preserves token-level topology. TEM consists of two core modules: 1) the Positional Topology Enhancement Module (PTEM), which injects learnable positional constraints to explicitly retain original positional topology; 2) the Semantic Topology Enhancement Module (STEM), which incorporates a learnable similarity matrix to preserve original semantic topology. To determine optimal injection weights adaptively, TEM employs a bi-level optimization strategy. The proposed TEM is a plug-and-play method that can be integrated with existing Transformer-based TSF methods. Extensive experiments demonstrate that integrating TEM with a variety of existing methods significantly improves their predictive performance, validating the effectiveness of explicitly preserving original token-level topology. Our code is publicly available at: https://github.com/jlu-phyComputer/TEM.
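The abstract describes PTEM and STEM only at a high level. As a rough, hedged illustration of the plug-and-play idea, the sketch below blends a Transformer block's output with a learnable positional term and a learnable similarity-weighted mixture of the block's input tokens. All names here (TopologyEnhancer, pos_bias, sim, alpha_pos, alpha_sem) are hypothetical stand-ins rather than the paper's actual modules, and the injection weights are plain learnable scalars instead of the bi-level-optimized weights used by TEM; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopologyEnhancer(nn.Module):
    """Hypothetical plug-in layer that re-injects token-level structure.

    It adds to a Transformer block's output (i) a learnable positional term,
    standing in for PTEM's positional constraints, and (ii) a similarity-weighted
    mixture of the block's *input* tokens, standing in for STEM's learnable
    similarity matrix. In the paper the injection weights are set by a bi-level
    optimization; here they are ordinary parameters for simplicity.
    """

    def __init__(self, num_tokens: int, d_model: int):
        super().__init__()
        self.pos_bias = nn.Parameter(torch.zeros(num_tokens, d_model))  # positional constraint (PTEM stand-in)
        self.sim = nn.Parameter(torch.eye(num_tokens))                  # learnable similarity matrix (STEM stand-in)
        self.alpha_pos = nn.Parameter(torch.tensor(0.1))                # injection weight, positional term
        self.alpha_sem = nn.Parameter(torch.tensor(0.1))                # injection weight, semantic term

    def forward(self, block_in: torch.Tensor, block_out: torch.Tensor) -> torch.Tensor:
        # block_in / block_out: (batch, num_tokens, d_model)
        sem_term = torch.einsum("ij,bjd->bid", F.softmax(self.sim, dim=-1), block_in)
        return block_out + self.alpha_pos * self.pos_bias + self.alpha_sem * sem_term


# Usage: wrap any existing Transformer block's input/output pair.
x = torch.randn(8, 16, 64)                      # (batch, tokens, features)
block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
enhancer = TopologyEnhancer(num_tokens=16, d_model=64)
y = enhancer(x, block(x))
print(y.shape)  # torch.Size([8, 16, 64])
```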
Related papers
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models [13.707653566827704]
Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions and capture either local token-level attributions or global attention patterns, without unifying the two. We propose a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients.
arXiv Detail & Related papers (2026-02-18T17:03:10Z) - Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z) - Graded Transformers [0.0]
We introduce the Graded Transformer framework, a new class of sequence models that embed inductive biases through grading on vector spaces. The framework enables adaptive feature prioritization, overcoming limitations of fixed grades in earlier models. The Graded Transformer provides a mathematically principled approach to hierarchical learning and neuro-symbolic reasoning.
arXiv Detail & Related papers (2025-07-27T02:34:08Z) - IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [70.2753541780788]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
arXiv Detail & Related papers (2025-06-16T08:28:19Z) - Learning to Insert [PAUSE] Tokens for Better Reasoning [6.823521786512908]
We introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on the MBPP dataset.
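As a loose illustration of the stated selection rule only (not the authors' training procedure), the snippet below picks the k positions with the lowest token log-likelihood and splices a hypothetical [PAUSE] token id in front of each; insert_pause_tokens and pause_id are made-up names for this sketch.

```python
def insert_pause_tokens(token_ids, token_logprobs, pause_id, k=2):
    """Insert a pause token before the k lowest-confidence positions.

    token_ids:      list[int], the original sequence
    token_logprobs: list[float], log-likelihood the model assigned to each token
    pause_id:       int, id of the (hypothetical) [PAUSE] token
    """
    # Positions where the model was least confident, by token log-likelihood.
    lowest = set(sorted(range(len(token_ids)), key=lambda i: token_logprobs[i])[:k])
    out = []
    for i, tok in enumerate(token_ids):
        if i in lowest:
            out.append(pause_id)   # give the model an extra "thinking" step here
        out.append(tok)
    return out

print(insert_pause_tokens([11, 42, 7, 99], [-0.1, -3.2, -0.4, -2.5], pause_id=0, k=2))
# [11, 0, 42, 7, 0, 99]
```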
arXiv Detail & Related papers (2025-06-04T06:48:41Z) - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation (LRP) is one of the most promising approaches to explainability in deep learning. We propose specialized, theoretically grounded LRP rules designed to propagate attributions across various positional encoding methods. Our method significantly outperforms the state of the art in both vision and NLP explainability tasks.
arXiv Detail & Related papers (2025-06-02T18:07:55Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, together with its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - ORIGAMI: A generative transformer architecture for predictions from semi-structured data [3.5639148953570836]
ORIGAMI is a transformer-based architecture that processes nested key/value pairs. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks.
arXiv Detail & Related papers (2024-12-23T07:21:17Z) - Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.
We present a novel framework for identifying these tokens through rollout sampling.
We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z) - Improving Next Tokens via Second-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture to predict the second-to-last token of a sequence of tokens.
Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
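To make the decomposition concrete under generic assumptions (this is not the paper's exact parameterization or training objective), the sketch below models the joint distribution over the next two tokens as a mixture of r rank-one terms, each a product of independent per-position distributions, and samples both tokens in one step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, r = 5, 3                               # tiny vocabulary, rank-3 decomposition

# Mixture weights and per-position factor distributions (all rows sum to 1).
w = rng.dirichlet(np.ones(r))                 # shape (r,)
p1 = rng.dirichlet(np.ones(vocab), size=r)    # P(token t+1 | component), shape (r, vocab)
p2 = rng.dirichlet(np.ones(vocab), size=r)    # P(token t+2 | component), shape (r, vocab)

# Joint over the next two tokens: sum_k w_k * p1_k(a) * p2_k(b)
joint = np.einsum("k,ka,kb->ab", w, p1, p2)
assert np.isclose(joint.sum(), 1.0)

# Sampling two tokens at once: pick a component, then sample each position independently.
k = rng.choice(r, p=w)
a = rng.choice(vocab, p=p1[k])
b = rng.choice(vocab, p=p2[k])
print(a, b)
```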
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation [59.18151483767509]
We introduce a dual-path token lifting for domain shift correction in test time adaptation.
We then perform dual-path lifting with interleaved token prediction and update between the path of domain shift tokens and the path of class tokens.
Experimental results on the benchmark datasets demonstrate that our proposed method significantly improves the online fully test-time domain adaptation performance.
arXiv Detail & Related papers (2024-08-26T02:33:47Z) - Efficiently improving key weather variables forecasting by performing the guided iterative prediction in latent space [0.8885727065823155]
We propose an 'encoding-prediction-decoding' prediction network.
It can adaptively extract key variable-related low-dimensional latent features from a larger set of input atmospheric variables.
We improve the HTA algorithm of Bi et al. (2023) by inputting more time steps to enhance the temporal correlation between the prediction results and the input variables.
arXiv Detail & Related papers (2024-07-27T05:56:37Z) - Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation [24.294049653744185]
In transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation.
We propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics.
Our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
arXiv Detail & Related papers (2024-06-27T17:16:23Z) - ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify the tokens that need to be attended to as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
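A highly simplified sketch of the routing idea, under assumed details: tokens are scored by a stand-in linear head, only the top-scoring subset goes through attention, and the rest bypass the layer unchanged. ToSA's actual selection criterion and training differ; SelectiveAttentionLayer and scorer are hypothetical names.

```python
import torch
import torch.nn as nn

class SelectiveAttentionLayer(nn.Module):
    """Toy token-selective layer: only the top-scoring tokens go through attention."""

    def __init__(self, d_model: int, nhead: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)      # per-token importance score (stand-in criterion)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.scorer(x).squeeze(-1)      # (b, n)
        idx = scores.topk(k, dim=1).indices      # tokens selected for attention
        sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(b, k, d))
        out, _ = self.attn(sel, sel, sel)        # attention over the selected tokens only
        y = x.clone()                            # unselected tokens skip the layer
        y.scatter_(1, idx.unsqueeze(-1).expand(b, k, d), out)
        return y

x = torch.randn(2, 8, 32)
print(SelectiveAttentionLayer(32, 4)(x).shape)   # torch.Size([2, 8, 32])
```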
arXiv Detail & Related papers (2024-06-13T05:17:21Z) - Language Models as Hierarchy Encoders [22.03504018330068]
We introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs).
Our method situates the output embedding space of pre-trained LMs within a Poincaré ball with a curvature that adapts to the embedding dimension.
We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines.
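For orientation, hyperbolic approaches of this kind typically measure distances in the Poincaré ball; the unit-curvature form is d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))), whereas HiTs adapt the curvature to the embedding dimension. A minimal numerical check of the standard formula:

```python
import math
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Distance in the unit-curvature Poincare ball (points must have norm < 1)."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Points near the boundary are far apart even when close in Euclidean terms,
# which is what lets the ball encode deep hierarchies in few dimensions.
print(poincare_distance(np.array([0.0, 0.0]), np.array([0.5, 0.0])))    # ~1.10 (Euclidean 0.5)
print(poincare_distance(np.array([0.90, 0.0]), np.array([0.95, 0.0])))  # ~0.72 (Euclidean 0.05)
```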
arXiv Detail & Related papers (2024-01-21T02:29:12Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture temporal correlations among signals.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z) - Guiding the PLMs with Semantic Anchors as Intermediate Supervision:
Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - ORCHARD: A Benchmark For Measuring Systematic Generalization of
Multi-Hierarchical Reasoning [8.004425059996963]
We show that Transformer and LSTM models surprisingly fail in systematic generalization.
We also show that with increased references between hierarchies, Transformer performs no better than random.
arXiv Detail & Related papers (2021-11-28T03:11:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.