Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding
- URL: http://arxiv.org/abs/2512.03058v1
- Date: Tue, 25 Nov 2025 19:39:57 GMT
- Title: Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding
- Authors: Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo
- Abstract summary: We characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding.
- Score: 5.2482659629416535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.
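The abstract analyzes the continuous-time limit of self-attention, in which each token follows an ODE driven by the attention-weighted values of the other tokens, with absolute or rotary positional encoding applied to queries and keys. Below is a minimal simulation sketch of such dynamics. The specific form dx_i/dt = sum_j softmax_j(<Q x_i, K x_j>) V x_j, the Euler step, the rotary transform, and all parameter choices (e.g. V = -I to induce a contraction) are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's model): Euler integration of a
# continuous-time self-attention system, with an optional rotary (RoPE-style)
# transform applied to queries and keys. All names and parameters are assumptions.
import numpy as np

def rotary(x, pos, base=10000.0):
    """Rotate the two halves of x by position-dependent angles (RoPE-style)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:2 * half]
    out = x.copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:2 * half] = x1 * sin + x2 * cos
    return out

def token_dynamics(X0, Q, K, V, steps=500, dt=0.01, use_rope=False):
    """Euler-integrate dx_i/dt = sum_j softmax_j(<Q x_i, K x_j>) V x_j."""
    X = X0.copy()
    traj = [X.copy()]
    n = X.shape[0]
    for _ in range(steps):
        q = np.array([rotary(Q @ X[i], i) if use_rope else Q @ X[i] for i in range(n)])
        k = np.array([rotary(K @ X[j], j) if use_rope else K @ X[j] for j in range(n)])
        logits = q @ k.T                              # (n, n) attention scores
        A = np.exp(logits - logits.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
        X = X + dt * (A @ (X @ V.T))                  # attention-weighted value update
        traj.append(X.copy())
    return np.stack(traj)

rng = np.random.default_rng(0)
n, d = 6, 4
X0 = rng.normal(size=(n, d))
Q = rng.normal(size=(d, d)) / np.sqrt(d)
K = rng.normal(size=(d, d)) / np.sqrt(d)
V = -np.eye(d)   # value map with negative spectrum: tokens contract toward zero
traj = token_dynamics(X0, Q, K, V, use_rope=True)
print("initial mean norm:", np.linalg.norm(X0, axis=1).mean())
print("final   mean norm:", np.linalg.norm(traj[-1], axis=1).mean())
```

In this toy simulation, a value matrix with negative spectrum drives token norms toward zero, loosely mirroring the convergence scenario the abstract reports as harmful to performance; flipping the sign of V makes the norms grow instead.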
Related papers
- Task-Level Insights from Eigenvalues across Sequence Models [41.79939327722031]
We show that eigenvalues influence essential aspects of memory and long-range dependency modeling. We then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.
arXiv Detail & Related papers (2025-10-10T13:35:21Z)
- Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [57.19302613163439]
We introduce neural network reprogrammability as a unifying framework for model adaptation. We present a taxonomy that categorizes such information manipulation approaches across four key dimensions. We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
- Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning [30.781578037476347]
We introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets.
arXiv Detail & Related papers (2025-03-03T09:12:14Z)
- Beyond Position: the emergence of wavelet-like properties in Transformers [6.552700667389349]
We show that attention heads evolve to implement multi-resolution processing analogous to wavelet transforms. Our findings suggest that the effectiveness of modern Transformers stems from their remarkable ability to spontaneously develop optimal, multi-resolution decompositions.
arXiv Detail & Related papers (2024-10-23T17:48:28Z)
- Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences. We train the model using maximum likelihood estimation with Markov chain Monte Carlo. Experimental results on oscillating systems, videos, and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
arXiv Detail & Related papers (2024-09-05T18:14:22Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
Transformer-based models have demonstrated exceptional performance across diverse domains.
Little is known about how they operate or what their expected dynamics are.
We use methods for the study of asymmetric Hopfield networks in nonequilibrium regimes.
arXiv Detail & Related papers (2024-06-11T13:29:34Z)
- Learning minimal representations of stochastic processes with variational autoencoders [52.99137594502433]
We introduce an unsupervised machine learning approach to determine the minimal set of parameters required to describe a process.
Our approach enables the autonomous discovery of unknown parameters describing the processes.
arXiv Detail & Related papers (2023-07-21T14:25:06Z)
- Impact of conditional modelling for a universal autoregressive quantum state [0.0]
We introduce filters as analogues to convolutional layers in neural networks to incorporate translationally symmetrized correlations in arbitrary quantum states.
We analyze the impact of the resulting inductive biases on variational flexibility, symmetries, and conserved quantities.
arXiv Detail & Related papers (2023-06-09T14:17:32Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experimental results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization [60.73540999409032]
We show that expressive autoregressive dynamics models generate different dimensions of the next state and reward sequentially conditioned on previous dimensions.
We also show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer.
arXiv Detail & Related papers (2021-04-28T16:48:44Z)
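The last entry above describes autoregressive dynamics models that emit each dimension of the next state, and finally the reward, one at a time, each conditioned on the current state, the action, and the dimensions already generated. The sketch below illustrates that factorization with simple linear heads; the class name, head parameterization, and noise scale are assumptions for illustration, not the cited paper's architecture.

```python
# Illustrative sketch (assumption, not the cited paper's implementation): an
# autoregressive dynamics model that predicts next-state dimensions and the
# reward sequentially, conditioning each head on what was generated so far.
import numpy as np

class AutoregressiveDynamicsModel:
    def __init__(self, state_dim, action_dim, rng=None):
        self.state_dim = state_dim
        self.rng = rng or np.random.default_rng(0)
        # One linear head per output dimension; head d sees
        # (state, action, next_state[:d]) as its input.
        self.heads = [
            self.rng.normal(scale=0.1, size=state_dim + action_dim + d)
            for d in range(state_dim + 1)          # +1 head for the reward
        ]

    def sample(self, state, action):
        """Generate next_state dimension by dimension, then the reward."""
        generated = []
        for d in range(self.state_dim + 1):
            context = np.concatenate([state, action, np.array(generated)])
            mean = float(self.heads[d] @ context)
            generated.append(mean + 0.01 * self.rng.normal())  # sample around the mean
        return np.array(generated[:-1]), generated[-1]          # (next_state, reward)

model = AutoregressiveDynamicsModel(state_dim=3, action_dim=2)
next_state, reward = model.sample(np.zeros(3), np.array([0.5, -0.5]))
print(next_state, reward)
```

Sampled transitions from such a model could then be added to a replay buffer, matching the offline-policy-optimization use mentioned in the summary.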
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.