Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling
- URL: http://arxiv.org/abs/2505.13027v1
- Date: Mon, 19 May 2025 12:11:13 GMT
- Title: Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling
- Authors: Zihan Gu, Han Zhang, Ruoyu Chen, Yue Hu, Hua Zhang
- Abstract summary: Positional encoding (PE) is essential for enabling Transformers to model sequential structure.
We present a unified framework that analyzes PE through the spectral properties of Toeplitz and related matrices.
We establish explicit content-relative mixing with relative-position Toeplitz signals as a key principle for effective PE design.
- Score: 10.931433906211534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Positional encoding (PE) is essential for enabling Transformers to model sequential structure. However, the mechanisms by which different PE schemes couple token content and positional information, and how these mechanisms influence model dynamics, remain theoretically underexplored. In this work, we present a unified framework that analyzes PE through the spectral properties of Toeplitz and related matrices derived from attention logits. We show that multiplicative content-position coupling, exemplified by Rotary Positional Encoding (RoPE) via a Hadamard product with a Toeplitz matrix, induces spectral contraction, which theoretically improves optimization stability and efficiency. Guided by this theory, we construct synthetic tasks that contrast content-position dependent and content-position independent settings, and evaluate a range of PE methods. Our experiments reveal strong alignment with theory: RoPE consistently outperforms other methods on position-sensitive tasks and induces "single-head deposit" patterns in early layers, indicating localized positional processing. Further analyses show that modifying the method and timing of PE coupling, such as MLA in DeepSeek-V3, can effectively mitigate this concentration. These results establish explicit content-relative mixing with relative-position Toeplitz signals as a key principle for effective PE design and provide new insight into how positional structure is integrated in Transformer architectures.
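The Hadamard-product view of RoPE described above can be checked numerically. The following minimal NumPy sketch (the toy sizes, the single 2-D rotary plane, and the additive baseline are illustrative assumptions, not the paper's construction) verifies that RoPE attention logits factor as the elementwise product of a pure-content matrix with a Toeplitz phase matrix, and prints the kind of spectral quantities the analysis is concerned with:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 16, 0.3  # toy sequence length and rotary frequency for one 2-D plane

# Per-token queries/keys for a single 2-D rotary plane, encoded as complex numbers.
q = rng.standard_normal(n) + 1j * rng.standard_normal(n)
k = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Direct RoPE: rotate token t by angle t*theta, then take real inner products.
pos = np.arange(n)
q_rot = q * np.exp(1j * theta * pos)
k_rot = k * np.exp(1j * theta * pos)
logits_rope = np.real(np.conj(q_rot)[:, None] * k_rot[None, :])

# Factored view: content matrix, Hadamard-multiplied by the Toeplitz phase matrix
# T[i, j] = exp(1j * (j - i) * theta), which is constant along each diagonal.
content = np.conj(q)[:, None] * k[None, :]
toeplitz_phase = np.exp(1j * theta * (pos[None, :] - pos[:, None]))
logits_factored = np.real(content * toeplitz_phase)

assert np.allclose(logits_rope, logits_factored)  # multiplicative content-position coupling

# Illustrative only: compare singular-value spread against an additive coupling
# that mixes the same content and Toeplitz signals by summation instead.
logits_additive = np.real(content) + np.real(toeplitz_phase)
for name, m in [("multiplicative (RoPE-style)", logits_factored),
                ("additive", logits_additive)]:
    s = np.linalg.svd(m, compute_uv=False)
    print(f"{name}: largest singular value {s[0]:.2f}, spread {s[0] / s[-1]:.1f}")
```

Full RoPE applies one such rotation per frequency plane and sums the results, so the same factorization holds plane by plane; the printed singular-value spread is only a toy illustration of the quantities behind the spectral-contraction claim, not a reproduction of the paper's results.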
Related papers
- Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency.
We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings.
CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
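The summary specifies only that the frequencies become head-specific and conditioned on token embeddings; one hypothetical realization of that idea, sketched in NumPy (the projection W_freq, the softplus, and the base-frequency scaling are assumptions for illustration, not CARoPE's actual parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, n_heads, d_half = 10, 32, 4, 4  # toy sizes, chosen for illustration

x = rng.standard_normal((n, d_model))                             # token embeddings
W_freq = 0.02 * rng.standard_normal((d_model, n_heads * d_half))  # hypothetical projection

base = 10000.0 ** (-np.arange(d_half) / d_half)                 # RoPE-style base frequencies
gate = np.log1p(np.exp(x @ W_freq)).reshape(n, n_heads, d_half) # softplus keeps them positive
freqs = gate * base                                  # token- and head-specific frequencies

angles = np.arange(n)[:, None, None] * freqs  # rotary angles per token, head, and plane
print(angles.shape)                           # (10, 4, 4)
```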
arXiv Detail & Related papers (2025-07-30T20:32:19Z)
- Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation (LRP) is one of the most promising approaches to explainability in deep learning.
We propose specialized, theoretically grounded LRP rules designed to propagate attributions across various positional encoding methods.
Our method significantly outperforms the state of the art in both vision and NLP explainability tasks.
arXiv Detail & Related papers (2025-06-02T18:07:55Z)
- LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers [0.0]
Positional embeddings play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation-invariant nature of self-attention.
Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings.
We propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies.
arXiv Detail & Related papers (2025-04-19T19:20:47Z)
- Manifestation of critical effects in environmental parameter estimation using a quantum sensor under dynamical control [0.0]
We investigate the emergence of critical behavior in the estimation of the environmental memory time $\tau_c$.
Our findings pave the way for adaptive control strategies aimed at enhancing precision in quantum parameter estimation.
arXiv Detail & Related papers (2025-04-11T08:42:29Z)
- Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models [55.46269953415811]
We identify ToM-sensitive parameters and show that perturbing as little as 0.001% of these parameters significantly degrades ToM performance.
Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.
arXiv Detail & Related papers (2025-04-05T17:45:42Z)
- Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes.
We introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z)
- Reward driven workflows for unsupervised explainable analysis of phases and ferroic variants from atomically resolved imaging data [14.907891992968361]
We show that a reward-driven approach can be used to optimize key hyperparameters in unsupervised ML methods.
This approach allows us to discover local descriptors that are best aligned with the specific physical behavior.
We also extend the reward-driven workflow to disentangling structural factors of variation via a variational autoencoder (VAE).
arXiv Detail & Related papers (2024-11-19T16:18:20Z)
- Beyond Position: the emergence of wavelet-like properties in Transformers [7.3645788720974465]
This paper studies how transformer models develop robust wavelet-like properties that effectively compensate for the theoretical limitations of Rotary Position Embeddings (RoPE).
We show that attention heads naturally evolve to implement multi-resolution processing analogous to wavelet transforms.
arXiv Detail & Related papers (2024-10-23T17:48:28Z)
- A Theoretical Analysis of Self-Supervised Learning for Vision Transformers [66.08606211686339]
Masked autoencoders (MAE) and contrastive learning (CL) capture different types of representations.
We study the training dynamics of one-layer softmax-based vision transformers (ViTs) on both MAE and CL objectives.
arXiv Detail & Related papers (2024-03-04T17:24:03Z)
- ASR: Attention-alike Structural Re-parameterization [53.019657810468026]
We propose a simple-yet-effective attention-alike structural re-parameterization (ASR) that allows us to achieve SRP for a given network while enjoying the effectiveness of the attention mechanism.
In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, the Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training.
arXiv Detail & Related papers (2023-04-13T08:52:34Z)
- PHN: Parallel heterogeneous network with soft gating for CTR prediction [2.9722444664527243]
This paper proposes a Parallel Heterogeneous Network (PHN) model, which constructs a network with a parallel structure.
Residual links with trainable parameters are used in the network to mitigate the influence of the weak-gradient phenomenon.
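The summary gives no further detail on these links; purely as an illustration, a residual link with a trainable elementwise weight on the branch output might look like the following sketch (the function name, gate shape, and values are assumptions, not PHN's actual design):

```python
import numpy as np

def gated_residual(x, branch_out, w):
    """Residual link with a trainable weight: the learned gate w scales the
    branch output before adding it back, so the skip path keeps gradients
    flowing even when the parallel branch contributes weakly."""
    return x + w * branch_out

x = np.ones(4)
branch_out = np.full(4, 0.1)             # output of one parallel branch
w = np.full(4, 0.5)                      # trainable in the real model; fixed here
print(gated_residual(x, branch_out, w))  # [1.05 1.05 1.05 1.05]
```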
arXiv Detail & Related papers (2022-06-18T11:37:53Z)
- Spectral Tensor Train Parameterization of Deep Learning Layers [136.4761580842396]
We study low-rank parameterizations of weight matrices with embedded spectral properties in the deep learning context.
We show the effects of neural network compression in the classification setting, and of both compression and improved training stability in the generative adversarial training setting.
arXiv Detail & Related papers (2021-03-07T00:15:44Z)
- Repulsive Mixture Models of Exponential Family PCA for Clustering [127.90219303669006]
The mixture extension of exponential family principal component analysis (EPCA) was designed to encode much more structural information about the data distribution than the traditional EPCA.
The traditional mixture of local EPCAs has the problem of model redundancy, i.e., overlaps among mixing components, which may cause ambiguity for data clustering.
In this paper, a repulsiveness-encouraging prior is introduced among mixing components and a diversified EPCA mixture (DEPCAM) model is developed in the Bayesian framework.
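The exact form of DEPCAM's prior is not given in this summary; a generic repulsiveness-encouraging log-prior over component means, of the kind such models often use, could look like the following sketch (the Gaussian-kernel penalty and the temperature tau are illustrative assumptions):

```python
import numpy as np

def repulsive_log_prior(means, tau=1.0):
    """Generic repulsiveness-encouraging log-prior over mixture component means:
    nearby components are penalized, making overlapping (redundant) components
    a priori unlikely. The Gaussian-kernel form is illustrative only."""
    k = len(means)
    penalty = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            penalty += np.exp(-np.sum((means[i] - means[j]) ** 2) / tau)
    return -penalty

close = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])   # two components overlap
spread = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # well separated
print(repulsive_log_prior(close) > repulsive_log_prior(spread))  # False: overlap is penalized
```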
arXiv Detail & Related papers (2020-04-07T04:07:29Z)