On the Limitations and Capabilities of Position Embeddings for Length Generalization
- URL: http://arxiv.org/abs/2510.04130v1
- Date: Sun, 05 Oct 2025 10:08:33 GMT
- Title: On the Limitations and Capabilities of Position Embeddings for Length Generalization
- Authors: Yang Chen, Yitao Liang, Zhouchen Lin
- Abstract summary: We study the limitations and capabilities of Position Embeddings (PEs) in achieving Length Generalization (LG) performance. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. We propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales.
- Score: 64.50857363288598
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.
Related papers
- YuriiFormer: A Suite of Nesterov-Accelerated Transformers [62.40952219538543]
We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energies.
arXiv Detail & Related papers (2026-01-30T18:06:21Z)
- Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression [5.86461706751327]
We provide the first generalization analysis for a single-layer Transformer under in-context regression. Our result shows that PE systematically enlarges the generalization gap. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models.
arXiv Detail & Related papers (2025-12-10T02:55:19Z)
- Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions [32.71332125930795]
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). We investigate Markovian function learning through a structured ICL setup to reveal underlying optimization behaviors.
arXiv Detail & Related papers (2025-10-21T13:42:48Z)
- Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation is one of the most promising approaches to explainability in deep learning. We propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods. Our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks.
arXiv Detail & Related papers (2025-06-02T18:07:55Z)
- Born a Transformer -- Always a Transformer? [57.37263095476691]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
- Massively Scaling Explicit Policy-conditioned Value Functions [16.387595437722613]
We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs). EPVFs learn a value function V(theta) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. We show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines.
arXiv Detail & Related papers (2025-02-17T16:02:54Z)
- Analyzing limits for in-context learning [2.1178416840822027]
We study in-context learning (ICL) in transformer models trained from scratch, focusing on function normalization tasks as a controlled setting to uncover fundamental behaviors. We show empirically that transformer models can generalize, approximating unseen classes of normalization (nonlinear) functions, but they cannot generalize beyond certain values.
arXiv Detail & Related papers (2025-02-05T11:03:36Z)
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer does not always lead to enhanced performance. We present a theoretical framework that sheds light on the memorization during pre-training of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion [44.4195350090039]
This paper presents a theoretical analysis of the graph p-Laplacian regularized framelet network (pL-UFG).
We conduct a convergence analysis on pL-UFG, addressing the gap in the understanding of its behaviors.
We propose two novel pL-UFG models with manually controlled energy dynamics.
arXiv Detail & Related papers (2023-05-25T01:36:34Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)