Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization
- URL: http://arxiv.org/abs/2506.06398v1
- Date: Thu, 05 Jun 2025 23:02:18 GMT
- Title: Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization
- Authors: Yin Li
- Abstract summary: Positional encodings are a core part of transformer-based models. This paper analyzes how various positional encoding methods impact a transformer's expressiveness, generalization ability, and extrapolation to longer sequences.
- Score: 10.034655199520168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Positional encodings are a core part of transformer-based models, enabling processing of sequential data without recurrence. This paper presents a theoretical framework to analyze how various positional encoding methods, including sinusoidal, learned, relative, and bias-based methods like Attention with Linear Biases (ALiBi), impact a transformer's expressiveness, generalization ability, and extrapolation to longer sequences. Expressiveness is defined via function approximation, generalization bounds are established using Rademacher complexity, and new encoding methods based on orthogonal functions, such as wavelets and Legendre polynomials, are proposed. The extrapolation capacity of existing and proposed encodings is analyzed, extending ALiBi's biasing approach to a unified theoretical context. Experimental evaluation on synthetic sequence-to-sequence tasks shows that orthogonal transform-based encodings outperform traditional sinusoidal encodings in generalization and extrapolation. This work addresses a critical gap in transformer theory, providing insights for design choices in natural language processing, computer vision, and other transformer applications.
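To make the comparison concrete, here is a minimal NumPy sketch of the three encoding families named in the abstract: fixed sinusoidal embeddings, ALiBi-style linear attention biases, and an orthogonal-polynomial encoding. The `legendre_encoding` construction (positions rescaled to [-1, 1], one Legendre basis polynomial per embedding dimension) is an illustrative assumption rather than the paper's exact definition, the wavelet variant is omitted, and causal masking for ALiBi would be applied separately.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre


def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    angles = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe


def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """ALiBi-style bias: a head-specific linear penalty on query-key distance,
    added to the attention logits instead of the token embeddings."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)   # geometric slopes
    dist = -np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return slopes[:, None, None] * dist               # (n_heads, seq_len, seq_len)


def legendre_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Hypothetical orthogonal-function encoding: evaluate the first d_model
    Legendre basis polynomials at positions rescaled to [-1, 1]."""
    x = np.linspace(-1.0, 1.0, seq_len)
    return np.stack([Legendre.basis(k)(x) for k in range(d_model)], axis=1)


# Sinusoidal/Legendre encodings are added to token embeddings; the ALiBi bias
# is added to the attention logits. Longer test sequences only extend the
# position axis, which is what the extrapolation experiments probe.
pe = sinusoidal_encoding(seq_len=128, d_model=64)     # (128, 64)
bias = alibi_bias(seq_len=128, n_heads=8)             # (8, 128, 128)
ortho = legendre_encoding(seq_len=128, d_model=64)    # (128, 64)
```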
Related papers
- On the Existence of Universal Simulators of Attention [17.01811978811789]
We present solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP. Our proofs, for the first time, show the existence of an algorithmically achievable data-agnostic solution, previously known to be approximated only by learning.
arXiv Detail & Related papers (2025-06-23T15:15:25Z)
- Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. We propose TEGA, a logic-aware architecture that significantly improves the performance in first-order logical entailment.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
- What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z)
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Transduce: learning transduction grammars for string transformation [0.0]
A new algorithm, Transduce, is proposed to learn positional transformations from one or two positive examples without inductive bias.
We experimentally demonstrate that Transduce learns these transformations efficiently, achieving a success rate higher than the current state of the art.
arXiv Detail & Related papers (2023-12-14T07:59:02Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Linearized Relative Positional Encoding [43.898057545832366]
Relative positional encoding is widely used in vanilla and linear transformers to represent positional information.
We unify a variety of existing linear relative positional encoding approaches under a canonical form.
We further propose a family of linear relative positional encoding algorithms via unitary transformation.
arXiv Detail & Related papers (2023-07-18T13:56:43Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Transformer Meets Boundary Value Inverse Problems [4.165221477234755]
A Transformer-based deep direct sampling method is proposed for solving a class of boundary value inverse problems.
A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and reconstructed images.
arXiv Detail & Related papers (2022-09-29T17:45:25Z)
- Error Correction Code Transformer [92.10654749898927]
We propose, for the first time, to extend the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We embed each channel output into a high-dimensional representation so that the bit information can be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)