RiemannFormer: A Framework for Attention in Curved Spaces
- URL: http://arxiv.org/abs/2506.07405v1
- Date: Mon, 09 Jun 2025 03:56:18 GMT
- Title: RiemannFormer: A Framework for Attention in Curved Spaces
- Authors: Zhongping Ji
- Abstract summary: This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers.
- Score: 0.43512163406552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers. In our framework, attention mainly involves metric tensors, tangent spaces, inner products, and how they relate to each other. These quantities and structures at discrete positions are intricately interconnected via the parallel transport of tangent vectors. To make the learning process more efficient, we reduce the number of parameters through carefully predefined configurations. Moreover, we introduce an explicit mechanism that highlights a neighborhood by attenuating remote values, since transformers inherently lack a local inductive bias. Experimental results demonstrate that our modules deliver significant performance improvements relative to the baseline. Further evaluation experiments on vision and large language models will follow.
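The abstract describes two mechanisms: attention scores built from metric tensors and inner products on tangent spaces, and an explicit attenuation of remote values to restore a local inductive bias. The paper's own implementation is not reproduced here; the following is a minimal PyTorch sketch of the general idea only. The class name `MetricAttention`, the low-rank metric parameterization, and the distance-based decay term are illustrative assumptions rather than the authors' design, and parallel transport is omitted entirely.

```python
# Hypothetical sketch: attention scored with a learned metric tensor plus a
# locality attenuation on remote positions. NOT the RiemannFormer authors' code;
# the parameterization below is an assumption made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Per-head factor A; M = A^T A is symmetric positive semidefinite, so
        # <q, k>_M = (A q) . (A k) is a valid (possibly degenerate) inner product.
        self.metric_factor = nn.Parameter(
            torch.randn(n_heads, self.head_dim, self.head_dim) * 0.02)
        # Learnable per-head decay rate controlling how fast remote values fade.
        self.decay = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)          # (B, H, T, head_dim)
        # Apply the metric factor so the dot product becomes q^T (A^T A) k.
        A = self.metric_factor
        q = torch.einsum("bhtd,hed->bhte", q, A)
        k = torch.einsum("bhtd,hed->bhte", k, A)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, H, T, T)
        # Locality bias: penalize attention to remote positions by |i - j|.
        pos = torch.arange(T, device=x.device)
        dist = (pos[:, None] - pos[None, :]).abs().float()        # (T, T)
        scores = scores - F.softplus(self.decay)[None, :, None, None] * dist
        attn = scores.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(y)

# Example usage (shapes only):
# layer = MetricAttention(dim=64, n_heads=4)
# y = layer(torch.randn(2, 16, 64))   # -> (2, 16, 64)
```

The low-rank factorization keeps the learned metric positive semidefinite by construction, and the softplus on the decay keeps the attenuation non-negative; both are design choices made for this sketch, not claims about the paper.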
Related papers
- On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. They can be disrupted by factually correct context, a phenomenon known as context hijacking. We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z) - A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z) - Dynamics of Transient Structure in In-Context Linear Regression Transformers [0.5242869847419834]
We show that when transformers are trained on in-context linear regression tasks with intermediate task diversity, they behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
arXiv Detail & Related papers (2025-01-29T16:32:14Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Transformer Block Coupling and its Correlation with Generalization in LLMs [3.007031501305338]
We analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. We uncover the phenomenon of transformer block coupling in a multitude of Large Language Models, characterized by the coupling of their top singular vectors across tokens and depth. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories.
arXiv Detail & Related papers (2024-07-10T16:30:27Z) - Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling [10.246977481606427]
We study the mechanisms through which different components of Transformer, such as the dot-product self-attention, affect its expressive power.
Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads.
arXiv Detail & Related papers (2024-02-01T11:43:13Z) - Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces.
We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in time and memory cost linear to the number of nodes and edges.
arXiv Detail & Related papers (2023-09-08T02:44:37Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z) - Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
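The last entry above mentions a doubly-normalized attention scheme intended to avoid the "explaining away" effect. Below is a minimal sketch of one plausible reading of that idea, normalizing scores first over the query axis and then over the key axis; the function name and the exact two-step normalization are assumptions made for illustration, not the cited paper's precise formulation.

```python
# Illustrative sketch of a doubly-normalized attention variant (an assumed
# reading of the idea, not the cited paper's exact scheme).
import torch

def doubly_normalized_attention(q, k, v, eps: float = 1e-9):
    # q, k, v: (batch, seq, dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (B, Tq, Tk)
    # Step 1: softmax over the query axis, so each key distributes one unit of attention.
    over_queries = scores.softmax(dim=-2)
    # Step 2: renormalize each query row so its weights again sum to one.
    attn = over_queries / (over_queries.sum(dim=-1, keepdim=True) + eps)
    return attn @ v

# Example usage (shapes only):
# q = k = v = torch.randn(2, 16, 64)
# y = doubly_normalized_attention(q, k, v)   # -> (2, 16, 64)
```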