Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
- URL: http://arxiv.org/abs/2511.13780v1
- Date: Sun, 16 Nov 2025 02:25:04 GMT
- Title: Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
- Authors: Nihal Mehta
- Abstract summary: We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
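To make the projection story concrete, here is a minimal numerical sketch of the abstract's pipeline as we read it: factor a co-occurrence matrix into GloVe-style embeddings, project those corpus-level statistics onto a specific token sequence to get symmetric attention weights, then break the symmetry with query/key maps. The matrix construction, the SVD factorization, and all names (`C`, `E`, `W_Q`, `W_K`, `W_V`) are illustrative assumptions, not the paper's actual derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16

# Toy symmetric co-occurrence matrix (stand-in for corpus counts).
C = rng.random((vocab, vocab))
C = C + C.T

# GloVe-style factorization: log-counts ~ E @ E.T, via truncated SVD.
U, s, _ = np.linalg.svd(np.log1p(C))
E = U[:, :d] * np.sqrt(s[:d])             # one embedding per vocab item

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

tokens = rng.integers(0, vocab, size=8)   # a toy "sentence"
X = E[tokens]                             # (seq_len, d)

# Projection of corpus statistics into sequence context: attention
# weights from raw embedding similarity. This form is symmetric in
# (i, j) up to the row-wise softmax normalization.
A_sym = softmax(X @ X.T / np.sqrt(d))

# Asymmetric extension: learned W_Q, W_K let "how much i attends to j"
# differ from "how much j attends to i"; W_V mixes the value space.
W_Q, W_K, W_V = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
A_qk = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))

out = A_qk @ (X @ W_V)                    # contextualized representations
print(A_sym.shape, A_qk.shape, out.shape) # (8, 8) (8, 8) (8, 16)
```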
Related papers
- Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning
Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). In this view, queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition of many bound structures (a minimal binding/unbinding sketch appears after this list).
arXiv Detail & Related papers (2025-12-08T05:38:24Z)
- Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning
This work investigates out-of-distribution (OOD) generalization in Transformer networks, using a GSM8K-style task of modular arithmetic on computational graphs as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.
arXiv Detail & Related papers (2025-10-15T21:03:59Z)
- A Mathematical Explanation of Transformers for Large Language Models and GPTs
We propose a novel continuous framework that interprets the Transformer as a discretization of a structured integro-differential equation. Within this formulation, the self-attention mechanism emerges naturally as a non-local integral operator (a schematic equation of this view appears after the list below). Our approach extends beyond previous theoretical analyses by embedding the entire Transformer operation in continuous domains.
arXiv Detail & Related papers (2025-10-05T01:16:08Z)
- Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models
We introduce a theoretical framework that characterizes the expressivity of Transformer-based models equipped with widely used pooling methods. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks.
arXiv Detail & Related papers (2025-10-02T11:17:24Z)
- Simplicial methods in the resource theory of contextuality
Building on the theory of simplicial distributions, we introduce event scenarios as a functorial generalization of presheaf-theoretic measurement scenarios. We define symmetric monoidal structures on these categories and extend the distribution functor to this setting, yielding a resource theory that generalizes the presheaf-theoretic notion of simulations.
arXiv Detail & Related papers (2025-05-29T21:14:55Z)
- Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
We study the task of panoptic symbol spotting in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on rasterization, graph construction, or point-based representation. We propose VecFormer, a novel method that addresses these challenges through a line-based representation of primitives.
arXiv Detail & Related papers (2025-05-29T12:33:11Z)
- Constrained belief updates explain geometric structures in transformer representations
We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models. Our analysis focuses on single-layer transformers, revealing how the first attention layer implements constrained belief updates. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail.
arXiv Detail & Related papers (2025-02-04T03:03:54Z)
- Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures
We develop a category-theoretic framework focusing on the linear components of self-attention. We show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para}(\mathbf{Vect})$. Stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor.
arXiv Detail & Related papers (2025-01-06T11:14:18Z)
- A Theoretical Analysis of Self-Supervised Learning for Vision Transformers
Masked autoencoders (MAE) and contrastive learning (CL) capture different types of representations. We study the training dynamics of one-layer softmax-based vision transformers (ViTs) on both MAE and CL objectives.
arXiv Detail & Related papers (2024-03-04T17:24:03Z)
- Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse Actions, Interventions and Sparse Temporal Dependencies
This work introduces a novel principle for disentanglement we call mechanism sparsity regularization.
We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graph that relates them.
We show that the latent factors can be recovered by regularizing the learned causal graph to be sparse.
arXiv Detail & Related papers (2024-01-10T02:38:21Z)
- HiPerformer: Hierarchically Permutation-Equivariant Transformer for Time Series Forecasting
We propose a hierarchically permutation-equivariant model that considers both the relationship among components in the same group and the relationship among groups.
The experiments conducted on real-world data demonstrate that the proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2023-05-14T05:11:52Z)
- Image Synthesis via Semantic Composition
We present a novel approach to synthesize realistic images based on their semantic layouts.
It hypothesizes that objects with similar appearance share similar representations.
Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations.
arXiv Detail & Related papers (2021-09-15T02:26:07Z)
- Target-Embedding Autoencoders for Supervised Representation Learning
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features and predictive of targets.
arXiv Detail & Related papers (2020-01-23T02:37:10Z)
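For the "Attention as Binding" entry above, a toy Vector Symbolic Architecture sketch may help fix intuition: bind role and filler vectors, superpose the bound pairs the way a residual stream accumulates structures, then recover a filler by unbinding with its role and cleaning up against known fillers. Hadamard binding over random bipolar vectors is one concrete VSA choice we assume here for illustration; the paper's algebra may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256

def rand_vec():
    # Random bipolar vectors are nearly orthogonal in high dimension.
    return rng.choice([-1.0, 1.0], size=d)

# Roles play a key-like part; fillers play a value-like part.
roles   = {name: rand_vec() for name in ("subj", "verb", "obj")}
fillers = {name: rand_vec() for name in ("subj", "verb", "obj")}

# Superposition of bound role-filler pairs, as a residual stream would
# accumulate them: sum_r role_r * filler_r (elementwise binding).
stream = sum(roles[r] * fillers[r] for r in roles)

# Unbinding: Hadamard binding with bipolar vectors is its own inverse,
# so querying with a role recovers its filler plus crosstalk noise.
recovered = stream * roles["verb"]

# Cleanup: score the noisy result against known fillers
# (an attention-like soft selection over candidates).
scores = {r: float(recovered @ fillers[r]) / d for r in fillers}
print(scores)   # "verb" scores near 1.0; the others near 0.0
```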
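For the continuous-framework entry ("A Mathematical Explanation of Transformers for Large Language Models and GPTs"), here is a schematic of self-attention as a non-local integral operator, written in our own notation; the cited paper's precise formulation may differ.

```latex
% Self-attention viewed as a non-local integral operator acting on a
% signal u over a domain Omega (schematic; notation assumed here):
\[
  (\mathcal{A}u)(x) = \int_{\Omega}
    \frac{\exp\bigl(\langle W_Q u(x),\, W_K u(y)\rangle / \sqrt{d}\bigr)}
         {\int_{\Omega} \exp\bigl(\langle W_Q u(x),\, W_K u(y')\rangle / \sqrt{d}\bigr)\,\mathrm{d}y'}
    \; W_V u(y)\,\mathrm{d}y .
\]
% Discretizing Omega into token positions recovers softmax attention.
```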