Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures
- URL: http://arxiv.org/abs/2501.02931v2
- Date: Tue, 14 Jan 2025 10:01:41 GMT
- Title: Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures
- Authors: Charles O'Neill,
- Abstract summary: We develop a category-theoretic framework focusing on the linear components of self-attention.
We show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para(Vect)}$.
Stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor.
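As a compact restatement of the constructions named above, the sketch below spells out the standard definitions we believe are in play; the concrete choice of parameter space $P$ and the colimit formula for the free monad are our assumptions, not the paper's verbatim statements.

```latex
% A parametric 1-morphism X -> Y in Para(Vect): a parameter space P together with a linear map
\[
  f \;\colon\; P \otimes X \longrightarrow Y,
  \qquad
  P \;=\; \operatorname{Hom}(X,X)^{\oplus 3} \ \text{carrying } (W_Q,\, W_K,\, W_V) \ \text{(assumed choice)}.
\]
% Forgetting parameters yields an endofunctor F on Vect; stacking layers iterates F, and,
% when F preserves the relevant colimits, the free monad on F is
\[
  F^{*}(X) \;\cong\; \bigoplus_{n \ge 0} F^{n}(X),
\]
% with unit the inclusion of the n = 0 summand and multiplication given by flattening
% iterated applications of F.
```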
- Abstract: Self-attention mechanisms have revolutionised deep learning architectures, yet their core mathematical structures remain incompletely understood. In this work, we develop a category-theoretic framework focusing on the linear components of self-attention. Specifically, we show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para(Vect)}$. On the underlying 1-category $\mathbf{Vect}$, these maps induce an endofunctor whose iterated composition precisely models multi-layer attention. We further prove that stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor. For positional encodings, we demonstrate that strictly additive embeddings correspond to monoid actions in an affine sense, while standard sinusoidal encodings, though not additive, retain a universal property among injective (faithful) position-preserving maps. We also establish that the linear portions of self-attention exhibit natural equivariance to permutations of input tokens, and show how the "circuits" identified in mechanistic interpretability can be interpreted as compositions of parametric 1-morphisms. This categorical perspective unifies geometric, algebraic, and interpretability-based approaches to transformer analysis, making explicit the underlying structures of attention. We restrict to linear maps throughout, deferring the treatment of nonlinearities such as softmax and layer normalisation, which require more advanced categorical constructions. Our results build on and extend recent work on category-theoretic foundations for deep learning, offering deeper insights into the algebraic structure of attention mechanisms.
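To make the linear-component claims concrete, here is a small numerical sketch (our own, not the authors' code): the Q/K/V weight matrices play the role of the parameter space, stacking layers iterates the induced map, and the final check verifies permutation equivariance. The dimensions are arbitrary; softmax and layer normalisation are omitted, as in the paper, and the unnormalised bilinear score is kept only so the layer mixes tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5          # model dimension, number of tokens (arbitrary)

def linear_attention_layer(params, X):
    """Linear/bilinear part of self-attention for a token matrix X of shape (n, d).

    params = (W_Q, W_K, W_V) is a point of the parameter space; softmax and
    layer normalisation are deliberately left out, mirroring the paper's scope.
    """
    W_Q, W_K, W_V = params
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(X.shape[-1])   # unnormalised attention scores
    return scores @ V                          # (n, d) output

# Stacking layers = iterated composition of the induced map on token matrices.
layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(3)]
X = rng.normal(size=(n, d))
Y = X
for params in layers:
    Y = linear_attention_layer(params, Y)

# Permutation equivariance: permuting tokens before a layer equals permuting after.
P = np.eye(n)[rng.permutation(n)]
lhs = linear_attention_layer(layers[0], P @ X)
rhs = P @ linear_attention_layer(layers[0], X)
print(np.allclose(lhs, rhs))  # True
```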
Related papers
- Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry [63.694184882697435]
Global Covariance Pooling (GCP) has been demonstrated to improve the performance of Deep Neural Networks (DNNs) by exploiting second-order statistics of high-level representations.
This paper provides a comprehensive and unified understanding of the matrix logarithm and power from a Riemannian geometry perspective.
arXiv Detail & Related papers (2024-07-15T07:11:44Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Understanding Imbalanced Semantic Segmentation Through Neural Collapse [81.89121711426951]
We show that semantic segmentation naturally involves contextual correlation and an imbalanced distribution among classes.
We introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure.
Our method ranks 1st and sets a new record on the ScanNet200 test leaderboard.
arXiv Detail & Related papers (2023-01-03T13:51:51Z) - Mathematical Foundations for a Compositional Account of the Bayesian Brain [0.0]
We use the tools of contemporary applied category theory to supply functorial semantics for approximate inference.
We define fibrations of statistical games and classify various problems of statistical inference as corresponding sections.
We construct functors which explain the compositional structure of predictive coding neural circuits under the free energy principle.
arXiv Detail & Related papers (2022-12-23T18:58:17Z) - Equivariance with Learned Canonicalization Functions [77.32483958400282]
We show that learning a small neural network to perform canonicalization is better than using predefined heuristics.
Our experiments show that learning the canonicalization function is competitive with existing techniques for learning equivariant functions across many tasks.
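A minimal sketch of the canonicalization recipe under our own simplifications (planar rotations only, and a hand-rolled principal-axis "canonicalizer" standing in for the small learned network): predict a group element from the input, undo it, and any downstream predictor becomes invariant by construction.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def canonicalize(points):
    """Stand-in for the learned canonicalization network: rotate the
    configuration so its principal axis lies along x (hand-rolled here)."""
    centered = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(centered.T @ centered)
    axis = vecs[:, -1]
    if np.sum((centered @ axis) ** 3) < 0:   # fix the eigenvector sign consistently
        axis = -axis
    theta = np.arctan2(axis[1], axis[0])
    return rotation(-theta)

def predictor(points):
    """Arbitrary, deliberately non-invariant function of the point set."""
    return float(np.sum(points[:, 0] ** 2 - points[:, 1]))

def canonicalized_predictor(points):
    return predictor(points @ canonicalize(points).T)   # undo the pose, then predict

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
g = rotation(0.7)                                        # an arbitrary input rotation
print(np.allclose(canonicalized_predictor(X), canonicalized_predictor(X @ g.T)))  # True
```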
arXiv Detail & Related papers (2022-11-11T21:58:15Z) - Deep Invertible Approximation of Topologically Rich Maps between Manifolds [17.60434807901964]
We show how to design neural networks that allow for stable universal approximation of maps between topologically interesting manifolds.
By exploiting the topological parallels between locally bilipschitz maps, covering spaces, and local homeomorphisms, we find that a novel network of the form $\mathcal{T} \circ p \circ \mathcal{E}$ is a universal approximator of local diffeomorphisms.
We also outline possible extensions of our architecture to address molecular imaging of molecules with symmetries.
arXiv Detail & Related papers (2022-10-02T17:14:43Z) - Intersection Regularization for Extracting Semantic Attributes [72.53481390411173]
We consider the problem of supervised classification in which the features the network extracts should match an unseen set of semantic attributes.
For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds.
We propose training a neural network with discrete top-level activations, which is followed by a multi-layered perceptron (MLP) and a parallel decision tree.
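A rough sketch of this wiring with our own stand-ins (a fixed random projection plus a sign threshold in place of the learned backbone with discrete top-level activations, and scikit-learn heads); it illustrates only how the two heads share a discrete code, not the paper's actual architecture or training objective.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: two classes in 20 dimensions.
X = rng.normal(size=(400, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stand-in "backbone": a fixed random projection whose sign gives discrete
# top-level activations (in the paper these are learned end to end).
W = rng.normal(size=(20, 16))
Z = (X @ W > 0).astype(float)               # binary, attribute-like code

# Two heads consume the same discrete code in parallel.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(Z, y)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Z, y)

print("MLP head accuracy: ", mlp.score(Z, y))
print("Tree head accuracy:", tree.score(Z, y))
# The shallow tree over binary activations is human-readable, which is the point
# of pushing the representation toward discrete, attribute-like features.
```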
arXiv Detail & Related papers (2021-03-22T14:32:44Z) - Categories of Brègman operations and epistemic (co)monads [0.0]
We construct a categorical framework for nonlinear postquantum inference, with embeddings of convex closed sets of suitable reflexive Banach spaces as objects.
It provides a nonlinear convex analytic analogue of Chencov's programme of study of categories of linear positive maps between spaces of states.
We show that the Brègmanian approach provides some special cases of this setting.
arXiv Detail & Related papers (2021-03-13T23:10:29Z) - Building powerful and equivariant graph neural networks with structural message-passing [74.93169425144755]
We propose a powerful and equivariant message-passing framework based on two ideas.
First, we propagate a one-hot encoding of the nodes, in addition to the features, in order to learn a local context matrix around each node.
Second, we propose methods for the parametrization of the message and update functions that ensure permutation equivariance.
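A drastically simplified numpy sketch of the two ideas (ours; no learning, sum aggregation, a single shared weight matrix): one-hot node identities are propagated alongside the features, so each node builds a local context of which nodes surround it, and the update is permutation-equivariant in the structured sense checked at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rounds = 6, 4, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                    # random undirected adjacency
X = rng.normal(size=(n, d))                        # node features
W = rng.normal(size=(d, d)) / np.sqrt(d)           # shared (untrained) update weights

ident, feats = np.eye(n), X.copy()                 # one-hot identities alongside features
for _ in range(rounds):
    ident = A @ ident                              # idea 1: propagate node identities
    feats = np.tanh(A @ feats @ W)                 # idea 2: equivariant sum + shared update
context = np.concatenate([ident, feats], axis=1)   # (n, n + d) local context per node

# Relabelling nodes by a permutation P sends ident -> P ident P^T and feats -> P feats,
# so nothing depends on the arbitrary node ordering.
P = np.eye(n)[rng.permutation(n)]
ident_p, feats_p, A_p = np.eye(n), P @ X, P @ A @ P.T
for _ in range(rounds):
    ident_p = A_p @ ident_p
    feats_p = np.tanh(A_p @ feats_p @ W)
print(np.allclose(ident_p, P @ ident @ P.T), np.allclose(feats_p, P @ feats))  # True True
```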
arXiv Detail & Related papers (2020-06-26T17:15:16Z) - Equivariant Maps for Hierarchical Structures [17.931059591895984]
We show that the symmetry of a hierarchical structure is the "wreath product" of the symmetries of its building blocks.
By voxelizing the point cloud, we impose a hierarchy of translation and permutation symmetries on the data (a minimal voxelization sketch follows this entry).
We report state-of-the-art results on Semantic3D, S3DIS, and vKITTI, which include some of the largest real-world point-cloud benchmarks.
arXiv Detail & Related papers (2020-06-05T18:42:12Z)
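The voxelization step lends itself to a short sketch (ours; voxel size, grid resolution, and mean pooling are arbitrary choices): points inside a voxel are pooled with a permutation-invariant reduction, and the resulting regular grid can then be processed by translation-equivariant convolutions, which is the informal content of the wreath-product hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(200, 3))       # toy point cloud in the unit cube
feats = rng.normal(size=(200, 5))                    # per-point features
voxel, grid_shape = 0.25, (4, 4, 4)                  # assumed voxel size / resolution

# Lower level: points within a voxel carry no intrinsic order, so pool them with a
# permutation-invariant reduction (here: the mean).
idx = np.minimum((points / voxel).astype(int), 3)    # voxel index of each point
grid = np.zeros(grid_shape + (5,))
counts = np.zeros(grid_shape)
for (i, j, k), f in zip(idx, feats):
    grid[i, j, k] += f
    counts[i, j, k] += 1
grid = grid / np.maximum(counts[..., None], 1)

# Upper level: the voxel grid is a regular lattice, so translation-equivariant
# convolutions over `grid` respect the remaining symmetry of the hierarchy.
print(grid.shape)   # (4, 4, 4, 5)
```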