Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures
- URL: http://arxiv.org/abs/2501.02931v2
- Date: Tue, 14 Jan 2025 10:01:41 GMT
- Title: Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures
- Authors: Charles O'Neill,
- Abstract summary: We develop a category-theoretic framework focusing on the linear components of self-attention.
We show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para(Vect)}$.
Stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor.
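As a compact restatement of the constructions named above, the sketch below spells out the standard definitions we believe are in play; the concrete choice of parameter space $P$ and the colimit formula for the free monad are our assumptions, not the paper's verbatim statements.

```latex
% A parametric 1-morphism X -> Y in Para(Vect): a parameter space P together with a linear map
\[
  f \;\colon\; P \otimes X \longrightarrow Y,
  \qquad
  P \;=\; \operatorname{Hom}(X,X)^{\oplus 3} \ \text{carrying } (W_Q,\, W_K,\, W_V) \ \text{(assumed choice)}.
\]
% Forgetting parameters yields an endofunctor F on Vect; stacking layers iterates F, and,
% when F preserves the relevant colimits, the free monad on F is
\[
  F^{*}(X) \;\cong\; \bigoplus_{n \ge 0} F^{n}(X),
\]
% with unit the inclusion of the n = 0 summand and multiplication given by flattening
% iterated applications of F.
```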
- Abstract: Self-attention mechanisms have revolutionised deep learning architectures, yet their core mathematical structures remain incompletely understood. In this work, we develop a category-theoretic framework focusing on the linear components of self-attention. Specifically, we show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para(Vect)}$. On the underlying 1-category $\mathbf{Vect}$, these maps induce an endofunctor whose iterated composition precisely models multi-layer attention. We further prove that stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor. For positional encodings, we demonstrate that strictly additive embeddings correspond to monoid actions in an affine sense, while standard sinusoidal encodings, though not additive, retain a universal property among injective (faithful) position-preserving maps. We also establish that the linear portions of self-attention exhibit natural equivariance to permutations of input tokens, and show how the "circuits" identified in mechanistic interpretability can be interpreted as compositions of parametric 1-morphisms. This categorical perspective unifies geometric, algebraic, and interpretability-based approaches to transformer analysis, making explicit the underlying structures of attention. We restrict to linear maps throughout, deferring the treatment of nonlinearities such as softmax and layer normalisation, which require more advanced categorical constructions. Our results build on and extend recent work on category-theoretic foundations for deep learning, offering deeper insights into the algebraic structure of attention mechanisms.
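To make the linear-component claims concrete, here is a small numerical sketch (our own, not the authors' code): the Q/K/V weight matrices play the role of the parameter space, stacking layers iterates the induced map, and the final check verifies permutation equivariance. The dimensions are arbitrary; softmax and layer normalisation are omitted, as in the paper, and the unnormalised bilinear score is kept only so the layer mixes tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5          # model dimension, number of tokens (arbitrary)

def linear_attention_layer(params, X):
    """Linear/bilinear part of self-attention for a token matrix X of shape (n, d).

    params = (W_Q, W_K, W_V) is a point of the parameter space; softmax and
    layer normalisation are deliberately left out, mirroring the paper's scope.
    """
    W_Q, W_K, W_V = params
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(X.shape[-1])   # unnormalised attention scores
    return scores @ V                          # (n, d) output

# Stacking layers = iterated composition of the induced map on token matrices.
layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(3)]
X = rng.normal(size=(n, d))
Y = X
for params in layers:
    Y = linear_attention_layer(params, Y)

# Permutation equivariance: permuting tokens before a layer equals permuting after.
P = np.eye(n)[rng.permutation(n)]
lhs = linear_attention_layer(layers[0], P @ X)
rhs = P @ linear_attention_layer(layers[0], X)
print(np.allclose(lhs, rhs))  # True
```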
Related papers
- Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry [63.694184882697435]
Global Covariance Pooling (GCP) has been demonstrated to improve the performance of Deep Neural Networks (DNNs) by exploiting second-order statistics of high-level representations.
This paper provides a comprehensive and unified understanding of the matrix logarithm and power from a Riemannian geometry perspective.
arXiv Detail & Related papers (2024-07-15T07:11:44Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Understanding Imbalanced Semantic Segmentation Through Neural Collapse [81.89121711426951]
We show that semantic segmentation naturally involves contextual correlation and an imbalanced distribution among classes.
We introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure.
Our method ranks 1st and sets a new record on the ScanNet200 test leaderboard.
arXiv Detail & Related papers (2023-01-03T13:51:51Z) - Mathematical Foundations for a Compositional Account of the Bayesian Brain [0.0]
We use the tools of contemporary applied category theory to supply functorial semantics for approximate inference.
We define fibrations of statistical games and classify various problems of statistical inference as corresponding sections.
We construct functors which explain the compositional structure of predictive coding neural circuits under the free energy principle.
arXiv Detail & Related papers (2022-12-23T18:58:17Z) - Equivariance with Learned Canonicalization Functions [77.32483958400282]
We show that learning a small neural network to perform canonicalization is better than using predefined heuristics.
Our experiments show that learning the canonicalization function is competitive with existing techniques for learning equivariant functions across many tasks.
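A minimal sketch of the canonicalization recipe under our own simplifications (planar rotations only, and a hand-rolled principal-axis "canonicalizer" standing in for the small learned network): predict a group element from the input, undo it, and any downstream predictor becomes invariant by construction.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def canonicalize(points):
    """Stand-in for the learned canonicalization network: rotate the
    configuration so its principal axis lies along x (hand-rolled here)."""
    centered = points - points.mean(axis=0)
    _, vecs = np.linalg.eigh(centered.T @ centered)
    axis = vecs[:, -1]
    if np.sum((centered @ axis) ** 3) < 0:   # fix the eigenvector sign consistently
        axis = -axis
    theta = np.arctan2(axis[1], axis[0])
    return rotation(-theta)

def predictor(points):
    """Arbitrary, deliberately non-invariant function of the point set."""
    return float(np.sum(points[:, 0] ** 2 - points[:, 1]))

def canonicalized_predictor(points):
    return predictor(points @ canonicalize(points).T)   # undo the pose, then predict

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
g = rotation(0.7)                                        # an arbitrary input rotation
print(np.allclose(canonicalized_predictor(X), canonicalized_predictor(X @ g.T)))  # True
```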
arXiv Detail & Related papers (2022-11-11T21:58:15Z) - Deep Invertible Approximation of Topologically Rich Maps between Manifolds [17.60434807901964]
We show how to design neural networks that allow for stable universal approximation of maps between topologically interesting manifolds.
By exploiting the topological parallels between locally bilipschitz maps, covering spaces, and local homeomorphisms, we find that a novel network of the form $\mathcal{T} \circ p \circ \mathcal{E}$ is a universal approximator of local diffeomorphisms.
We also outline possible extensions of our architecture to address molecular imaging of molecules with symmetries.
arXiv Detail & Related papers (2022-10-02T17:14:43Z) - Intersection Regularization for Extracting Semantic Attributes [72.53481390411173]
We consider the problem of supervised classification in which the features the network extracts should match an unseen set of semantic attributes.
For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds.
We propose training a neural network with discrete top-level activations, which is followed by a multi-layered perceptron (MLP) and a parallel decision tree.
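A rough sketch of this wiring with our own stand-ins (a fixed random projection plus a sign threshold in place of the learned backbone with discrete top-level activations, and scikit-learn heads); it illustrates only how the two heads share a discrete code, not the paper's actual architecture or training objective.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: two classes in 20 dimensions.
X = rng.normal(size=(400, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stand-in "backbone": a fixed random projection whose sign gives discrete
# top-level activations (in the paper these are learned end to end).
W = rng.normal(size=(20, 16))
Z = (X @ W > 0).astype(float)               # binary, attribute-like code

# Two heads consume the same discrete code in parallel.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(Z, y)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Z, y)

print("MLP head accuracy: ", mlp.score(Z, y))
print("Tree head accuracy:", tree.score(Z, y))
# The shallow tree over binary activations is human-readable, which is the point
# of pushing the representation toward discrete, attribute-like features.
```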
arXiv Detail & Related papers (2021-03-22T14:32:44Z) - Categories of Brègman operations and epistemic (co)monads [0.0]
We construct a categorical framework for nonlinear postquantum inference, with embeddings of convex closed sets of suitable reflexive Banach spaces as objects.
It provides a nonlinear convex analytic analogue of Chencov's programme of study of categories of linear positive maps between spaces of states.
We show that the Brègmanian approach provides some special cases of this setting.
arXiv Detail & Related papers (2021-03-13T23:10:29Z) - Building powerful and equivariant graph neural networks with structural message-passing [74.93169425144755]
We propose a powerful and equivariant message-passing framework based on two ideas.
First, we propagate a one-hot encoding of the nodes, in addition to the features, in order to learn a local context matrix around each node.
Second, we propose methods for the parametrization of the message and update functions that ensure permutation equivariance.
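A drastically simplified numpy sketch of the two ideas (ours; no learning, sum aggregation, a single shared weight matrix): one-hot node identities are propagated alongside the features, so each node builds a local context of which nodes surround it, and the update is permutation-equivariant in the structured sense checked at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rounds = 6, 4, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                    # random undirected adjacency
X = rng.normal(size=(n, d))                        # node features
W = rng.normal(size=(d, d)) / np.sqrt(d)           # shared (untrained) update weights

ident, feats = np.eye(n), X.copy()                 # one-hot identities alongside features
for _ in range(rounds):
    ident = A @ ident                              # idea 1: propagate node identities
    feats = np.tanh(A @ feats @ W)                 # idea 2: equivariant sum + shared update
context = np.concatenate([ident, feats], axis=1)   # (n, n + d) local context per node

# Relabelling nodes by a permutation P sends ident -> P ident P^T and feats -> P feats,
# so nothing depends on the arbitrary node ordering.
P = np.eye(n)[rng.permutation(n)]
ident_p, feats_p, A_p = np.eye(n), P @ X, P @ A @ P.T
for _ in range(rounds):
    ident_p = A_p @ ident_p
    feats_p = np.tanh(A_p @ feats_p @ W)
print(np.allclose(ident_p, P @ ident @ P.T), np.allclose(feats_p, P @ feats))  # True True
```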
arXiv Detail & Related papers (2020-06-26T17:15:16Z) - Equivariant Maps for Hierarchical Structures [17.931059591895984]
We show that the symmetry of a hierarchical structure is the "wreath product" of the symmetries of its building blocks.
By voxelizing the point cloud, we impose a hierarchy of translation and permutation symmetries on the data (a minimal voxelization sketch follows this entry).
We report state-of-the-art results on Semantic3D, S3DIS, and vKITTI, which include some of the largest real-world point-cloud benchmarks.
arXiv Detail & Related papers (2020-06-05T18:42:12Z)
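The voxelization step lends itself to a short sketch (ours; voxel size, grid resolution, and mean pooling are arbitrary choices): points inside a voxel are pooled with a permutation-invariant reduction, and the resulting regular grid can then be processed by translation-equivariant convolutions, which is the informal content of the wreath-product hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(200, 3))       # toy point cloud in the unit cube
feats = rng.normal(size=(200, 5))                    # per-point features
voxel, grid_shape = 0.25, (4, 4, 4)                  # assumed voxel size / resolution

# Lower level: points within a voxel carry no intrinsic order, so pool them with a
# permutation-invariant reduction (here: the mean).
idx = np.minimum((points / voxel).astype(int), 3)    # voxel index of each point
grid = np.zeros(grid_shape + (5,))
counts = np.zeros(grid_shape)
for (i, j, k), f in zip(idx, feats):
    grid[i, j, k] += f
    counts[i, j, k] += 1
grid = grid / np.maximum(counts[..., None], 1)

# Upper level: the voxel grid is a regular lattice, so translation-equivariant
# convolutions over `grid` respect the remaining symmetry of the hierarchy.
print(grid.shape)   # (4, 4, 4, 5)
```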