Transformers through the lens of support-preserving maps between measures
- URL: http://arxiv.org/abs/2509.25611v1
- Date: Tue, 30 Sep 2025 00:15:33 GMT
- Title: Transformers through the lens of support-preserving maps between measures
- Authors: Takashi Furuya, Maarten V. de Hoop, Matti Lassas
- Abstract summary: We study the question of what kind of maps between measures are transformers. On the one hand, these include transformers; on the other hand, transformers universally approximate representations with any continuous in-context map. We prove that the measure-theoretic self-attention has the properties that ensure that the infinite-depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.
- Score: 17.447252333183616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are deep architectures that define "in-context maps", which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In previous work, we studied the ability of these architectures to handle an arbitrarily large number of context tokens. To analyze their expressivity mathematically and uniformly, we considered the case in which the mappings are conditioned on a context represented by a probability distribution, which becomes discrete for a finite number of tokens. Modeling neural networks as maps on probability measures has multiple applications, such as studying Wasserstein regularity, proving generalization bounds, and doing a mean-field limit analysis of the dynamics of interacting particles as they go through the network. In this work, we study the question of what kind of maps between measures are transformers. We fully characterize the properties of maps between measures that enable these to be represented in terms of in-context maps via a push-forward. On the one hand, these include transformers; on the other hand, transformers universally approximate representations with any continuous in-context map. These properties are that the map preserves the cardinality of the support and that the regular part of its Fréchet derivative is uniformly continuous. Moreover, we show that the solution map of the Cauchy problem for the Vlasov equation, a nonlocal transport equation for interacting particle systems in the mean-field regime, satisfies these conditions and hence can be approximated by a transformer; conversely, we prove that the measure-theoretic self-attention has the properties that ensure that the infinite-depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.
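The push-forward representation described in the abstract can be illustrated on a discrete measure. The following is a minimal NumPy sketch, not the paper's construction: it assumes a single-head softmax attention with a residual connection as the in-context map, and the names `in_context_map`, `push_forward`, `Q`, `K`, `V` are hypothetical. A measure with n atoms is stored as an array of points plus an array of weights; the push-forward moves each atom and leaves the weights unchanged, so the output has at most n atoms.

```python
import numpy as np

def in_context_map(points, weights, x, Q, K, V):
    """Evaluate an assumed attention-type in-context map Gamma(mu, x).

    mu = sum_j weights[j] * delta_{points[j]} is the context measure and
    x has shape (m, d). Q, K, V are illustrative (d, d) parameter matrices.
    """
    # Scaled scores <Q x_i, K y_j>, shape (m, n).
    scores = (x @ Q.T) @ (points @ K.T).T / np.sqrt(Q.shape[0])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # Softmax kernel integrated against mu: each atom counts with its mass.
    kernel = np.exp(scores) * weights
    kernel /= kernel.sum(axis=1, keepdims=True)
    # Residual update: x + integral of (V y) against the normalized kernel.
    return x + kernel @ (points @ V.T)

def push_forward(points, weights, Q, K, V):
    """Push mu forward through x -> Gamma(mu, x): atoms move, masses stay."""
    return in_context_map(points, weights, points, Q, K, V), weights

rng = np.random.default_rng(0)
d = 3
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
points = rng.normal(size=(5, d))   # five context tokens
weights = np.full(5, 1 / 5)        # uniform empirical measure
new_points, new_weights = push_forward(points, weights, Q, K, V)
print(new_points.shape, new_weights.sum())
```

Because the map depends on the context only through the measure, splitting one atom into two atoms at the same location with half the mass each leaves `in_context_map` unchanged, which is what lets the same formula apply to any number of tokens.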
Related papers
- The Bayesian Geometry of Transformer Attention [0.4779196219827507]
We build controlled environments where the true posterior is known in closed form and memorization is provably impossible. Small transformers reproduce Bayesian posteriors with $10^{-3}$--$10^{-4}$ bit accuracy, while capacity-matched geometrics fail by orders of magnitude.
arXiv Detail & Related papers (2025-12-27T05:28:58Z) - Classical feature map surrogates and metrics for quantum control landscapes [0.0]
We derive and analyze three feature maps of parametrized quantum dynamics, which generalize variational quantum circuits. The Lie-Fourier representation is shown to have a dense spectrum with discrete peaks, that reflects control Hamiltonian properties, but that is compressible in typically found symmetric systems.
arXiv Detail & Related papers (2025-09-30T08:24:13Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z) - Measure-to-measure interpolation using Transformers [5.290251602267728]
Transformers are deep neural network architectures that underpin the recent successes of large language models. A Transformer acts as a measure-to-measure map implemented as a specific interacting particle system on the unit sphere. We provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures.
arXiv Detail & Related papers (2024-11-07T09:18:39Z) - Transformers are Universal In-context Learners [21.513210412394965]
We show that deep transformers can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains.
A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens.
arXiv Detail & Related papers (2024-08-02T16:21:48Z) - Neural Isometries: Taming Transformations for Equivariant ML [8.203292895010748]
We introduce Neural Isometries, an autoencoder framework which learns to map the observation space to a general-purpose latent space.
We show that a simple off-the-shelf equivariant network operating in the pre-trained latent space can achieve results on par with meticulously-engineered, handcrafted networks.
arXiv Detail & Related papers (2024-05-29T17:24:25Z) - Mapping of attention mechanisms to a generalized Potts model [50.91742043564049]
We show that training a neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method.
We also compute the generalization error of self-attention in a model scenario analytically using the replica method.
arXiv Detail & Related papers (2023-04-14T16:32:56Z) - Entangled Residual Mappings [59.02488598557491]
We introduce entangled residual mappings to generalize the structure of the residual connections.
An entangled residual mapping replaces the identity skip connections with specialized entangled mappings.
We show that while entangled mappings can preserve the iterative refinement of features across various deep models, they influence the representation learning process in convolutional networks.
arXiv Detail & Related papers (2022-06-02T19:36:03Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings [67.11712279612583]
Cycle-consistent training is widely used for learning a forward and inverse mapping between two domains of interest.
We develop a conditional variational autoencoder (CVAE) approach that can be viewed as converting surjective mappings to implicit bijections.
Our pipeline can capture such many-to-one mappings during cycle training while promoting graph-to-text diversity.
arXiv Detail & Related papers (2020-12-14T10:59:59Z) - Joint Estimation of Image Representations and their Lie Invariants [57.3768308075675]
Images encode both the state of the world and its content.
The automatic extraction of this information is challenging because of the high-dimensionality and entangled encoding inherent to the image representation.
This article introduces two theoretical approaches aimed at the resolution of these challenges.
arXiv Detail & Related papers (2020-12-05T00:07:41Z) - Learning Disentangled Representations with Latent Variation Predictability [102.4163768995288]
This paper defines the variation predictability of latent disentangled representations.
Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs.
We develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations.
arXiv Detail & Related papers (2020-07-25T08:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.