Exact Sequence Interpolation with Transformers
- URL: http://arxiv.org/abs/2502.02270v2
- Date: Wed, 29 Oct 2025 17:15:10 GMT
- Title: Exact Sequence Interpolation with Transformers
- Authors: Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua
- Abstract summary: We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d \geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.
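The paper's explicit construction is not reproduced here, but the hardmax self-attention setting it analyzes first is easy to sketch: replacing the softmax row-distribution with a one-hot vector at its argmax makes every token copy its highest-scoring "leader" token, which is the clustering effect the construction capitalizes on. A minimal single-head sketch (function names and the identity-matrix demo are ours, not the paper's construction):

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hardmax(z):
    # one-hot at the (first) argmax of each row: the zero-temperature
    # limit of softmax used in the paper's hardmax results
    a = np.zeros_like(z)
    a[np.arange(z.shape[0]), z.argmax(axis=-1)] = 1.0
    return a

def attention_layer(X, Q, K, V, hard=False):
    """One generic single-head self-attention layer on n tokens in R^d.

    With hard=True every token is replaced by (a value transform of) its
    single highest-scoring token, so tokens sharing a leader collapse
    onto one point: the clustering effect.
    """
    scores = (X @ Q) @ (X @ K).T          # (n, n) pairwise scores
    A = hardmax(scores) if hard else softmax(scores)
    return A @ (X @ V)

# Demo: with Q = K = V = I, tokens 0 and 1 share leader 0 and collapse.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
I = np.eye(2)
Y = attention_layer(X, I, I, I, hard=True)
```

The paper additionally restricts the attention parameter matrices to low rank, mirroring practical implementations; the sketch above places no such restriction.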
Related papers
- Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality [52.424255020469595]
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens. We show that a shallow measure-theoretic Transformer learns the recall-and-predict map under a spectral assumption on the input densities.
arXiv Detail & Related papers (2026-02-02T09:34:17Z) - Transmutation based Quantum Simulation for Non-unitary Dynamics [35.35971148847751]
We present a quantum algorithm for simulating dissipative diffusion dynamics generated by positive semidefinite operators of the form $A = L^\dagger L$. Our main tool is the Kannai transform, which represents the diffusion semigroup $e^{-TA}$ as a Gaussian-weighted superposition of unitary wave propagators.
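The Kannai transform itself is specific to that paper, but the general flavor of writing a non-unitary semigroup as a Gaussian average of unitaries can be seen in a standard identity for a Hermitian operator $H$ (a simplification: here $A = H^2$ rather than a general $L^\dagger L$):

```latex
e^{-T H^{2}} \;=\; \frac{1}{\sqrt{4\pi T}} \int_{-\infty}^{\infty}
e^{-s^{2}/(4T)}\, e^{\mathrm{i} s H}\, \mathrm{d}s, \qquad T > 0,
```

so the dissipative evolution $e^{-TH^{2}}$ is reproduced by sampling unitary propagators $e^{\mathrm{i}sH}$ with Gaussian weights.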
arXiv Detail & Related papers (2026-01-07T05:47:22Z) - Spectral-Window Hybrid (SWH) [0.0]
Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. We propose the Spectral-Window Hybrid (SWH), an architecture that decouples sequence modeling into two parallel streams. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences.
arXiv Detail & Related papers (2026-01-04T00:31:36Z) - Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are universally consistent when tasked with executing least-squares regression. We derive upper bounds on the empirical error that decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z) - Approximation Bounds for Transformer Networks with Application to Regression [9.549045683389085]
We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions. We establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions.
arXiv Detail & Related papers (2025-04-16T15:25:58Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT)
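For context, one standard form of graph $p$-Laplacian regularization (not necessarily the exact functional used by p-LaT) penalizes weighted $p$-th powers of differences across graph edges:

```latex
E_{p}(u) \;=\; \frac{1}{2} \sum_{i,j} w_{ij}\, \lvert u_i - u_j \rvert^{p},
```

with $p = 2$ recovering the familiar quadratic form $u^{\top} L u$ of the graph Laplacian $L = D - W$; varying $p$ tunes how strongly large differences between neighboring values are penalized.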
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
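As a reminder of the object on the SVM side of this equivalence, the classical hard-margin SVM (the general problem, not the token-selection variant constructed in the paper) is

```latex
\min_{w,\, b} \;\; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \,\geq\, 1 \quad \text{for all } i,
```

whose solution is the maximum-margin hyperplane separating the two classes.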
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Sampled Transformer for Point Sets [80.66097006145999]
Sparse transformers can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being universal approximators of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z) - An Online Riemannian PCA for Stochastic Canonical Correlation Analysis [37.8212762083567]
We present an efficient algorithm (RSG+) for canonical correlation analysis (CCA) using a reparametrization of the projection matrices.
While the paper primarily focuses on the formulation and technical analysis of its properties, our experiments show that the empirical behavior on common datasets is quite promising.
arXiv Detail & Related papers (2021-06-08T23:38:29Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Supervised Quantile Normalization for Low-rank Matrix Approximation [50.445371939523305]
We learn the parameters of quantile normalization operators that can operate row-wise on the values of $X$ and/or of its factorization $UV$ to improve the quality of the low-rank representation of $X$ itself.
We demonstrate the applicability of these techniques on synthetic and genomics datasets.
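The learned, supervised quantile-normalization operators of that paper are not spelled out in this summary; for orientation, a plain (non-learned) row-wise quantile normalization, which maps each row onto a fixed target quantile vector while preserving within-row ranks, can be sketched as follows (function names are ours):

```python
import numpy as np

def quantile_normalize_rows(X, target):
    """Map each row of X onto the sorted `target` values, preserving
    within-row ranks. This is the classic non-learned operator; the
    paper instead learns the target (and the operator's parameters)
    end-to-end to improve a low-rank factorization of X."""
    X = np.asarray(X, dtype=float)
    t = np.sort(np.asarray(target, dtype=float))
    Xn = np.empty_like(X)
    order = np.argsort(X, axis=1)          # rank of each entry per row
    for i, idx in enumerate(order):
        Xn[i, idx] = t                     # k-th smallest entry -> t[k]
    return Xn
```

After normalization every row contains exactly the target values, so all rows share the same empirical distribution while each row's ordering is kept.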
arXiv Detail & Related papers (2020-02-08T21:06:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.