Attention Enables Zero Approximation Error
- URL: http://arxiv.org/abs/2202.12166v1
- Date: Thu, 24 Feb 2022 16:06:01 GMT
- Title: Attention Enables Zero Approximation Error
- Authors: Zhiying Fang, Yidong Ouyang, Ding-Xuan Zhou, Guang Cheng
- Abstract summary: We show that a single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error.
As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal.
- Score: 22.110336842946555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models have been widely applied in many aspects of daily
life, and many variants of these architectures have achieved even better
performance. Attention-based architectures have become almost ubiquitous in
deep learning; in particular, the transformer has now surpassed convolutional
neural networks on image classification tasks and become one of the most
widely used models. However, the theoretical properties of
attention-based models are seldom considered. In this work, we show that with
suitable adaptations, the single-head self-attention transformer with a fixed
number of transformer encoder blocks and free parameters is able to generate
any desired polynomial of the input with no error. The number of transformer
encoder blocks is the same as the degree of the target polynomial. More
strikingly, we find that these transformer encoder blocks do not need to be
trained. As a direct consequence, we show that the single-head
self-attention transformer with increasing numbers of free parameters is
universal. These surprising theoretical results help explain the outstanding
performance of the transformer model and may shed light on future
modifications in real applications. We also provide experiments to verify our
theoretical results.
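The abstract does not spell out the construction, but a rough sense of why self-attention can build polynomials of the input comes from the form of the attention map itself. The sketch below is a hedged illustration under simplifying assumptions (softmax removed, arbitrary random weight matrices Wq, Wk, Wv), not the paper's adapted architecture: an un-normalized single-head self-attention block is a degree-3 polynomial map of its input, which is checked numerically via the homogeneity test f(t*X) = t^3 * f(X).

# Minimal sketch (not the paper's construction): with the softmax removed,
# a single-head self-attention block is a polynomial map of its input.
# We check its degree numerically via homogeneity: f(t*X) = t^3 * f(X).
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5                                   # embedding dimension, sequence length
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def linear_attention(X):
    """Un-normalized single-head self-attention: each output entry is cubic in X."""
    scores = (X @ Wq) @ (X @ Wk).T            # each score is quadratic in the entries of X
    return scores @ (X @ Wv)                  # multiplying by X once more gives degree 3

X = rng.standard_normal((n, d))
t = 1.7
print(np.allclose(linear_attention(t * X), t**3 * linear_attention(X)))   # True

With softmax and the paper's "suitable adaptations" restored, the exact zero-error construction is more delicate than this sketch; the point here is only that self-attention natively produces polynomial interactions between tokens, consistent with the stated link between the number of encoder blocks and the degree of the target polynomial.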
Related papers
- Understanding the Failure Modes of Transformers through the Lens of Graph Neural Networks [0.0]
This article studies many of the observed failure modes of transformers through the lens of graph neural network (GNN) theory.
We first make the case that much of deep learning, including transformers, is about learnable information mixing and propagation.
In addition, we analyze how the causal nature of decoder-only transformers creates interesting geometric properties in information propagation, resulting in predictable and potentially devastating failure modes.
arXiv Detail & Related papers (2025-12-09T22:57:23Z)
- Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures [10.970776446566909]
This paper investigates the capabilities of transformers in solving unsupervised learning problems.
We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks.
We prove that transformers can approximate both the EM algorithm and a core component of spectral methods.
arXiv Detail & Related papers (2025-05-17T09:02:18Z)
- Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights [47.62295798627317]
This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold.
We prove approximation and generalization error bounds that crucially depend on the intrinsic dimension of the manifold.
Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise.
arXiv Detail & Related papers (2025-05-06T05:41:46Z)
- A Theory for Compressibility of Graph Transformers for Transductive Learning [6.298115235439078]
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks.
All train/test/validation samples are present during training, making them more akin to a semi-supervised task.
We establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed.
arXiv Detail & Related papers (2024-11-20T04:20:17Z)
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- An Introduction to Transformers [23.915718146956355]
The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points.
In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture.
arXiv Detail & Related papers (2023-04-20T14:54:19Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.