Universal Approximation with Softmax Attention
- URL: http://arxiv.org/abs/2504.15956v1
- Date: Tue, 22 Apr 2025 14:51:33 GMT
- Title: Universal Approximation with Softmax Attention
- Authors: Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu
- Abstract summary: We prove that, with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains.
- Score: 10.857177487536656
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.
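As a concrete illustration of the key insight (self-attention approximating a generalized ReLU), the following minimal numpy sketch shows a single softmax-attention head whose output x·sigmoid(βx) converges to ReLU(x) as the key scaling β grows. The two-token sequence, the scalar weight matrices, and the value of β are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch (assumed parameterization, not the paper's construction):
# one softmax-attention head over the two-token sequence [x, 0] outputs
# x * sigmoid(beta * x), which approaches ReLU(x) as beta grows.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Plain single-head softmax attention: softmax(Q K^T) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T, axis=-1) @ V

beta = 50.0
# Each token carries (value, constant-1) coordinates; the second token is a fixed zero.
W_Q = np.array([[0.0], [1.0]])     # query reads the constant coordinate (= 1)
W_K = np.array([[beta], [0.0]])    # key = beta * value coordinate
W_V = np.array([[1.0], [0.0]])     # value = the value coordinate itself

xs = np.linspace(-3, 3, 601)
outs = np.array([attention_head(np.array([[x, 1.0], [0.0, 1.0]]),
                                W_Q, W_K, W_V)[0, 0] for x in xs])
relu = np.maximum(xs, 0.0)
print("max |attention output - ReLU| :", np.abs(outs - relu).max())  # shrinks as beta grows
```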
Related papers
- Attention Mechanism, Max-Affine Partition, and Universal Approximation [11.61656225057345]
We prove that a single self-attention layer is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm.
We also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
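A hedged sketch of the max-affine idea behind this result: a convex target is the pointwise max of its tangent lines, and a softmax with a large inverse temperature acts as a smooth max over those affine pieces, which is the selection step a softmax-attention layer can realize. The target x², the grid of tangent points, and β below are illustrative assumptions, not the paper's construction.

```python
# Illustrative sketch: softmax as a smooth max over affine pieces (max-affine view).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Tangent lines of f(x) = x^2 at anchors c_i:  a_i x + b_i with a_i = 2 c_i, b_i = -c_i^2.
centers = np.linspace(-1, 1, 21)
a, b = 2 * centers, -centers ** 2

beta = 100.0
xs = np.linspace(-1, 1, 401)
pieces = a[None, :] * xs[:, None] + b[None, :]      # (num_x, num_pieces)
weights = softmax(beta * pieces, axis=-1)           # soft selection of the dominant piece
soft_max_affine = (weights * pieces).sum(axis=-1)

print("max-affine gap to x^2      :", np.abs(pieces.max(axis=-1) - xs ** 2).max())
print("softmax gap to max-affine  :", np.abs(soft_max_affine - pieces.max(axis=-1)).max())
```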
arXiv Detail & Related papers (2025-04-28T15:31:45Z) - MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation [5.298814565953444]
Existing relative positional encoding methods address the length extrapolation challenge using only a single kernel function.
This study proposes a novel relative positional encoding method, called MEP, which employs a weighted average to combine distinct kernel functions.
We present two distinct versions of our method: a parameter-free variant that requires no new learnable parameters, and a parameterized variant capable of integrating state-of-the-art techniques.
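A brief, hedged sketch of the kernel-mixing idea: two relative-position kernels are combined by a weighted average and added as a bias to the attention logits. The specific kernels (exponential and Gaussian) and the fixed mixing weights below are illustrative assumptions, not MEP's exact choices.

```python
# Hedged sketch of mixing relative-position kernels by a weighted average.
import numpy as np

seq_len = 8
rel = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])  # |i - j|

exp_kernel = np.exp(-0.5 * rel)              # exponential-decay kernel (assumed)
gauss_kernel = np.exp(-(rel ** 2) / 8.0)     # Gaussian kernel (assumed)

w = np.array([0.6, 0.4])                     # parameter-free variant: fixed weights summing to 1
mixed_bias = w[0] * exp_kernel + w[1] * gauss_kernel

# Assumed usage: the mixed kernel is added as a bias to the attention scores before softmax.
scores = np.random.randn(seq_len, seq_len) / np.sqrt(seq_len)
biased_scores = scores + mixed_bias
print(biased_scores.shape)
```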
arXiv Detail & Related papers (2024-03-26T13:38:06Z) - Prompting a Pretrained Transformer Can Be a Universal Approximator [105.59562522323274]
We show that much smaller pretrained models than previously thought can be universal approximators when prefixed with a prompt.
We also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
arXiv Detail & Related papers (2024-02-22T18:12:48Z) - CSformer: Combining Channel Independence and Mixing for Robust Multivariate Time Series Forecasting [3.6814181034608664]
We propose a strategy of channel independence followed by mixing in time series analysis.
We introduce CSformer, a novel framework featuring a two-stage multi-head self-attention mechanism.
Our framework effectively incorporates sequence and channel adapters, significantly improving the model's ability to identify important information.
arXiv Detail & Related papers (2023-12-11T09:10:38Z) - Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
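A rough numerical illustration of the rank-collapse claim, under assumed random Gaussian weights: stacking softmax self-attention layers without skip connections or MLPs drives the tokens toward a common value, i.e. toward a rank-1 output. Depth, width, and the weight scales below are arbitrary assumptions for the demo.

```python
# Demo (assumed random weights): pure attention stacks collapse toward rank 1.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rank1_residual(X):
    """Relative distance of X from the nearest matrix with identical rows."""
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) / np.linalg.norm(X)

n_tokens, d = 6, 16
X = rng.normal(size=(n_tokens, d))
for layer in range(1, 9):
    W_QK = rng.normal(size=(d, d)) / d            # fused query-key map
    W_V = rng.normal(size=(d, d)) / np.sqrt(d)    # value map
    A = softmax(X @ W_QK @ X.T, axis=-1)          # row-stochastic attention matrix
    X = A @ X @ W_V                               # pure attention: no residual path, no MLP
    print(f"layer {layer}: relative residual to rank-1 = {rank1_residual(X):.3e}")
```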
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences.
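A minimal sketch of Eigenvalue Power Normalization on a second-order (Gram) co-occurrence matrix, assuming a power γ = 0.5 and toy frame-level features; the paper applies the idea to higher-order tensors, which this sketch does not cover.

```python
# Hedged sketch of EPN: raise the eigenvalues of a symmetric PSD matrix to gamma in (0, 1].
import numpy as np

def eigenvalue_power_normalization(M, gamma=0.5, eps=1e-12):
    """Return U diag(lambda**gamma) U^T for a symmetric PSD matrix M."""
    lam, U = np.linalg.eigh(M)
    lam = np.clip(lam, 0.0, None)           # clip tiny negative eigenvalues from round-off
    return (U * (lam + eps) ** gamma) @ U.T

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))            # 50 toy frame-level descriptors of dim 8 (assumed)
cooc = feats.T @ feats / feats.shape[0]     # second-order co-occurrence (Gram) matrix
print(np.round(np.linalg.eigvalsh(cooc), 3))                                    # raw spectrum
print(np.round(np.linalg.eigvalsh(eigenvalue_power_normalization(cooc)), 3))    # flattened spectrum
```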
arXiv Detail & Related papers (2020-12-28T17:27:18Z) - Rethinking Attention with Performers [45.47365397101224]
We introduce Performers, Transformer architectures which can estimate full-rank-attention Transformers with provable accuracy.
Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.
We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
arXiv Detail & Related papers (2020-09-30T17:09:09Z) - Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR); a brief numerical sketch of the random-feature idea follows this entry.
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
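Both Performer entries above rest on the FAVOR/FAVOR+ mechanism. The following minimal sketch shows its positive random features, which give an unbiased estimate of the softmax kernel exp(q·k) and thereby let attention be computed in time linear in sequence length; the head dimension, feature count, and query/key scaling below are arbitrary assumptions.

```python
# Sketch of FAVOR+-style positive random features: phi(x) = exp(w . x - ||x||^2 / 2) / sqrt(m),
# with w ~ N(0, I), so that E[phi(q) . phi(k)] = exp(q . k).
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                                   # head dim, number of random features (assumed)
W = rng.normal(size=(m, d))                       # random feature directions w_i ~ N(0, I_d)

def positive_features(X):
    """Positive random features; rows of X are queries or keys."""
    return np.exp(X @ W.T - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

q = rng.normal(size=(d,)) / d ** 0.5
k = rng.normal(size=(d,)) / d ** 0.5
exact = np.exp(q @ k)                             # softmax-kernel value
approx = positive_features(q[None]) @ positive_features(k[None]).T
print(f"exact {exact:.4f}  vs  random-feature estimate {approx[0, 0]:.4f}")
```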
This list is automatically generated from the titles and abstracts of the papers on this site.