Attention Mechanism, Max-Affine Partition, and Universal Approximation
- URL: http://arxiv.org/abs/2504.19901v1
- Date: Mon, 28 Apr 2025 15:31:45 GMT
- Title: Attention Mechanism, Max-Affine Partition, and Universal Approximation
- Authors: Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
- Abstract summary: We prove that a single self-attention layer is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. We also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
- Score: 11.61656225057345
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under $L_p$-norm for $1\leq p <\infty$. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
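To make the domain-partition reading concrete, here is a minimal NumPy sketch of the idea, not the paper's exact construction: lifting the input to $(x, 1)$ lets one softmax attention step score $m$ affine functions $\langle w_i, x\rangle + b_i$, and a large inverse temperature drives the weights toward the region whose affine score is maximal, so the output approximates the value assigned to that region. All names here (`W`, `b`, `V`, `beta`) are illustrative choices, not symbols from the paper.

```python
# A hedged sketch of the max-affine partition view of softmax attention:
# keys encode affine functions a_i(x) = <w_i, x> + b_i; as beta grows,
# the softmax concentrates on argmax_i a_i(x), so the attention output
# approaches the value V_i assigned to the winning subregion.
import numpy as np

rng = np.random.default_rng(0)

d, m = 2, 5                      # input dimension, number of partition regions
W = rng.normal(size=(m, d))      # affine slopes w_i (hypothetical)
b = rng.normal(size=m)           # affine offsets b_i (hypothetical)
V = rng.normal(size=(m, 1))      # value assigned to each region

def attention_output(x, beta=50.0):
    """One softmax attention step with the lifted input as the query."""
    q = np.concatenate([x, [1.0]])          # lift x -> (x, 1), so q . key_i = <w_i, x> + b_i
    K = np.concatenate([W, b[:, None]], 1)  # key_i = (w_i, b_i)
    scores = beta * (K @ q)                 # scaled affine scores
    attn = np.exp(scores - scores.max())    # numerically stable softmax
    attn /= attn.sum()
    return attn @ V                         # weighted mix of region values

x = rng.normal(size=d)
hard = V[np.argmax(W @ x + b)]              # exact max-affine selection
soft = attention_output(x)
print(np.abs(hard - soft).max())            # small for large beta (away from region boundaries)
```

Increasing `beta` sharpens the softmax toward a hard $\arg\max$, which is how the soft mixture collapses to the per-region assignment in the limit.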
Related papers
- Universal Approximation with Softmax Attention [10.857177487536656]
We prove that both (i) two-layer self-attention and (ii) one-layer self-attention are universal approximators for continuous sequence-to-sequence functions on compact domains.
arXiv Detail & Related papers (2025-04-22T14:51:33Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Some new considerations about the $\nu$-function [0.0]
We show that the $\nu$-function plays the role of the normalization function of generalized hypergeometric coherent states for quantum systems with a continuous spectrum.
To our knowledge, the results obtained by us do not appear in the literature.
arXiv Detail & Related papers (2024-09-09T10:08:17Z) - Facility Location Games with Scaling Effects [63.421996606381164]
We take the classic facility location problem and consider a variation in which each agent's individual cost function is equal to their distance from the facility multiplied by a scaling factor. We focus on the objectives of total and maximum cost, describing the computation of the optimal solution. We characterize the conditions on scaling functions which ensure that agents have single-peaked preferences.
arXiv Detail & Related papers (2024-02-29T07:08:18Z) - Minimum Width for Deep, Narrow MLP: A Diffeomorphism Approach [3.218087085276242]
We propose a framework that simplifies finding the minimum width for deep, narrow MLPs into determining a purely geometrical function denoted as $w(d_x, d_y)$.
We provide an upper bound for the minimum width, given by $\max(2d_x+1, d_y) + \alpha(\sigma)$, where $0 \leq \alpha(\sigma) \leq 2$ is a constant depending on the activation function.
arXiv Detail & Related papers (2023-08-30T08:58:23Z) - Monotone deep Boltzmann machines [86.50247625239406]
Deep Boltzmann machines (DBMs) are multi-layered probabilistic models governed by a pairwise energy function.
We develop a new class of restricted model, the monotone DBM, which allows for arbitrary self-connection in each layer.
We show that a particular choice of activation results in a fixed-point iteration that gives a variational mean-field solution.
arXiv Detail & Related papers (2023-07-11T03:02:44Z) - Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - Universal Approximation Property of Neural Ordinary Differential Equations [19.861764482790544]
We show that NODEs can form an $L^p$-universal approximator for continuous maps under certain conditions.
We also show their stronger approximation property, namely the $\sup$-universality for approximating a large class of diffeomorphisms.
arXiv Detail & Related papers (2020-12-04T05:53:21Z) - Minimum Width for Universal Approximation [91.02689252671291]
We prove that the minimum width required for the universal approximation of $L^p$ functions is exactly $\max\{d_x+1, d_y\}$.
We also prove that the same conclusion does not hold for the uniform approximation with ReLU, but does hold with an additional threshold activation function.
arXiv Detail & Related papers (2020-06-16T01:24:21Z) - Optimal Bounds between $f$-Divergences and Integral Probability Metrics [8.401473551081748]
Families of $f$-divergences and Integral Probability Metrics are widely used to quantify similarity between probability distributions.
We systematically study the relationship between these two families from the perspective of convex duality.
We obtain new bounds while also recovering, in a unified manner, well-known results such as Hoeffding's lemma.
arXiv Detail & Related papers (2020-06-10T17:39:11Z)