Transformers as Support Vector Machines
- URL: http://arxiv.org/abs/2308.16898v3
- Date: Thu, 22 Feb 2024 18:38:14 GMT
- Title: Transformers as Support Vector Machines
- Authors: Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak
- Abstract summary: We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
- Score: 54.642793677472724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since its inception in "Attention Is All You Need", the transformer
architecture has led to revolutionary advancements in NLP. The attention layer within the
transformer admits a sequence of input tokens $X$ and makes them interact
through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where
$(K,Q)$ are the trainable key-query parameters. In this work, we establish a
formal equivalence between the optimization geometry of self-attention and a
hard-margin SVM problem that separates optimal input tokens from non-optimal
tokens using linear constraints on the outer-products of token pairs. This
formalism allows us to characterize the implicit bias of 1-layer transformers
optimized with gradient descent: (1) Optimizing the attention layer with
vanishing regularization, parameterized by $(K,Q)$, converges in direction to
an SVM solution minimizing the nuclear norm of the combined parameter
$W=KQ^\top$. In contrast, directly parameterizing by $W$ minimizes a Frobenius norm
objective. We characterize this convergence, highlighting that it can occur
toward locally-optimal directions rather than global ones. (2) Complementing
this, we prove the local/global directional convergence of gradient descent
under suitable geometric conditions. Importantly, we show that
over-parameterization catalyzes global convergence by ensuring the feasibility
of the SVM problem and by guaranteeing a benign optimization landscape devoid
of stationary points. (3) While our theory applies primarily to linear
prediction heads, we propose a more general SVM equivalence that predicts the
implicit bias with nonlinear heads. Our findings are applicable to arbitrary
datasets and their validity is verified via experiments. We also introduce
several open problems and research directions. We believe these findings
inspire the interpretation of transformers as a hierarchy of SVMs that
separates and selects optimal tokens.
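As a rough illustration of the hard-margin problem described above, the token-separating SVM can be written schematically as follows; the notation ($x_{i,t}$ for token $t$ of input sequence $i$, $z_i$ for the query token, $\mathrm{opt}_i$ for the index of the optimal token) is assumed here for exposition and follows the abstract's informal description rather than the paper's exact statement:
$$
\min_{W}\ \|W\| \quad \text{subject to} \quad \big(x_{i,\mathrm{opt}_i} - x_{i,t}\big)^\top W\, z_i \;\ge\; 1 \quad \text{for all sequences } i \text{ and all tokens } t \neq \mathrm{opt}_i,
$$
where $\|W\|$ is the nuclear norm of the combined parameter $W=KQ^\top$ under the $(K,Q)$ parameterization and the Frobenius norm under the direct $W$ parameterization. Intuitively, scaling up any feasible $W$ drives the softmax to place all of its mass on the optimal token of each sequence, which is the sense in which attention "separates and selects" tokens.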
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification (a minimal sketch of this setup appears after the related-papers list below).
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Kernel Support Vector Machine Classifiers with the $\ell_0$-Norm Hinge
Loss [3.007949058551534]
Support Vector Machine (SVM) has been one of the most successful machine learning techniques for binary classification problems.
This paper concentrates on the kernel SVM with the $\ell_0$-norm hinge loss (referred to as $\ell_0$-KSVM), which is a composite of the hinge loss and the $\ell_0$-norm.
Experiments on synthetic and real datasets show that $\ell_0$-KSVM achieves accuracy comparable to the standard KSVM.
arXiv Detail & Related papers (2023-06-24T14:52:44Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - Hybrid Model-based / Data-driven Graph Transform for Image Coding [54.31406300524195]
We present a hybrid model-based / data-driven approach to encode an intra-prediction residual block.
The first $K$ eigenvectors of a transform matrix are derived from a statistical model, e.g., the asymmetric discrete sine transform (ADST) for stability.
Using WebP as a baseline image codec, experimental results show that our hybrid graph transform achieved better energy compaction than the default discrete cosine transform (DCT) and better stability than the Karhunen-Loève transform (KLT).
arXiv Detail & Related papers (2022-03-02T15:36:44Z) - Provably Efficient Convergence of Primal-Dual Actor-Critic with
Nonlinear Function Approximation [15.319335698574932]
We show the first efficient convergence result for primal-dual actor-critic with nonlinear function approximation, with an explicit convergence rate in the number of samples $N$ under Markovian sampling.
Results on OpenAI Gym continuous control tasks.
arXiv Detail & Related papers (2022-02-28T15:16:23Z) - A Precise High-Dimensional Asymptotic Theory for Boosting and
Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)