The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
- URL: http://arxiv.org/abs/2306.17759v2
- Date: Sat, 9 Dec 2023 19:59:40 GMT
- Title: The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
- Authors: Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann,
Chris Maddison, Daniel M. Roy
- Abstract summary: We study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width.
To achieve a well-defined limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity.
We show, through simulations, that the stochastic differential equation (SDE) indexed by the depth-to-width ratio provides a surprisingly good description of the corresponding finite-size model.
- Score: 38.89510345229949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning theory, the covariance matrix of the representations serves
as a proxy to examine the network's trainability. Motivated by the success of
Transformers, we study the covariance matrix of a modified Softmax-based
attention model with skip connections in the proportional limit of
infinite-depth-and-width. We show that at initialization the limiting
distribution can be described by a stochastic differential equation (SDE)
indexed by the depth-to-width ratio. To achieve a well-defined stochastic
limit, the Transformer's attention mechanism is modified by centering the
Softmax output at identity, and scaling the Softmax logits by a width-dependent
temperature parameter. We examine the stability of the network through the
corresponding SDE, showing how the scale of both the drift and diffusion can be
elegantly controlled with the aid of residual connections. The existence of a
stable SDE implies that the covariance structure is well-behaved, even for very
large depth and width, thus preventing the notorious issues of rank degeneracy
in deep attention models. Finally, we show, through simulations, that the SDE
provides a surprisingly good description of the corresponding finite-size
model. We coin the name shaped Transformer for these architectural
modifications.
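Schematically, in the proportional limit where depth d and width n grow with the ratio t = d/n held fixed, the abstract's claim is that the limiting covariance V_t follows an SDE of the form dV_t = b(V_t) dt + \Sigma(V_t)^{1/2} dB_t, with the scale of the drift b and diffusion \Sigma controlled by the residual-connection scaling. The sketch below (NumPy) gives one plausible reading of the two architectural modifications: the Softmax logits are divided by a width-dependent temperature, and the Softmax output is re-centered so that the attention matrix at initialization is a small perturbation of the identity. The function name, the sqrt(n) temperature, and the uniform-matrix centering are illustrative assumptions made for this sketch, not the paper's exact formulation.

    # Minimal sketch of "shaped" attention as described in the abstract.
    # Assumptions (not from the paper): temperature tau = tau0 * sqrt(n),
    # and centering via subtraction of the uniform attention matrix.
    import numpy as np

    def shaped_attention(X, Wq, Wk, Wv, tau0=1.0):
        """Single-head attention with identity-centered Softmax.

        X: (T, n) array of T tokens of width n.
        """
        T, n = X.shape
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Width-dependent temperature (illustrative choice).
        tau = tau0 * np.sqrt(n)
        logits = (Q @ K.T) / tau
        # Numerically stable row-wise Softmax.
        logits -= logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        # Center the Softmax output at the identity: subtract the uniform
        # matrix (the Softmax of zero logits) and add I, so that at
        # initialization each token mostly attends to itself.
        A = np.eye(T) + (probs - np.ones((T, T)) / T)
        return A @ V

    # Example at width n = 64 with 1/sqrt(n)-scaled random weights.
    rng = np.random.default_rng(0)
    T, n = 8, 64
    X = rng.standard_normal((T, n))
    Wq, Wk, Wv = (rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(3))
    out = shaped_attention(X, Wq, Wk, Wv)

Because the logits are O(1/tau) at initialization, probs stays close to the uniform matrix and A stays close to the identity, which is the mechanism the abstract credits with keeping the covariance structure well-behaved and avoiding rank degeneracy at large depth.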
Related papers
- Exploring the ability of the Deep Ritz Method to model strain localization as a sharp discontinuity [0.0]
We use regularized strong-discontinuity kinematics within a variational setting for elastoplastic solids.
The corresponding mathematical model is discretized using Artificial Neural Networks (ANNs).
As a proof of concept, we show through both 1D and 2D numerical examples that the computational modeling of strain localization for elastoplastic solids is feasible.
arXiv Detail & Related papers (2024-09-20T05:57:50Z) - AdjointDEIS: Efficient Gradients for Diffusion Models [2.0795007613453445]
We show that continuous adjoint equations for diffusion SDEs actually simplify to a simple ODE.
We also demonstrate the effectiveness of AdjointDEIS for guided generation with an adversarial attack in the form of the face morphing problem.
arXiv Detail & Related papers (2024-05-23T19:51:33Z) - Transolver: A Fast Transformer Solver for PDEs on General Geometries [66.82060415622871]
We present Transolver, which learns intrinsic physical states hidden behind discretized geometries.
By calculating attention to physics-aware tokens encoded from slices, Transolver can effectively capture intricate physical correlations.
Transolver achieves consistent state-of-the-art with 22% relative gain across six standard benchmarks and also excels in large-scale industrial simulations.
arXiv Detail & Related papers (2024-02-04T06:37:38Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - High-dimensional scaling limits and fluctuations of online least-squares SGD with smooth covariance [16.652085114513273]
We derive high-dimensional scaling limits and fluctuations for the online least-squares stochastic gradient descent (SGD) algorithm.
Our results have several applications, including characterization of the limiting mean-square estimation or prediction errors and their fluctuations.
arXiv Detail & Related papers (2023-04-03T03:50:00Z) - Convex Bounds on the Softmax Function with Applications to Robustness Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z) - The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization [13.872374586700767]
Recent work has shown that shaping the activation function as network depth grows large is necessary to obtain a non-degenerate limit.
We identify the precise scaling of the activation function necessary to arrive at a nontrivial limit.
We recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
arXiv Detail & Related papers (2022-06-06T17:45:07Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems [91.3755431537592]
We consider a control system of the form $\dot{x} = \sum_{i=1}^{l} F_i(x)\,u_i$, with linear dependence in the controls.
We use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points.
arXiv Detail & Related papers (2021-10-24T08:57:46Z)