A Unified Perspective on the Dynamics of Deep Transformers
- URL: http://arxiv.org/abs/2501.18322v1
- Date: Thu, 30 Jan 2025 13:04:54 GMT
- Title: A Unified Perspective on the Dynamics of Deep Transformers
- Authors: Valérie Castin, Pierre Ablin, José Antonio Carrillo, Gabriel Peyré
- Abstract summary: We study the evolution of data anisotropy through a deep Transformer.
We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
- Score: 24.094975798576783
- Abstract: Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention--leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
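The interacting particle system mentioned in the abstract can be illustrated with a minimal sketch. The listing does not state the equations, so the form below is an assumption: each token x_i moves with velocity sum_j softmax_j(<Q x_i, K x_j>) V x_j, a common single-head softmax formulation, integrated with explicit Euler steps. The function names `attention_velocity` and `evolve`, the matrices `Q`, `K`, `V`, and the step size are illustrative choices, not the authors' notation or code.

```python
import numpy as np


def attention_velocity(X, Q, K, V):
    """Velocity of every token under single-head softmax self-attention.

    X has shape (n, d): n tokens of dimension d.
    Q, K, V have shape (d, d) and are assumed fixed across depth.
    """
    # scores[i, j] = <Q x_i, K x_j>
    scores = (X @ Q.T) @ (X @ K.T).T
    # Subtract the row-wise maximum for numerical stability of the softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Row i is the attention-weighted average of the values V x_j.
    return weights @ (X @ V.T)


def evolve(X0, Q, K, V, step=0.05, n_steps=200):
    """Explicit Euler integration of dx_i/dt = attention_velocity(X)_i."""
    X = np.array(X0, dtype=float)
    for _ in range(n_steps):
        X = X + step * attention_velocity(X, Q, K, V)
    return X
```

Each Euler step plays the role of one residual attention layer. As the number of tokens grows, the empirical measure of the rows of `X` is the object whose mean-field limit the abstract calls the Transformer PDE.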
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
The dot product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers.
We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z)
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Geometric Dynamics of Signal Propagation Predict Trainability of Transformers [22.25628914395565]
We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers.
Our approach treats the evolution of $n$ tokens as they propagate through the transformer layers.
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
arXiv Detail & Related papers (2024-03-05T01:30:34Z)
- Advective Diffusion Transformers for Topological Generalization in Graph Learning [69.2894350228753]
We show how graph diffusion equations extrapolate and generalize in the presence of varying graph topologies.
We propose a novel graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations.
arXiv Detail & Related papers (2023-10-10T08:40:47Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- TransformerG2G: Adaptive time-stepping for learning temporal graph embeddings using transformers [2.2120851074630177]
We develop a graph embedding model with uncertainty quantification, TransformerG2G, to learn temporal dynamics of temporal graphs.
Our experiments demonstrate that the proposed TransformerG2G model outperforms conventional multi-step methods.
By examining the attention weights, we can uncover temporal dependencies, identify influential elements, and gain insights into the complex interactions within the graph structure.
arXiv Detail & Related papers (2023-07-05T18:34:22Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- tvGP-VAE: Tensor-variate Gaussian Process Prior Variational Autoencoder [0.0]
tvGP-VAE is able to explicitly model correlation via the use of kernel functions.
We show that the choice of which correlation structures to explicitly represent in the latent space has a significant impact on model performance.
arXiv Detail & Related papers (2020-06-08T17:59:13Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.