Transformers from an Optimization Perspective
- URL: http://arxiv.org/abs/2205.13891v1
- Date: Fri, 27 May 2022 10:45:15 GMT
- Title: Transformers from an Optimization Perspective
- Authors: Yongyi Yang, Zengfeng Huang, David Wipf
- Abstract summary: We study the problem of finding an energy function underlying the Transformer model.
By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process.
This work contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
- Score: 24.78739299952529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models such as the Transformer are often constructed by
heuristics and experience. To provide a complementary foundation, in this work
we study the following problem: Is it possible to find an energy function
underlying the Transformer model, such that descent steps along this energy
correspond with the Transformer forward pass? By finding such a function, we
can reinterpret Transformers as the unfolding of an interpretable optimization
process across iterations. This unfolding perspective has been frequently
adopted in the past to elucidate more straightforward deep models such as MLPs
and CNNs; however, obtaining a similar equivalence has thus far remained elusive
for more complex models with self-attention mechanisms like the Transformer. To
this end, we first outline several major obstacles before
providing companion techniques to at least partially address them,
demonstrating for the first time a close association between energy function
minimization and deep layers with self-attention. This interpretation
contributes to our intuition and understanding of Transformers, while
potentially laying the groundwork for new model designs.
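To make the unfolding idea concrete, here is a minimal sketch, not the paper's actual construction: a toy energy over token states whose gradient descent step combines a residual pull toward the layer input with a softmax-attention-style aggregation. The energy, step size, and temperature below are illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact energy): a toy energy over token
# states Y whose gradient step mixes a residual term toward the input X with a
# softmax-attention-style aggregation of the rows of X.
import torch

def energy(Y, X, tau=4.0, lam=1.0):
    # Quadratic anchor to the layer input plus a log-sum-exp coupling term;
    # the LSE term's gradient is a softmax-weighted average of the rows of X.
    anchor = 0.5 * (Y - X).pow(2).sum()
    lse = tau * torch.logsumexp(Y @ X.T / tau, dim=1).sum()
    return anchor - lam * lse

def descent_step(Y, X, step=0.05):
    # One gradient descent step on the energy, taken with autograd so the
    # update is exactly -step * dE/dY. Analytically the update equals
    #   Y - step * ((Y - X) - softmax(Y X^T / tau) X),
    # i.e. a residual term plus an attention-like aggregation.
    Y = Y.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(energy(Y, X), Y)
    return (Y - step * grad).detach()

torch.manual_seed(0)
X = torch.randn(8, 16)        # 8 "tokens", 16-dim states
Y = X.clone()
for t in range(5):            # unfold 5 descent steps ~ 5 "layers"
    print(f"step {t}: E = {energy(Y, X).item():.3f}")
    Y = descent_step(Y, X)
```

Running the loop prints a monotonically decreasing energy, mirroring how stacked attention-plus-residual layers can trace a descent trajectory under this reading.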
Related papers
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Moreover, improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
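As a hedged illustration of the associative-memory view named in the title (the retrieval rule and constants here are assumptions, not the paper's framework), softmax attention can be read as retrieving stored patterns: a noisy query recovers the nearest memorized item almost exactly.

```python
# Softmax attention as associative-memory retrieval (illustrative sketch).
import torch

def retrieve(query, keys, values, beta=8.0):
    # One softmax "retrieval" step: a sharper beta pulls the output closer
    # to the single best-matching stored pattern.
    weights = torch.softmax(beta * (query @ keys.T), dim=-1)
    return weights @ values

torch.manual_seed(0)
patterns = torch.nn.functional.normalize(torch.randn(32, 64), dim=-1)
query = patterns[3] + 0.1 * torch.randn(64)     # noisy probe of pattern 3
out = retrieve(query, patterns, patterns)
print(torch.nn.functional.cosine_similarity(out, patterns[3], dim=0))  # ~1.0
```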
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
- TransformerFAM: Feedback attention is working memory [18.005034679674274]
We propose a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations.
TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models.
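A minimal sketch of a feedback-memory loop in the spirit of this summary (the block layout, memory size, and update rule are assumptions, not the paper's exact architecture): the same attention weights process both segment tokens and carried-over feedback states, so no new weights are introduced.

```python
# Feedback states are concatenated to each segment, processed by the shared
# attention block, and split back out as the memory for the next segment.
import torch, torch.nn as nn

d, n_mem = 64, 4
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
memory = torch.zeros(1, n_mem, d)               # feedback state carried over

def run_segment(tokens, memory):
    x = torch.cat([tokens, memory], dim=1)      # attend over [tokens; memory]
    out, _ = attn(x, x, x)
    return out[:, :-n_mem], out[:, -n_mem:]     # new tokens, updated memory

stream = torch.randn(1, 3 * 8, d)               # a long input: 3 segments of 8
for seg in stream.split(8, dim=1):              # process segment by segment
    seg_out, memory = run_segment(seg, memory)  # memory links the segments
print(memory.shape)                             # torch.Size([1, 4, 64])
```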
arXiv Detail & Related papers (2024-04-14T07:43:45Z)
- Introduction to Transformers: an NLP Perspective [59.0241868728732]
We introduce basic concepts of Transformers and present key techniques that underpin the recent advances of these models.
This includes a description of the standard Transformer architecture, a series of model refinements, and common applications.
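For reference alongside this survey entry, a compact rendering of the standard encoder layer it describes (a pre-LN variant; the sizes are placeholder assumptions):

```python
# Standard Transformer encoder layer: self-attention and a position-wise FFN,
# each wrapped with LayerNorm and a residual connection.
import torch, torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]   # self-attention sublayer + residual
        x = x + self.ff(self.ln2(x))    # feed-forward sublayer + residual
        return x

x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)          # torch.Size([2, 10, 512])
```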
arXiv Detail & Related papers (2023-11-29T13:51:04Z)
- Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities.
This paper presents a systematic approach for fusing two or more transformer-based networks by exploiting Optimal Transport to (soft-)align the various architectural components.
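A hedged sketch of the core idea only (not the paper's full pipeline): a Sinkhorn-style optimal transport plan soft-aligns the units of two trained layers before averaging them. The cost, entropic regularizer, and iteration count below are illustrative assumptions.

```python
# Entropic OT (Sinkhorn scaling) produces a soft alignment matrix between the
# output units of two weight matrices; aligned units are then averaged.
import torch

def sinkhorn(cost, n_iter=50, eps=1.0):
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.shape[0])
    for _ in range(n_iter):                # alternating marginal scaling
        v = 1.0 / (K.T @ u)
        u = 1.0 / (K @ v)
    return u[:, None] * K * v[None, :]     # soft alignment; rows sum to 1

def fuse(Wa, Wb):
    # Soft-align output units of Wb to those of Wa, then average.
    T = sinkhorn(torch.cdist(Wa, Wb))      # pairwise unit distances as cost
    return 0.5 * (Wa + T @ Wb)

Wa, Wb = torch.randn(16, 8), torch.randn(16, 8)
print(fuse(Wa, Wb).shape)                  # torch.Size([16, 8])
```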
arXiv Detail & Related papers (2023-10-09T13:40:31Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
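A simplified numerical check of the kind of equivalence this line of work constructs (not the paper's exact parameterization; the learning rate and sizes are assumptions): one unnormalized linear-attention-style update with hand-set weights reproduces one gradient descent step on in-context least squares.

```python
# One GD step on in-context linear regression, written two ways: explicitly,
# and as a linear-attention prediction. The two outputs are identical.
import torch

torch.manual_seed(0)
n, d = 16, 4
X = torch.randn(n, d)                   # in-context inputs
w_true = torch.randn(d)
y = X @ w_true                          # in-context targets
lr = 0.1

# Explicit gradient descent step on L(w) = 0.5 * ||X w - y||^2 from w = 0:
# grad = -X^T y, so the post-step weights are lr * X^T y.
w_gd = lr * X.T @ y

# Linear-attention form: values y_j weighted by raw query-key scores <x_q, x_j>
# give sum_j y_j <x_q, x_j> * lr = <x_q, w_gd>.
x_q = torch.randn(d)
attn_pred = lr * ((X @ x_q) * y).sum()
print(attn_pred.item(), (x_q @ w_gd).item())   # identical
```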
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
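A sketch of the conservative-propagation recipe as it is often implemented (a simplification of one ingredient of the proposal; the module layout is an assumption): detach the attention matrix so that gradient-times-input relevance flows only through the value path.

```python
# Attention with the softmax matrix treated as a constant during backprop,
# so relevance is not (mis)attributed through the attention logits.
import torch, torch.nn as nn

class DetachedAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)

    def forward(self, x):
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        A = torch.softmax(scores, dim=-1)
        return A.detach() @ self.v(x)    # A is a constant for backprop

d = 32
layer = DetachedAttention(d)
x = torch.randn(5, d, requires_grad=True)
layer(x).sum().backward()
relevance = (x * x.grad).sum(dim=-1)     # gradient-times-input relevance
print(relevance)
```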
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
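A minimal sketch of the augmented-shortcut scheme as summarized above (the parallel path's exact form is an assumption; the paper studies specific choices): alongside the identity shortcut, a lightweight learnable path runs in parallel.

```python
# Block output = main branch + identity shortcut + learnable augmented path.
import torch, torch.nn as nn

class AugmentedShortcutBlock(nn.Module):
    def __init__(self, d, block):
        super().__init__()
        self.block = block                       # e.g. an attention sublayer
        self.aug = nn.Linear(d, d, bias=False)   # learnable parallel path

    def forward(self, x):
        return self.block(x) + x + self.aug(x)   # main + identity + augmented

d = 64
block = AugmentedShortcutBlock(d, nn.Sequential(nn.Linear(d, d), nn.GELU()))
x = torch.randn(2, 10, d)
print(block(x).shape)                            # torch.Size([2, 10, 64])
```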
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.