An Introduction to Transformers
- URL: http://arxiv.org/abs/2304.10557v5
- Date: Thu, 8 Feb 2024 15:01:34 GMT
- Title: An Introduction to Transformers
- Authors: Richard E. Turner
- Abstract summary: The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer is a neural network component that can be used to learn
useful representations of sequences or sets of data-points. The transformer has
driven recent advances in natural language processing, computer vision, and
spatio-temporal modelling. There are many introductions to transformers, but
most do not contain precise mathematical descriptions of the architecture and
the intuitions behind the design choices are often also missing. Moreover, as
research takes a winding path, the explanations for the components of the
transformer can be idiosyncratic. In this note we aim for a mathematically
precise, intuitive, and clean description of the transformer architecture. We
will not discuss training as this is rather standard. We assume that the reader
is familiar with fundamental topics in machine learning including multi-layer
perceptrons, linear transformations, softmax functions and basic probability.
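The abstract assumes familiarity with linear transformations and softmax functions, the two ingredients of the transformer's core self-attention operation. As a hedged illustration (not code from the paper), a single-head scaled dot-product self-attention layer can be sketched in NumPy as follows; the weight matrices here are randomly initialized placeholders, not trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X:          (n, d) array of n token representations.
    Wq, Wk, Wv: (d, d_h) query/key/value projection matrices.
    Returns:    (n, d_h) array of attended representations.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n) pairwise similarities
    A = softmax(scores, axis=-1)             # each row is a distribution over tokens
    return A @ V                             # convex combinations of value vectors

rng = np.random.default_rng(0)
n, d, d_h = 4, 8, 8
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d_h)) for _ in range(3)]
Y = self_attention(X, *W)
print(Y.shape)  # (4, 8)
```

Each output row is a weighted average of the value vectors, with weights given by the softmax over query-key similarities; stacking such layers with multi-layer perceptrons yields the architecture the note describes.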
Related papers
- Extracting Finite State Machines from Transformers
We investigate the trainability of transformers trained on regular languages from a mechanistic interpretability perspective.
We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state.
Our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation.
arXiv Detail & Related papers (2024-10-08T13:43:50Z)
- Transformer Explainer: Interactive Learning of Text-Generative Models
Transformer Explainer is an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model.
It runs a live GPT-2 instance locally in the user's browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together.
arXiv Detail & Related papers (2024-08-08T17:49:07Z)
- Transformers are Expressive, But Are They Expressive Enough for Regression?
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z)
- Introduction to Transformers: an NLP Perspective
We introduce basic concepts of Transformers and present key techniques that form the recent advances of these models.
This includes a description of the standard Transformer architecture, a series of model refinements, and common applications.
arXiv Detail & Related papers (2023-11-29T13:51:04Z)
- Linear attention is (maybe) all you need (to understand transformer optimization)
We make progress towards understanding the subtleties of training Transformers by studying a simple yet canonicalized shallow Transformer model.
Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics.
arXiv Detail & Related papers (2023-10-02T10:48:42Z)
- Transformers learn in-context by gradient descent
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Thinking Like Transformers
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how this language, RASP, can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
We propose to use dictionary learning to open up "black boxes" as linear superpositions of transformer factors.
Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors.
We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
arXiv Detail & Related papers (2021-03-29T20:51:33Z)
- A Survey on Visual Transformer
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.