Redesigning the Transformer Architecture with Insights from
Multi-particle Dynamical Systems
- URL: http://arxiv.org/abs/2109.15142v2
- Date: Sun, 3 Oct 2021 07:21:07 GMT
- Title: Redesigning the Transformer Architecture with Insights from
Multi-particle Dynamical Systems
- Authors: Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti and Tanmoy
Chakraborty
- Abstract summary: We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
- Score: 32.86421107987556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer and its variants have been proven to be efficient sequence
learners in many different domains. Despite their staggering success, a
critical issue has been the enormous number of parameters that must be trained
(ranging from $10^7$ to $10^{11}$) along with the quadratic complexity of
dot-product attention. In this work, we investigate the problem of
approximating the two central components of the Transformer -- multi-head
self-attention and point-wise feed-forward transformation, with reduced
parameter space and computational complexity. We build upon recent developments
in analyzing deep neural networks as numerical solvers of ordinary differential
equations. Taking advantage of an analogy between Transformer stages and the
evolution of a dynamical system of multiple interacting particles, we formulate
a temporal evolution scheme, TransEvolve, to bypass costly dot-product
attention over multiple stacked layers. We perform exhaustive experiments with
TransEvolve on well-known encoder-decoder as well as encoder-only tasks. We
observe that the degree of approximation (or inversely, the degree of parameter
reduction) has different effects on the performance, depending on the task.
While in the encoder-decoder regime TransEvolve delivers performance comparable
to the original Transformer, in encoder-only tasks it consistently outperforms
the Transformer along with several subsequent variants.
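To make the analogy concrete, the sketch below illustrates the general idea the abstract describes: treat the token embeddings as interacting particles, compute their pairwise dot-product interactions once from the input, and then advance the representations with forward-Euler steps of an ODE, reusing that interaction matrix instead of recomputing attention in every stacked layer. All names, shapes, the single-head attention, and the plain Euler update are illustrative assumptions, not the paper's exact TransEvolve parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def trans_evolve_sketch(X, Wq, Wk, Wv, W1, W2, num_steps=4, h=0.5):
    """Advance token representations X (n_tokens x d_model) over pseudo-time.

    The pairwise interaction matrix A is computed once from the input
    (a single dot-product attention) and reused at every Euler step,
    rather than being recomputed in every stacked layer.
    """
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1]))
    for _ in range(num_steps):
        interaction = A @ (X @ Wv)                    # reused attention term
        feed_forward = np.maximum(0.0, X @ W1) @ W2   # position-wise ReLU MLP
        X = X + h * (interaction + feed_forward)      # forward Euler: X <- X + h * F(X)
    return X

# Hypothetical shapes, purely for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = 0.1 * rng.normal(size=(d_model, 4 * d_model))
W2 = 0.1 * rng.normal(size=(4 * d_model, d_model))
print(trans_evolve_sketch(X, Wq, Wk, Wv, W1, W2).shape)  # (8, 16)
```

In this reading, each stacked Transformer layer corresponds to one integration step of a multi-particle dynamical system, which is why the expensive interaction (attention) can be shared across steps while only the cheap per-step updates differ.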
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z) - Learning with SASQuaTCh: a Novel Variational Quantum Transformer Architecture with Kernel-Based Self-Attention [0.464982780843177]
We show that quantum circuits can efficiently express a self-attention mechanism through the perspective of kernel-based operator learning.
In this work, we are able to represent deep layers of a vision transformer network using simple gate operations and a set of multi-dimensional quantum Fourier transforms.
We analyze our novel variational quantum circuit, which we call Self-Attention Sequential Quantum Transformer Channel (SASQuaTCh), and demonstrate its utility on simplified classification problems.
arXiv Detail & Related papers (2024-03-21T18:00:04Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Error Correction Code Transformer [92.10654749898927]
We propose, for the first time, to extend the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output into a high-dimensional representation so that the bit information can be better represented and processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z) - Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart (a generic sketch of the underlying linear-attention recurrence appears after this list).
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z) - Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.