Flowformer: Linearizing Transformers with Conservation Flows
- URL: http://arxiv.org/abs/2202.06258v1
- Date: Sun, 13 Feb 2022 08:44:10 GMT
- Title: Flowformer: Linearizing Transformers with Conservation Flows
- Authors: Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long
- Abstract summary: We linearize Transformers free from specific inductive biases based on flow network theory.
By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions.
- Score: 77.25101425464773
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformers based on the attention mechanism have achieved impressive
success in various areas. However, the attention mechanism has a quadratic
complexity, significantly impeding Transformers from dealing with numerous
tokens and scaling up to bigger models. Previous methods mainly utilize the
similarity decomposition and the associativity of matrix multiplication to
devise linear-time attention mechanisms. They avoid degeneration of attention
to a trivial distribution by reintroducing inductive biases such as locality,
at the expense of model generality and expressiveness. In this paper, we
linearize Transformers free from specific inductive biases, based on flow
network theory. We cast attention as the information flow
aggregated from the sources (values) to the sinks (results) through the learned
flow capacities (attentions). Within this framework, we apply the property of
flow conservation to attention and propose the Flow-Attention mechanism of
linear complexity. By respectively conserving the incoming flow of sinks for
source competition and the outgoing flow of sources for sink allocation,
Flow-Attention inherently generates informative attentions without using
specific inductive biases. Empowered by Flow-Attention, Flowformer yields
strong performance in linear time across wide areas, including long sequences,
time series, vision, natural language, and reinforcement learning.
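To make the flow-conservation idea concrete, here is a minimal, non-causal sketch of a Flow-Attention-style layer in PyTorch. It is a reading of the abstract rather than the authors' released implementation: the sigmoid feature map, the exact conservation normalizations, and the softmax/sigmoid nonlinearities used for competition and allocation are illustrative assumptions, as are the function name flow_attention and the eps parameter.

```python
import torch

def flow_attention(q, k, v, eps=1e-6):
    """Flow-Attention-style linear attention (non-causal sketch).

    q, k: (batch, n, d_k) -- sinks (results) and sources (keys)
    v:    (batch, n, d_v) -- values attached to the sources
    """
    # Non-negative "flow capacities", so incoming/outgoing flows are well defined.
    q, k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink i:   I_i = phi(q_i) . sum_j phi(k_j)
    incoming = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps
    # Outgoing flow of each source j: O_j = phi(k_j) . sum_i phi(q_i)
    outgoing = torch.einsum("bnd,bd->bn", k, q.sum(dim=1)) + eps

    # Conserve the incoming flow of sinks (fix it to one unit each) and look at
    # the outgoing flow the sources would then have -> they must compete for it.
    conserved_out = torch.einsum("bnd,bd->bn", k, (q / incoming[..., None]).sum(dim=1))
    # Conserve the outgoing flow of sources and look at the incoming flow the
    # sinks would then receive -> it is allocated among them.
    conserved_in = torch.einsum("bnd,bd->bn", q, (k / outgoing[..., None]).sum(dim=1))

    # Source competition: scale each value by its share of conserved outgoing flow.
    v = v * torch.softmax(conserved_out, dim=1)[..., None]

    # Linear-time aggregation via associativity: phi(K)^T V is computed first,
    # so the n x n attention matrix is never materialized.
    context = torch.einsum("bnd,bne->bde", k, v)                          # (batch, d_k, d_v)
    out = torch.einsum("bnd,bde->bne", q / incoming[..., None], context)  # (batch, n, d_v)

    # Sink allocation: gate each result by its conserved incoming flow.
    return out * torch.sigmoid(conserved_in)[..., None]


# Shape check: 1024 tokens without ever forming a 1024 x 1024 attention matrix.
q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
print(flow_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```

The cost is O(n · d_k · d_v) in sequence length n, and the conservation terms take over the role that softmax's row-wise normalization plays in standard attention, which is what the abstract credits for avoiding degeneration to a trivial distribution without reintroducing locality priors.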
Related papers
- Learning Monotonic Attention in Transducer for Streaming Generation [26.24357071901915]
We propose a learnable monotonic attention mechanism to handle non-monotonic alignments in Transducer-based streaming generation models.
Our approach allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space.
arXiv Detail & Related papers (2024-11-26T07:19:26Z)
- Verlet Flows: Exact-Likelihood Integrators for Flow-Based Generative Models [4.9425328004453375]
We present Verlet flows, a class of continuous normalizing flows (CNFs) on an augmented state-space inspired by symplectic integrators from Hamiltonian dynamics.
Verlet flows provide exact-likelihood generative models which generalize coupled flow architectures from a non-continuous setting while imposing minimal expressivity constraints.
On experiments over toy densities, we demonstrate that the variance of the commonly used Hutchinson trace estimator makes it unsuitable for importance sampling, whereas Verlet flows perform comparably to full autograd trace computations while being significantly faster.
arXiv Detail & Related papers (2024-05-05T03:47:56Z)
- Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z)
- Generative Flows with Invertible Attentions [135.23766216657745]
We introduce two types of invertible attention mechanisms for generative flow models.
We exploit split-based attention mechanisms to learn the attention weights and input representations on every two splits of flow feature maps.
Our method provides invertible attention modules with tractable Jacobian determinants, enabling seamless integration at any position in flow-based models.
arXiv Detail & Related papers (2021-06-07T20:43:04Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect; a sketch of one such double normalization is given after this list.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows [78.77808270452974]
SurVAE Flows is a modular framework for composable transformations that encompasses VAEs and normalizing flows.
We show that several recently proposed methods, including dequantization and augmented normalizing flows, can be expressed as SurVAE Flows.
arXiv Detail & Related papers (2020-07-06T13:13:22Z)
- Focus of Attention Improves Information Transfer in Visual Features [80.22965663534556]
This paper focuses on unsupervised learning for transferring visual information in a truly online setting.
The computation of the entropy terms is carried out by a temporal process which yields their online estimation.
In order to better structure the input probability distribution, we use a human-like focus of attention model.
arXiv Detail & Related papers (2020-06-16T15:07:25Z)
- The Convolution Exponential and Generalized Sylvester Flows [82.18442368078804]
This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation.
An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers.
We show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10.
arXiv Detail & Related papers (2020-06-02T19:43:36Z)
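The doubly-normalized scheme mentioned under "Attention that does not Explain Away" above is a close quadratic-time relative of Flow-Attention's two conservation constraints. Below is a minimal sketch of one simple instantiation, normalizing the exponentiated scores over the query axis and then over the key axis; the cited paper's exact formulation may differ, and the function name doubly_normalized_attention and the eps parameter are illustrative.

```python
import torch

def doubly_normalized_attention(q, k, v, eps=1e-6):
    """Doubly-normalized attention, quadratic-time sketch.

    Each key first spreads one unit of mass across the queries, then each
    query renormalizes over the keys, which is intended to keep every key
    contributing rather than being "explained away" by a dominant competitor.
    """
    scores = torch.einsum("bqd,bkd->bqk", q, k) / q.shape[-1] ** 0.5
    e = torch.exp(scores - scores.amax(dim=(1, 2), keepdim=True))  # numerical stability
    p = e / (e.sum(dim=1, keepdim=True) + eps)   # normalize over the query axis
    a = p / (p.sum(dim=2, keepdim=True) + eps)   # then over the key axis
    return torch.einsum("bqk,bkd->bqd", a, v)


# Usage: same shapes as standard attention.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(doubly_normalized_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```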