Flowformer: Linearizing Transformers with Conservation Flows
- URL: http://arxiv.org/abs/2202.06258v1
- Date: Sun, 13 Feb 2022 08:44:10 GMT
- Title: Flowformer: Linearizing Transformers with Conservation Flows
- Authors: Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long
- Abstract summary: We linearize Transformers free from specific inductive biases based on flow network theory.
By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions.
- Score: 77.25101425464773
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformers based on the attention mechanism have achieved impressive
success in various areas. However, the attention mechanism has a quadratic
complexity, significantly impeding Transformers from dealing with numerous
tokens and scaling up to bigger models. Previous methods mainly utilize the
similarity decomposition and the associativity of matrix multiplication to
devise linear-time attention mechanisms. They avoid degeneration of attention
to a trivial distribution by reintroducing inductive biases such as locality,
at the expense of model generality and expressiveness. In this paper, we
linearize Transformers free from specific inductive biases, based on flow
network theory. We cast attention as the information flow
aggregated from the sources (values) to the sinks (results) through the learned
flow capacities (attentions). Within this framework, we apply the property of
flow conservation to attention and propose the Flow-Attention mechanism of
linear complexity. By respectively conserving the incoming flow of sinks for
source competition and the outgoing flow of sources for sink allocation,
Flow-Attention inherently generates informative attentions without using
specific inductive biases. Empowered by Flow-Attention, Flowformer yields
strong performance in linear time across wide areas, including long sequences,
time series, vision, natural language, and reinforcement learning.
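To make the flow-conservation idea concrete, here is a minimal, non-causal sketch of a Flow-Attention-style layer in PyTorch. It is a reading of the abstract rather than the authors' released implementation: the sigmoid feature map, the exact conservation normalizations, and the softmax/sigmoid nonlinearities used for competition and allocation are illustrative assumptions, as are the function name flow_attention and the eps parameter.

```python
import torch

def flow_attention(q, k, v, eps=1e-6):
    """Flow-Attention-style linear attention (non-causal sketch).

    q, k: (batch, n, d_k) -- sinks (results) and sources (keys)
    v:    (batch, n, d_v) -- values attached to the sources
    """
    # Non-negative "flow capacities", so incoming/outgoing flows are well defined.
    q, k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink i:   I_i = phi(q_i) . sum_j phi(k_j)
    incoming = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps
    # Outgoing flow of each source j: O_j = phi(k_j) . sum_i phi(q_i)
    outgoing = torch.einsum("bnd,bd->bn", k, q.sum(dim=1)) + eps

    # Conserve the incoming flow of sinks (fix it to one unit each) and look at
    # the outgoing flow the sources would then have -> they must compete for it.
    conserved_out = torch.einsum("bnd,bd->bn", k, (q / incoming[..., None]).sum(dim=1))
    # Conserve the outgoing flow of sources and look at the incoming flow the
    # sinks would then receive -> it is allocated among them.
    conserved_in = torch.einsum("bnd,bd->bn", q, (k / outgoing[..., None]).sum(dim=1))

    # Source competition: scale each value by its share of conserved outgoing flow.
    v = v * torch.softmax(conserved_out, dim=1)[..., None]

    # Linear-time aggregation via associativity: phi(K)^T V is computed first,
    # so the n x n attention matrix is never materialized.
    context = torch.einsum("bnd,bne->bde", k, v)                          # (batch, d_k, d_v)
    out = torch.einsum("bnd,bde->bne", q / incoming[..., None], context)  # (batch, n, d_v)

    # Sink allocation: gate each result by its conserved incoming flow.
    return out * torch.sigmoid(conserved_in)[..., None]


# Shape check: 1024 tokens without ever forming a 1024 x 1024 attention matrix.
q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
print(flow_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```

The cost is O(n · d_k · d_v) in sequence length n, and the conservation terms take over the role that softmax's row-wise normalization plays in standard attention, which is what the abstract credits for avoiding degeneration to a trivial distribution without reintroducing locality priors.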
Related papers
- Learning Monotonic Attention in Transducer for Streaming Generation [26.24357071901915]
We propose a learnable monotonic attention mechanism to handle non-monotonic alignments in Transducer-based streaming generation models.
Our approach allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space.
arXiv Detail & Related papers (2024-11-26T07:19:26Z)
- Verlet Flows: Exact-Likelihood Integrators for Flow-Based Generative Models [4.9425328004453375]
We present Verlet flows, a class of continuous normalizing flows (CNFs) on an augmented state-space inspired by symplectic integrators from Hamiltonian dynamics.
Verlet flows provide exact-likelihood generative models which generalize coupled flow architectures from a non-continuous setting while imposing minimal expressivity constraints.
On experiments over toy densities, we demonstrate that the variance of the commonly used Hutchinson trace estimator makes it unsuitable for importance sampling, whereas Verlet flows perform comparably to full autograd trace computations while being significantly faster.
arXiv Detail & Related papers (2024-05-05T03:47:56Z)
- Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z)
- Generative Flows with Invertible Attentions [135.23766216657745]
We introduce two types of invertible attention mechanisms for generative flow models.
We exploit split-based attention mechanisms to learn the attention weights and input representations on every two splits of flow feature maps.
Our method provides invertible attention modules with tractable Jacobian determinants, enabling seamless integration at any position in flow-based models.
arXiv Detail & Related papers (2021-06-07T20:43:04Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect; a sketch of one such double normalization is given after this list.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows [78.77808270452974]
SurVAE Flows is a modular framework for composable transformations that encompasses VAEs and normalizing flows.
We show that several recently proposed methods, including dequantization and augmented normalizing flows, can be expressed as SurVAE Flows.
arXiv Detail & Related papers (2020-07-06T13:13:22Z)
- Focus of Attention Improves Information Transfer in Visual Features [80.22965663534556]
This paper focuses on unsupervised learning for transferring visual information in a truly online setting.
The computation of the entropy terms is carried out by a temporal process which yields their online estimation.
In order to better structure the input probability distribution, we use a human-like focus of attention model.
arXiv Detail & Related papers (2020-06-16T15:07:25Z)
- The Convolution Exponential and Generalized Sylvester Flows [82.18442368078804]
This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation.
An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers.
We show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10.
arXiv Detail & Related papers (2020-06-02T19:43:36Z)
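The doubly-normalized scheme mentioned under "Attention that does not Explain Away" above is a close quadratic-time relative of Flow-Attention's two conservation constraints. Below is a minimal sketch of one simple instantiation, normalizing the exponentiated scores over the query axis and then over the key axis; the cited paper's exact formulation may differ, and the function name doubly_normalized_attention and the eps parameter are illustrative.

```python
import torch

def doubly_normalized_attention(q, k, v, eps=1e-6):
    """Doubly-normalized attention, quadratic-time sketch.

    Each key first spreads one unit of mass across the queries, then each
    query renormalizes over the keys, which is intended to keep every key
    contributing rather than being "explained away" by a dominant competitor.
    """
    scores = torch.einsum("bqd,bkd->bqk", q, k) / q.shape[-1] ** 0.5
    e = torch.exp(scores - scores.amax(dim=(1, 2), keepdim=True))  # numerical stability
    p = e / (e.sum(dim=1, keepdim=True) + eps)   # normalize over the query axis
    a = p / (p.sum(dim=2, keepdim=True) + eps)   # then over the key axis
    return torch.einsum("bqk,bkd->bqd", a, v)


# Usage: same shapes as standard attention.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(doubly_normalized_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```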