Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for
Long Sequences
- URL: http://arxiv.org/abs/2210.11794v1
- Date: Fri, 21 Oct 2022 08:13:34 GMT
- Title: Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for
Long Sequences
- Authors: Aosong Feng, Irene Li, Yuang Jiang, Rex Ying
- Abstract summary: Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full attention by analyzing the graph-expander property from a spectral perspective.
- Score: 16.066338004414092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient Transformers have been developed for long-sequence modeling because of their subquadratic memory and time complexity. The Sparse Transformer is a popular approach that improves efficiency by restricting self-attention to locations specified by predefined sparse patterns. However, this sparsity may sacrifice expressiveness relative to full attention when important token correlations are multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of full-attention Transformers, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention with Attention Diffusion, which computes multi-hop token correlations along all paths between disconnected tokens, in addition to attention among neighboring tokens. Theoretically, we show that Diffuser is a universal sequence approximator for sequence-to-sequence modeling and analyze its ability to approximate full attention through the graph-expander property from a spectral perspective. Experimentally, we evaluate Diffuser extensively on language modeling, image modeling, and the Long Range Arena (LRA). Diffuser improves accuracy by an average of 0.94% on text classification and 2.30% on LRA, with 1.67$\times$ memory savings over state-of-the-art baselines, demonstrating its superiority in both expressiveness and efficiency.
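To make the Attention Diffusion idea concrete, here is a minimal single-head sketch, assuming a simple local-window sparse pattern and personalized-PageRank-style hop weights alpha*(1-alpha)^k that are truncated after a few hops; the function names and the `num_hops` and `alpha` parameters are illustrative assumptions, not the authors' reference implementation, which stores the attention graph sparsely and diffuses per head.

```python
# Minimal single-head sketch of multi-hop attention diffusion over a sparse pattern.
# Assumptions (illustrative, not the paper's reference code): a local-window mask and
# PPR-style hop weights alpha * (1 - alpha)^k, truncated after `num_hops` steps.
import torch

def local_sparse_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask keeping attention only inside a local window (a simple sparse pattern)."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

def attention_diffusion(q, k, v, mask, num_hops: int = 3, alpha: float = 0.1):
    """q, k, v: (seq_len, d). Returns the diffused attention output of shape (seq_len, d)."""
    d = q.size(-1)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)          # one-hop sparse attention (row-stochastic)

    # Truncated diffusion: alpha * sum_{j=0..K} (1 - alpha)^j * attn^j @ v, computed
    # iteratively; tokens disconnected in the mask still interact via intermediate tokens.
    z = v
    for _ in range(num_hops):
        z = (1.0 - alpha) * (attn @ z) + v
    return alpha * z

# Toy usage: a dense boolean mask stands in for the sparse storage a real implementation would use.
seq_len, d = 16, 8
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
out = attention_diffusion(q, k, v, local_sparse_mask(seq_len, window=2))
print(out.shape)  # torch.Size([16, 8])
```

Each diffusion step reuses the same one-hop sparse attention matrix, so multi-hop correlations are accumulated at a cost proportional to the number of retained edges rather than the full quadratic attention.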
Related papers
- Differential Transformer [99.5117269150629]
Transformers tend to over-allocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks (Synapse, BTCV, ACDC, BRaTS, and Decathlon-Lung) reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
- Random Feature Attention [69.4671822971207]
We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function.
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
arXiv Detail & Related papers (2021-03-03T02:48:56Z)
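The last entry above, Random Feature Attention, describes a technique that is straightforward to sketch: approximate the softmax kernel with random features so that attention cost becomes linear in sequence length. The snippet below is a hedged single-head illustration under assumed choices (trigonometric random features, l2-normalized queries and keys, a fixed feature count); it is not the RFA authors' implementation and omits the optional recency-bias gating mentioned in the summary.

```python
# Hedged sketch of random-feature (linear-time) attention in the spirit of RFA:
# the softmax kernel is approximated with random Fourier features phi, so that
# sum_i exp(q . k_i) v_i is approximated by phi(q) @ (sum_i phi(k_i) v_i^T), giving O(N) cost.
# Feature count, scaling, and l2-normalization below are illustrative assumptions.
import torch
import torch.nn.functional as F

def random_fourier_features(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Trigonometric random features approximating a Gaussian kernel."""
    proj = x @ w.T                                       # (seq_len, num_feats)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1) / w.size(0) ** 0.5

def rfa_attention(q, k, v, num_feats: int = 64):
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)  # relate exp(q . k) to a Gaussian kernel
    w = torch.randn(num_feats, q.size(-1))                 # random projections (fixed at init in practice)
    phi_q, phi_k = random_fourier_features(q, w), random_fourier_features(k, w)
    kv = phi_k.T @ v                                     # (2*num_feats, d): aggregated key-value features
    z = phi_k.sum(dim=0)                                 # (2*num_feats,): aggregated normalizer
    denom = (phi_q @ z).clamp_min(1e-6).unsqueeze(-1)    # numerical guard for this sketch
    return (phi_q @ kv) / denom

seq_len, d = 128, 32
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
print(rfa_attention(q, k, v).shape)  # torch.Size([128, 32])
```

Because the (2*num_feats, d) key-value summary is shared by every query, both time and memory grow linearly with sequence length instead of quadratically.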