Fastformer: Additive Attention Can Be All You Need
- URL: http://arxiv.org/abs/2108.09084v2
- Date: Mon, 23 Aug 2021 13:11:51 GMT
- Title: Fastformer: Additive Attention Can Be All You Need
- Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
- Abstract summary: We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
- Score: 51.79399904527525
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Transformer is a powerful model for text understanding. However, it is
inefficient due to its quadratic complexity with respect to the input sequence
length. Although there are many methods for Transformer acceleration, they are
still either inefficient on long sequences or not effective enough. In this
paper, we propose Fastformer, which is an efficient Transformer model based on
additive attention. In Fastformer, instead of modeling the pair-wise
interactions between tokens, we first use an additive attention mechanism to
model global contexts, and then further transform each token representation
based on its interaction with the global context representations. In this way,
Fastformer can achieve effective context modeling with linear complexity.
Extensive experiments on five datasets show that Fastformer is much more
efficient than many existing Transformer models while achieving comparable or
even better long-text modeling performance.
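To make the flow above concrete, here is a minimal single-head NumPy sketch of the additive-attention pipeline the abstract describes: queries are summarized into a global query, the global query conditions the keys to form a global key, and the global key then modulates the values. The scoring vectors, output transformation, and residual connection are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fastformer_attention(Q, K, V, w_q, w_k, W_out):
    """Single-head additive attention over a sequence of length N.

    Q, K, V: (N, d) query/key/value matrices.
    w_q, w_k: (d,) scoring vectors for the additive attention.
    W_out:   (d, d) output transformation.
    Every step below costs O(N * d), so the layer is linear in N.
    """
    d = Q.shape[-1]

    # 1) Summarize all queries into one global query vector.
    alpha = softmax(Q @ w_q / np.sqrt(d))          # (N,)
    global_q = alpha @ Q                           # (d,)

    # 2) Mix the global query into each key (element-wise product),
    #    then summarize into a global key vector.
    P = K * global_q                               # (N, d)
    beta = softmax(P @ w_k / np.sqrt(d))           # (N,)
    global_k = beta @ P                            # (d,)

    # 3) Mix the global key into each value, transform, and keep a
    #    residual connection to the queries for token-level information.
    U = V * global_k                               # (N, d)
    return U @ W_out + Q                           # (N, d)

# Toy usage with random weights.
rng = np.random.default_rng(0)
N, d = 16, 32
out = fastformer_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                           rng.normal(size=(N, d)), rng.normal(size=d),
                           rng.normal(size=d), rng.normal(size=(d, d)))
print(out.shape)  # (16, 32)
```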
Related papers
- Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding [40.322453628755376]
General-purpose efficient Transformers are difficult to adapt directly to document modeling.
Fast-StrucTexT is an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture.
Our model achieves state-of-the-art performance with almost 1.9x faster inference than previous state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T02:42:35Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
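The MemSizer summary above is terse, so the following is only a generic sketch of the key-value memory-bank idea it points to: compress the length-N source into a small, fixed number of memory slots and let each query attend to those slots, so the per-query cost no longer grows with N. The slot count and the learned pooling matrix are assumptions for illustration, not MemSizer's actual construction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_bank_attention(Q, K, V, W_mem):
    """Attention against a compressed key-value memory.

    Q: (M, d) queries; K, V: (N, d) source keys/values.
    W_mem: (k, N) learned pooling that compresses the N source
    positions into k memory slots (k << N), so each query costs
    O(k * d) instead of O(N * d).
    """
    K_mem = W_mem @ K                              # (k, d) compressed keys
    V_mem = W_mem @ V                              # (k, d) compressed values
    scores = Q @ K_mem.T / np.sqrt(Q.shape[-1])    # (M, k)
    return softmax(scores) @ V_mem                 # (M, d)

# Toy usage: 1024 source positions compressed into 32 memory slots.
rng = np.random.default_rng(0)
M, N, k, d = 8, 1024, 32, 64
out = memory_bank_attention(rng.normal(size=(M, d)), rng.normal(size=(N, d)),
                            rng.normal(size=(N, d)), rng.normal(size=(k, N)) / N)
print(out.shape)  # (8, 64)
```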
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
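To illustrate the Toeplitz observation above, the sketch below multiplies a Toeplitz matrix by a vector in O(n log n): the matrix is embedded in a circulant matrix, whose matrix-vector product is a circular convolution and can therefore be computed with the FFT. This is only the standard Toeplitz trick, not the paper's full kernelized-attention algorithm.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix T (first column c, first row r,
    with c[0] == r[0]) by a vector x in O(n log n) using the FFT.

    T is embedded into a (2n-1)-sized circulant matrix, whose
    matrix-vector product is a circular convolution and is therefore
    diagonalized by the FFT.
    """
    n = len(c)
    v = np.concatenate([c, r[1:][::-1]])   # first column of the circulant
    y = np.concatenate([x, np.zeros(n - 1)])
    prod = np.fft.ifft(np.fft.fft(v) * np.fft.fft(y)).real
    return prod[:n]

# Check against the dense O(n^2) product on a random example.
rng = np.random.default_rng(0)
n = 128
c, r, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
              for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```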
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
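The GroupBERT summary above only names the idea, so here is a schematic, not GroupBERT's actual layer layout, of decoupling local and global interactions: a depthwise convolution module is placed alongside the self-attention module. The kernel size, ordering, and the attention_fn placeholder are illustrative assumptions.

```python
import numpy as np

def depthwise_conv1d(X, kernels):
    """Depthwise 1-D convolution over the sequence axis.
    X: (N, d) token representations; kernels: (k, d) with odd k,
    'same' zero padding."""
    k, d = kernels.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    N = X.shape[0]
    out = np.zeros_like(X)
    for t in range(k):
        out += Xp[t:t + N] * kernels[t]
    return out

def local_global_block(X, attention_fn, kernels):
    """Schematic block: self-attention models global interactions,
    a depthwise convolution models local ones; both use residuals."""
    X = X + attention_fn(X)               # global mixing
    X = X + depthwise_conv1d(X, kernels)  # local mixing
    return X

# Toy usage with mean-pooling standing in for self-attention.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
out = local_global_block(X, lambda x: np.tile(x.mean(0), (x.shape[0], 1)),
                         rng.normal(size=(3, 8)) * 0.1)
print(out.shape)  # (16, 8)
```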
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
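As a concrete picture of FNet's token mixing, the sketch below applies a 2-D discrete Fourier transform over the sequence and hidden dimensions and keeps the real part, which is the parameter-free sublayer that replaces self-attention; the surrounding residual and normalization details are omitted here.

```python
import numpy as np

def fourier_mixing(X):
    """FNet-style token mixing: a 2-D discrete Fourier transform over
    the sequence and hidden dimensions, keeping only the real part.
    X: (N, d) token representations; no learned parameters are used."""
    return np.fft.fft2(X).real

# The mixed output is typically wrapped with a residual connection and
# layer normalization before the feed-forward sublayer.
X = np.random.default_rng(0).normal(size=(16, 8))
print(fourier_mixing(X).shape)  # (16, 8)
```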
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
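Since the summary above only states the idea, the following is a schematic of the feedback-memory loop rather than the paper's implementation: at each step, the outputs of all layers are merged into a single memory vector, and every layer, including the lowest, reads that shared memory at later steps. The layer callables and mixing weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feedback_transformer(tokens, layers, layer_weights):
    """Schematic feedback-memory decoding.

    tokens: list of (d,) input vectors processed one step at a time.
    layers: list of callables layer(x, memory) -> (d,), each reading the
            shared memory of merged past representations.
    layer_weights: (num_layers + 1,) mixing weights that merge a step's
            input and all layer outputs into one memory vector.

    Unlike a standard transformer, lower layers at step t can see
    representations produced by higher layers at steps < t, because the
    memory merges every layer's output before it is stored.
    """
    memory, outputs = [], []
    w = softmax(layer_weights)
    for x in tokens:
        states = [x]
        for layer in layers:
            states.append(layer(states[-1], memory))
        # Merge all layer states of this step into one memory entry.
        memory.append(sum(wi * s for wi, s in zip(w, states)))
        outputs.append(states[-1])
    return outputs

# Toy usage: a layer that mixes its input with the mean of the memory.
rng = np.random.default_rng(0)
toy_layer = lambda x, memory: np.tanh(x + (np.mean(memory, axis=0) if memory else 0.0))
outs = feedback_transformer([rng.normal(size=4) for _ in range(5)],
                            [toy_layer, toy_layer], np.zeros(3))
print(len(outs), outs[0].shape)  # 5 (4,)
```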