Luna: Linear Unified Nested Attention
- URL: http://arxiv.org/abs/2106.01540v1
- Date: Thu, 3 Jun 2021 01:47:26 GMT
- Title: Luna: Linear Unified Nested Attention
- Authors: Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May,
Hao Ma, Luke Zettlemoyer
- Abstract summary: We propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function.
Compared to a more traditional attention mechanism, Luna introduces an additional sequence of fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation in linear time.
- Score: 71.66026714473482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic computational and memory complexities of the Transformer's
attention mechanism have limited its scalability for modeling long sequences.
In this paper, we propose Luna, a linear unified nested attention mechanism
that approximates softmax attention with two nested linear attention functions,
yielding only linear (as opposed to quadratic) time and space complexity.
Specifically, with the first attention function, Luna packs the input sequence
into a sequence of fixed length. Then, the packed sequence is unpacked using
the second attention function. Compared to a more traditional attention
mechanism, Luna introduces an additional sequence of fixed length as input
and an additional corresponding output, which allows Luna to perform the
attention operation in linear time while also storing adequate contextual
information. We
perform extensive evaluations on three benchmarks of sequence modeling tasks:
long-context sequence modeling, neural machine translation and masked language
modeling for large-scale pretraining. Competitive or even better experimental
results demonstrate both the effectiveness and efficiency of Luna compared to a
variety of strong baseline methods.
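
To make the pack-and-unpack description above concrete, here is a minimal single-head PyTorch sketch. The names (`x` for the length-n input, `p` for the extra fixed-length sequence of length l) and the plain scaled dot-product parameterization are illustrative assumptions; the paper itself uses multi-head attention with its own projections and activations.

```python
import torch
import torch.nn as nn

class LunaAttentionSketch(nn.Module):
    """Single-head sketch of Luna's two nested attention steps.

    pack:   the fixed-length sequence p (length l) attends over the input x
            (length n), producing a packed summary of shape (l, d) -> O(l*n)
    unpack: the input x attends over the packed summary              -> O(n*l)
    With l fixed, both steps are linear in n.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.scale = d_model ** -0.5
        # Separate projections for the two steps (an assumption for clarity).
        self.q_pack = nn.Linear(d_model, d_model)
        self.kv_pack = nn.Linear(d_model, 2 * d_model)
        self.q_unpack = nn.Linear(d_model, d_model)
        self.kv_unpack = nn.Linear(d_model, 2 * d_model)

    def attend(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        return torch.matmul(scores.softmax(dim=-1), v)

    def forward(self, x, p):
        # x: (batch, n, d) input sequence; p: (batch, l, d) with l << n.
        k, v = self.kv_pack(x).chunk(2, dim=-1)
        packed = self.attend(self.q_pack(p), k, v)        # (batch, l, d)
        k2, v2 = self.kv_unpack(packed).chunk(2, dim=-1)
        unpacked = self.attend(self.q_unpack(x), k2, v2)  # (batch, n, d)
        # The packed sequence is the "additional corresponding output":
        # it serves as p for the next layer.
        return unpacked, packed


# Usage: a length-4096 input is routed through only 16 packed slots.
layer = LunaAttentionSketch(64)
out, p_next = layer(torch.randn(2, 4096, 64), torch.randn(2, 16, 64))
print(out.shape, p_next.shape)  # (2, 4096, 64) and (2, 16, 64)
```

Because l is a constant, the pack step costs O(l·n) and the unpack step O(n·l), which is where the linear (rather than quadratic) complexity claimed above comes from.
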
Related papers
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
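
As a rough sketch of one plausible reading of "short-long convolutions", the module below pairs a depthwise short convolution (local context) with a per-channel, sequence-length-long filter applied via FFT (global context). The class name, kernel sizes, and the FFT-based long convolution are assumptions, not CHELA's published design.

```python
import torch
import torch.nn as nn

class ShortLongConvSketch(nn.Module):
    """Short depthwise conv for local detail + long FFT conv for global
    context. Illustrative only; not CHELA's actual token mixer."""

    def __init__(self, d_model: int, max_len: int, short_kernel: int = 3):
        super().__init__()
        self.short = nn.Conv1d(d_model, d_model, short_kernel,
                               padding=short_kernel // 2, groups=d_model)
        # One learned long filter per channel, as long as the sequence itself.
        self.long_filter = nn.Parameter(torch.randn(d_model, max_len) * 0.02)

    def forward(self, x):                       # x: (batch, n, d)
        b, n, d = x.shape
        u = x.transpose(1, 2)                   # (batch, d, n)
        local = self.short(u)
        # Long convolution in the frequency domain: O(n log n).
        fft_len = 2 * n
        U = torch.fft.rfft(u, n=fft_len)
        K = torch.fft.rfft(self.long_filter[:, :n], n=fft_len)
        global_mix = torch.fft.irfft(U * K, n=fft_len)[..., :n]
        return (local + global_mix).transpose(1, 2)


y = ShortLongConvSketch(64, max_len=1024)(torch.randn(2, 1024, 64))
print(y.shape)  # (2, 1024, 64)
```
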
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to mitigate the lack of long-range dependencies.
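
The sketch below shows the general idea of compressing a long sequence into a fixed-size codebook and letting tokens attend over that compressed memory; the class name and the mean-per-code pooling are assumptions and do not reproduce LongVQ's structured-memory architecture.

```python
import torch
import torch.nn as nn

class CodebookAttentionSketch(nn.Module):
    """Quantize tokens to a fixed codebook of size m and attend over the
    m memory slots instead of the full length-n sequence: O(n*m), not O(n^2).
    Illustrative only; training would need a straight-through estimator for
    the hard argmin assignment."""

    def __init__(self, d_model: int, codebook_size: int = 64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_model))
        self.scale = d_model ** -0.5

    def forward(self, x):                                   # x: (batch, n, d)
        b, n, d = x.shape
        m = self.codebook.size(0)
        # Hard-assign every token to its nearest code (vector quantization).
        dists = (x.unsqueeze(2) - self.codebook.view(1, 1, m, d)).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)                        # (batch, n)
        # Pool tokens per code into m fixed memory slots (mean per code).
        one_hot = nn.functional.one_hot(codes, m).to(x.dtype)       # (b, n, m)
        counts = one_hot.sum(dim=1).clamp(min=1.0)                  # (b, m)
        memory = torch.einsum('bnm,bnd->bmd', one_hot, x) / counts.unsqueeze(-1)
        # Tokens attend over the compressed memory.
        scores = torch.matmul(x, memory.transpose(-2, -1)) * self.scale
        return torch.matmul(scores.softmax(dim=-1), memory)


y = CodebookAttentionSketch(64)(torch.randn(2, 512, 64))
print(y.shape)  # (2, 512, 64)
```
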
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [20.78813311569383]
We present Lightning Attention-2, the first linear attention implementation that realizes linear attention's theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
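
A minimal causal sketch of that intra-/inter-block split: inside each block the (masked) quadratic form is computed directly, while all earlier blocks contribute through a running k^T v state, so the total cost stays linear in sequence length. The function name and block size are placeholders, and the sketch omits the feature maps, decay factors, and IO-aware tiling of the real Lightning Attention kernels.

```python
import torch

def blockwise_linear_attention(q, k, v, block_size=64):
    """Toy causal linear attention computed block by block.
    q, k, v: (batch, seq_len, d) with seq_len divisible by block_size."""
    b, n, d = q.shape
    out = torch.zeros_like(v)
    kv_state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    causal = torch.tril(torch.ones(block_size, block_size, device=q.device))
    for start in range(0, n, block_size):
        end = start + block_size
        qi, ki, vi = q[:, start:end], k[:, start:end], v[:, start:end]
        # Intra-block: conventional (quadratic) computation within the block,
        # masked to stay causal. No softmax -- this is plain linear attention.
        intra = torch.matmul(torch.matmul(qi, ki.transpose(-2, -1)) * causal, vi)
        # Inter-block: everything before this block, folded into kv_state
        # via the kernel trick (k^T v accumulated once, reused by every query).
        inter = torch.matmul(qi, kv_state)
        out[:, start:end] = intra + inter
        kv_state = kv_state + torch.matmul(ki.transpose(-2, -1), vi)
    return out


out = blockwise_linear_attention(torch.randn(2, 256, 32),
                                 torch.randn(2, 256, 32),
                                 torch.randn(2, 256, 32))
print(out.shape)  # (2, 256, 32)
```
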
arXiv Detail & Related papers (2024-01-09T16:27:28Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
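
The sketch below shows low-rank attention with an input-dependent (dynamic) projection that compresses keys and values to a fixed rank r, which is the general idea the summary describes; the class name and the way the projection is computed here are assumptions, not DBA's exact bilinear formulation.

```python
import torch
import torch.nn as nn

class DynamicLowRankAttentionSketch(nn.Module):
    """Attention with an input-sensitive projection of shape (n, r) that
    compresses keys/values to length r, giving O(n*r) time and memory.
    Illustrative only."""

    def __init__(self, d_model: int, rank: int = 32):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.to_proj = nn.Linear(d_model, rank)   # projection depends on x
        self.scale = d_model ** -0.5

    def forward(self, x):                          # x: (batch, n, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Dynamic projection: unlike a fixed Linformer-style matrix, it is
        # recomputed from the content of the input sequence.
        proj = self.to_proj(x).softmax(dim=1)              # (batch, n, r)
        k_low = torch.einsum('bnr,bnd->brd', proj, k)      # (batch, r, d)
        v_low = torch.einsum('bnr,bnd->brd', proj, v)
        scores = torch.matmul(q, k_low.transpose(-2, -1)) * self.scale
        return torch.matmul(scores.softmax(dim=-1), v_low)  # (batch, n, d)


y = DynamicLowRankAttentionSketch(64)(torch.randn(2, 1024, 64))
print(y.shape)  # (2, 1024, 64)
```
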
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, however, their performance varies across tasks and corpora and can drop substantially.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
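
For context, this is the generic kernel trick such linear transformers rely on: replacing softmax(Q K^T) V with phi(Q) (phi(K)^T V) so the cost is linear in sequence length. The elu+1 feature map is just one common choice; cosFormer's own design (a ReLU feature map with cosine-based re-weighting) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def kernelized_linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention via the kernel trick.
    q, k, v: (batch, n, d); cost is O(n * d^2) instead of O(n^2 * d)."""
    phi_q = F.elu(q) + 1.0          # non-negative feature map
    phi_k = F.elu(k) + 1.0
    kv = torch.einsum('bnd,bne->bde', phi_k, v)                     # (b, d, d)
    z = torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1)) + eps   # normalizer
    return torch.einsum('bnd,bde->bne', phi_q, kv) / z.unsqueeze(-1)


out = kernelized_linear_attention(torch.randn(2, 2048, 64),
                                  torch.randn(2, 2048, 64),
                                  torch.randn(2, 2048, 64))
print(out.shape)  # (2, 2048, 64)
```
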
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
- Hard Non-Monotonic Attention for Character-Level Transduction [65.17388794270694]
We introduce an exact, exponential-time algorithm for marginalizing over all non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
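
As a toy illustration of what marginalizing over hard, non-monotonic alignments means, the brute-force enumeration below sums the joint probability over every possible assignment of target steps to source positions. The scoring functions are uniform placeholders and this is not the paper's algorithm, only the quantity it computes.

```python
import itertools
import torch

def exact_marginal_likelihood(src_feats, tgt_ids, emit, attend):
    """Sum p(y, a | x) over all |src|^|tgt| hard alignments a, where each
    target step attends to exactly one source position. emit and attend are
    placeholder scoring functions, not the paper's model."""
    n_src = src_feats.size(0)
    total = 0.0
    for alignment in itertools.product(range(n_src), repeat=len(tgt_ids)):
        prob, prev = 1.0, None
        for t, pos in enumerate(alignment):
            prob *= attend(prev)[pos].item()                 # pick a source position
            prob *= emit(src_feats[pos])[tgt_ids[t]].item()  # emit the target symbol
            prev = pos
        total += prob
    return total


# Tiny usage with uniform placeholder distributions over 5 symbols.
src = torch.randn(3, 8)
uniform_emit = lambda feat: torch.full((5,), 1.0 / 5)
uniform_attend = lambda prev: torch.full((src.size(0),), 1.0 / src.size(0))
print(exact_marginal_likelihood(src, [1, 4], uniform_emit, uniform_attend))  # 0.04
```
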
arXiv Detail & Related papers (2018-08-29T20:00:20Z)