Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning
- URL: http://arxiv.org/abs/2410.20926v1
- Date: Mon, 28 Oct 2024 11:08:57 GMT
- Title: Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning
- Authors: Aosong Feng, Rex Ying, Leandros Tassiulas,
- Abstract summary: We propose to scale up the attention field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension.
We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention.
- Score: 20.51822826798248
- License:
- Abstract: As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based model is the mismatch between the limited-range modeling power of full attention and the long-range token dependency in the input sequence. In this work, we propose to scale up the attention receptive field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension. The resulting Tensorized Attention can be adopted as efficient transformer backbones to extend input context length with improved memory and time efficiency. We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention. Extensive experiments show that tensorized attention can be used to adapt pretrained LLMs with improved efficiency. Notably, Llama-8B with tensorization is trained under 32,768 context length and can steadily extrapolate to 128k length during inference with $11\times$ speedup, compared to full attention with FlashAttention-2.
Related papers
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z) - Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z) - Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers [18.331374727331077]
Time complexity of tensor attention is a significant obstacle to its utilization in transformers.
We prove that the backward gradient of attention training can be computed in almost linear time.
arXiv Detail & Related papers (2024-05-26T02:59:13Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z) - Efficient Long-Range Transformers: You Need to Attend More, but Not
Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z) - Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z) - Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model is able to capture long-range dependencies and latent high-level features can be distilled by our model.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z) - Hard Non-Monotonic Attention for Character-Level Transduction [65.17388794270694]
We introduce an exact, exponential-time algorithm for marginalizing over a number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
arXiv Detail & Related papers (2018-08-29T20:00:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.