Efficient Long-Range Transformers: You Need to Attend More, but Not
Necessarily at Every Layer
- URL: http://arxiv.org/abs/2310.12442v1
- Date: Thu, 19 Oct 2023 03:32:05 GMT
- Title: Efficient Long-Range Transformers: You Need to Attend More, but Not
Necessarily at Every Layer
- Authors: Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao
- Abstract summary: We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters can achieve performance competitive with vanilla transformers that use full attention.
- Score: 36.75562615596186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained transformer models have demonstrated remarkable performance across
various natural language processing tasks. These models leverage the attention
mechanism to capture long- and short-range dependencies in the sequence.
However, the (full) attention mechanism incurs high computational cost -
quadratic in the sequence length, which is not affordable in tasks with long
sequences, e.g., inputs with 8k tokens. Although sparse attention can be used
to improve computational efficiency, as suggested in existing work, it has
limited modeling capacity and often fails to capture complicated dependencies
in long sequences. To tackle this challenge, we propose MASFormer, an
easy-to-implement transformer variant with Mixed Attention Spans. Specifically,
MASFormer is equipped with full attention to capture long-range dependencies,
but only at a small number of layers. For the remaining layers, MASFormer only
employs sparse attention to capture short-range dependencies. Our experiments
on natural language modeling and generation tasks show that a decoder-only
MASFormer model with 1.3B parameters can achieve performance competitive with
vanilla transformers that use full attention, while reducing computational
cost by up to 75%. Additionally, we investigate the effectiveness
of continual training with long sequence data and how sequence length impacts
downstream generation performance, which may be of independent interest.
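To make the layer layout concrete, here is a minimal PyTorch sketch of the mixed-attention-span idea, not the authors' implementation: a decoder-style stack in which only a chosen subset of layers applies full causal attention while the remaining layers use a causal sliding-window mask. The class and parameter names, depth, window size, and which layers receive full attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len, window=None, device=None):
    """Boolean attention mask where True means 'blocked'.  Full causal
    attention if window is None, otherwise a causal sliding-window mask."""
    i = torch.arange(seq_len, device=device)[:, None]
    j = torch.arange(seq_len, device=device)[None, :]
    blocked = j > i                        # never attend to future tokens
    if window is not None:
        blocked |= (i - j) >= window       # block tokens beyond the local span
    return blocked

class MixedSpanDecoder(nn.Module):
    """Decoder-only stack in which only the layers listed in `full_layers`
    use full attention; every other layer uses local (windowed) attention.
    Depth, width, window size, and layer placement are illustrative."""
    def __init__(self, d_model=512, n_heads=8, n_layers=12,
                 full_layers=(11,), window=256):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                       batch_first=True, norm_first=True)
            for _ in range(n_layers)])
        self.spans = [None if i in full_layers else window
                      for i in range(n_layers)]

    def forward(self, x):                  # x: (batch, seq, d_model)
        for block, span in zip(self.blocks, self.spans):
            mask = causal_mask(x.size(1), span, x.device)
            x = block(x, src_mask=mask)
        return x
```

In this sketch the single full-attention layer sits at the top of the stack purely for illustration; the abstract's point is that only a few such layers are needed, not that this particular placement is the one the paper uses.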
Related papers
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
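As a rough illustration of the selection idea described above, the following is a simplified single-head module in which a scoring network ranks keys and a hard top-k keeps a constant number of KV pairs; the actual SPARSEK operator is differentiable and selects per query, so this sketch only conveys the general idea. All names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class TopKSparseAttention(nn.Module):
    """Simplified single-head sketch of score-based KV selection: a small
    scoring network ranks the keys and attention is restricted to the top-k
    of them.  SPARSEK instead uses a differentiable top-k mask and selects
    per query; this hard, per-sequence version is illustrative only."""
    def __init__(self, d_model=512, k=64):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_model)
        self.to_kv = nn.Linear(d_model, 2 * d_model)
        self.score = nn.Linear(d_model, 1)     # illustrative scoring network
        self.k = k

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, n, d = x.shape
        q = self.to_q(x)
        key, val = self.to_kv(x).chunk(2, dim=-1)
        # Rank keys with the scoring network and keep only the best k.
        scores = self.score(key).squeeze(-1)                  # (batch, seq)
        idx = scores.topk(min(self.k, n), dim=-1).indices     # (batch, k)
        gather_idx = idx[..., None].expand(-1, -1, d)
        key = key.gather(1, gather_idx)                       # (batch, k, d)
        val = val.gather(1, gather_idx)
        attn = torch.softmax(q @ key.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ val                                     # (batch, seq, d)
```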
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
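The summary above centers on evicting cached states under a fixed memory budget. Below is a generic, heavily simplified sketch of attention-score-based KV-cache eviction; it does not model the chunked, instruction-aware scoring that distinguishes CItruS, and the function name and shapes are assumptions.

```python
import torch

def evict_kv(keys, values, attn_scores, budget):
    """Generic cache-eviction sketch: keep only the `budget` cached key/value
    pairs that received the most attention (summed over a set of probe
    queries).  CItruS additionally folds task-instruction preferences into
    this scoring; that part is not modeled here."""
    # attn_scores: (n_probe_queries, cache_len) attention weights over the cache
    importance = attn_scores.sum(dim=0)                      # (cache_len,)
    keep = importance.topk(min(budget, keys.size(0))).indices.sort().values
    return keys[keep], values[keep]

# Illustrative usage (shapes are assumptions):
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
probe_attn = torch.softmax(torch.randn(8, 1024), dim=-1)
k_small, v_small = evict_kv(k, v, probe_attn, budget=256)
```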
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps address the lack of long-range dependency modeling.
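For intuition about the length-fixed codebook, here is a plain vector-quantization sketch (standard VQ with a straight-through estimator), not LongVQ's structured-memory design; the codebook size, dimensions, and names are assumptions.

```python
import torch
import torch.nn as nn

class FixedCodebookMemory(nn.Module):
    """Generic vector-quantization sketch: every hidden state is snapped to
    its nearest entry in a length-fixed codebook, giving a compressed global
    summary the model can attend to.  This is plain VQ, not LongVQ itself."""
    def __init__(self, d_model=512, codebook_size=128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_model))

    def forward(self, h):                          # h: (batch, seq, d_model)
        # Squared distance from every token state to every codebook vector.
        dist = ((h.unsqueeze(2) - self.codebook) ** 2).sum(-1)  # (b, seq, C)
        nearest = dist.argmin(dim=-1)              # (batch, seq) code indices
        quantized = self.codebook[nearest]         # (batch, seq, d_model)
        # Straight-through estimator so gradients still reach the encoder.
        return h + (quantized - h).detach(), nearest
```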
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
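The following single-process simulation sketches the ring schedule: each simulated device owns one query block, key/value blocks rotate around the ring, and a streaming softmax accumulates the result so no device ever holds attention over the full sequence. It is illustrative only (non-causal, single head, no real inter-device communication); names and shapes are assumptions.

```python
import torch

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of the ring schedule: 'device' i owns one
    query block, and key/value blocks arrive one per step as they rotate
    around the ring.  A numerically stable streaming softmax accumulates
    the attention output block by block."""
    n_dev = len(q_blocks)
    outs = []
    for i in range(n_dev):
        q = q_blocks[i]                                   # (block, d)
        acc = torch.zeros_like(q)
        denom = torch.zeros(q.size(0), 1)
        running_max = torch.full((q.size(0), 1), float('-inf'))
        for step in range(n_dev):                         # KV block at this step
            j = (i + step) % n_dev
            s = q @ k_blocks[j].T / q.size(-1) ** 0.5     # (block, block)
            new_max = torch.maximum(running_max,
                                    s.max(dim=-1, keepdim=True).values)
            scale = torch.exp(running_max - new_max)      # rescale old partials
            p = torch.exp(s - new_max)
            acc = acc * scale + p @ v_blocks[j]
            denom = denom * scale + p.sum(dim=-1, keepdim=True)
            running_max = new_max
        outs.append(acc / denom)
    return torch.cat(outs, dim=0)

# Illustrative usage: 4 simulated devices, 128 tokens each (shapes are assumptions).
blocks = [torch.randn(128, 64) for _ in range(4)]
out = ring_attention_sim(blocks, blocks, blocks)          # (512, 64)
```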
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
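A minimal sketch of the blockwise idea, not the paper's fused kernel: queries are processed one block at a time and the feedforward network is applied to each block inside the same loop, so full-sequence FFN activations are never materialized. Module choices and sizes are assumptions.

```python
import torch
import torch.nn as nn

def blockwise_attn_ffn(x, attn, ffn, block=512):
    """Process one query block at a time and apply the feedforward network to
    that block immediately, so activations are only held for one block.
    `attn` and `ffn` are ordinary modules; the memory accounting of the real
    BPT kernel is not modeled here."""
    outs = []
    for start in range(0, x.size(1), block):
        q = x[:, start:start + block]
        # The query block attends to the full sequence; keys/values could
        # also be streamed blockwise, as in the paper, for further savings.
        h, _ = attn(q, x, x, need_weights=False)
        outs.append(q + ffn(h))                 # fuse the FFN into the same pass
    return torch.cat(outs, dim=1)

# Illustrative usage with standard PyTorch modules (sizes are assumptions):
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
ffn = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
y = blockwise_attn_ffn(torch.randn(2, 4096, 256), attn, ffn)
```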
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies [0.0]
We compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation.
We then demonstrate that long context length does yield better performance, albeit application-dependent.
Inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies.
arXiv Detail & Related papers (2023-02-13T09:47:31Z)
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We incorporate an SSM into the bottom layer of SPADE and employ efficient local attention methods in the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
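To illustrate the layout, here is a toy stack, not the SPADE implementation, with a simple diagonal SSM as the bottom (global) layer and causal sliding-window attention layers above it; the SSM parameterization, depth, width, and window size are all assumptions.

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer, h_t = a * h_{t-1} + B x_t, y_t = C h_t,
    used here only as a stand-in for the SSM in SPADE."""
    def __init__(self, d_model=256, d_state=64):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(d_state))   # a = sigmoid(decay)
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)

    def forward(self, x):                          # x: (batch, seq, d_model)
        a = torch.sigmoid(self.decay)              # per-channel decay in (0, 1)
        u = self.B(x)
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        ys = []
        for t in range(x.size(1)):                 # sequential scan: clear, not fast
            h = a * h + u[:, t]
            ys.append(self.C(h))
        return x + torch.stack(ys, dim=1)          # residual connection

class SpadeLikeStack(nn.Module):
    """Global SSM at the bottom, windowed (local) attention everywhere else."""
    def __init__(self, d_model=256, n_heads=4, n_layers=6, window=128):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.local = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                       batch_first=True, norm_first=True)
            for _ in range(n_layers - 1)])
        self.window = window

    def forward(self, x):
        n = x.size(1)
        i = torch.arange(n, device=x.device)[:, None]
        j = torch.arange(n, device=x.device)[None, :]
        mask = (j > i) | ((i - j) >= self.window)  # causal sliding-window mask
        x = self.ssm(x)                            # long-range mixing first
        for layer in self.local:
            x = layer(x, src_mask=mask)
        return x
```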
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.