EL-Attention: Memory Efficient Lossless Attention for Generation
- URL: http://arxiv.org/abs/2105.04779v1
- Date: Tue, 11 May 2021 04:37:52 GMT
- Title: EL-Attention: Memory Efficient Lossless Attention for Generation
- Authors: Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan and Ruofei Zhang
- Abstract summary: We propose memory-efficient lossless attention (called EL-attention) to address the memory cost of caching intermediate results during generation.
It avoids the heavy operations of building multi-head keys and values and requires no cache.
We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks.
- Score: 27.59275177303199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents the use of larger batch sizes for faster decoding. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations of building multi-head keys and values and requires no cache. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
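The trick in the abstract can be made concrete: fold the key projection into the query so the hidden states themselves serve as the shared key and value for every head. Below is a minimal PyTorch sketch of that reading of the abstract; the weight names, shapes, and single-step decoding setup are our assumptions, not the paper's reference implementation.

```python
import math
import torch

def el_attention(q, x, w_q, w_k, w_v, w_o):
    """One decoding step of EL-style attention (sketch).

    q:   (batch, 1, d_model)        current step's query input
    x:   (batch, src_len, d_model)  hidden states, shared across heads
                                    (no per-head K/V cache is built)
    w_q, w_k, w_v: (n_heads, d_model, d_head) per-head projections
    w_o: (n_heads, d_head, d_model) per-head slices of the output projection
    """
    n_heads, _, d_head = w_q.shape
    out = torch.zeros_like(q)
    for i in range(n_heads):
        # Fold the key projection into the query:
        # (q W_q^i)(x W_k^i)^T == (q W_q^i (W_k^i)^T) x^T,
        # so the expanded query attends directly over the shared x.
        q_ex = q @ w_q[i] @ w_k[i].transpose(0, 1)           # (batch, 1, d_model)
        scores = q_ex @ x.transpose(1, 2) / math.sqrt(d_head)
        probs = torch.softmax(scores, dim=-1)                # (batch, 1, src_len)
        # Use x itself as the shared value; fold W_v^i and W_o^i into
        # one small matrix applied after the weighted sum.
        out = out + (probs @ x) @ (w_v[i] @ w_o[i])          # (batch, 1, d_model)
    return out
```

Mathematically this reproduces bias-free multi-head attention while storing only the hidden states x, which is where the memory saving and the larger feasible batch sizes come from.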
Related papers
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
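A toy sketch of query-driven channel pruning in the spirit of this summary: score each channel of the cached keys by how strongly recent queries interact with it, and keep only the top fraction. The scoring rule, names, and keep ratio are illustrative assumptions, not ThinK's exact criterion.

```python
import torch

def prune_key_channels(k_cache, recent_q, keep_ratio=0.6):
    """k_cache: (seq_len, d_head); recent_q: (n_recent, d_head).
    Returns the channel-pruned key cache and the kept channel indices."""
    # Query-driven importance: channels where recent queries have large
    # magnitude and keys have large norm contribute most to q.k scores.
    importance = recent_q.abs().mean(dim=0) * k_cache.norm(dim=0)
    n_keep = max(1, int(keep_ratio * k_cache.shape[1]))
    kept = importance.topk(n_keep).indices.sort().values
    return k_cache[:, kept], kept

# At score time, index the query with the same channels, e.g.:
# scores = (q[kept] @ pruned_k.T) / math.sqrt(d_head)  # approximate scores
```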
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
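A toy sketch of importance-driven cache merging as described above: keep the highest-importance cache entries as anchors, and merge each pruned entry into its nearest anchor instead of discarding it. The running-average merge rule and the importance signal are our assumptions for illustration.

```python
import torch

def merge_kv_cache(k, v, importance, keep_ratio=0.5):
    """k, v: (seq_len, d); importance: (seq_len,), e.g. accumulated
    attention mass (or, for instruction tokens, how often they are hit).
    Returns a shorter cache in which pruned entries are merged into
    their nearest kept neighbor rather than discarded."""
    seq_len = k.shape[0]
    n_keep = max(1, int(keep_ratio * seq_len))
    kept = importance.topk(n_keep).indices.sort().values
    kept_set = set(kept.tolist())
    k_new, v_new = k[kept].clone(), v[kept].clone()
    counts = torch.ones(n_keep)
    for p in range(seq_len):
        if p in kept_set:
            continue
        # Merge the pruned entry into its nearest kept anchor as a
        # running average, so its contribution is not dropped outright.
        j = torch.argmin((kept - p).abs()).item()
        k_new[j] = (k_new[j] * counts[j] + k[p]) / (counts[j] + 1)
        v_new[j] = (v_new[j] * counts[j] + v[p]) / (counts[j] + 1)
        counts[j] += 1
    return k_new, v_new
```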
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make a somewhat surprising observation: the computation of a large proportion of layers in a diffusion transformer can be removed via a caching mechanism, even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
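One way to read "learns to conduct caching in a dynamic manner" is a learned per-step, per-layer gate that decides whether a layer's residual output from the previous denoising step can be reused. The gate parameterization and hard threshold below are illustrative assumptions, and the sketch shows only the inference-time behavior.

```python
import torch

class LayerCacheRouter(torch.nn.Module):
    """Learnable per-(denoising step, layer) gate deciding whether to
    reuse the layer's residual output from the previous step (sketch)."""
    def __init__(self, n_steps, n_layers):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_steps, n_layers))

    def forward(self, x, layers, step, cache, threshold=0.5):
        # layers: modules where layer(x) returns x + f(x);
        # cache[i] holds layer i's residual f(x) from the previous step.
        for i, layer in enumerate(layers):
            if step > 0 and torch.sigmoid(self.logits[step, i]) > threshold:
                x = x + cache[i]        # reuse: skip this layer's compute
            else:
                out = layer(x)          # recompute and refresh the cache
                cache[i] = out - x
                x = out
        return x
```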
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce a method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
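A simplified sketch of eviction guided by recent queries, in the spirit of this summary: always keep a recent window, and keep older entries only if some recent query still attends to them above a threshold. The window size, threshold, and interface are illustrative assumptions.

```python
import torch

def evict_kv(k, v, attn_probs, window=64, threshold=0.01):
    """k, v: (seq_len, d); attn_probs: (n_recent_queries, seq_len),
    attention from the most recent queries over all cached positions.
    Keeps the recent window plus any older position that some recent
    query still attends to above the threshold."""
    seq_len = k.shape[0]
    recent = torch.arange(max(0, seq_len - window), seq_len)
    strong = (attn_probs.max(dim=0).values > threshold).nonzero().squeeze(-1)
    kept = torch.unique(torch.cat([recent, strong]))
    return k[kept], v[kept], kept
```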
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT.
We find that the speed of existing transformer models is commonly bounded by memory inefficient operations.
To address this, we present a cascaded group attention module feeding attention heads with different splits.
arXiv Detail & Related papers (2023-05-11T17:59:41Z)
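A minimal sketch of cascaded group attention as we read this summary: the input channels are split across heads, and each head's output is added to the next head's split before that head computes attention. The projection shapes and scaling are illustrative assumptions.

```python
import torch

class CascadedGroupAttention(torch.nn.Module):
    """Each head attends over a different channel split of the input;
    head i's output is cascaded into head i+1's split (sketch)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d = n_heads, dim // n_heads
        self.qkv = torch.nn.ModuleList(
            [torch.nn.Linear(self.d, 3 * self.d) for _ in range(n_heads)])
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, n_tokens, dim)
        splits = x.chunk(self.n_heads, dim=-1)   # one channel split per head
        outs, carry = [], 0
        for head, split in zip(self.qkv, splits):
            q, k, v = head(split + carry).chunk(3, dim=-1)
            attn = torch.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
            carry = attn @ v                     # cascaded into the next split
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=-1))
```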
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
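The core idea can be illustrated in plain PyTorch: compute exact attention block-by-block over the keys and values with a running (online) softmax, so the full score matrix is never materialized. The real algorithm gains its speed by doing this tiling inside a fused kernel that keeps tiles in on-chip SRAM, which this sketch does not capture.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Exact attention over key/value blocks with an online softmax.
    q: (n_q, d); k, v: (n_kv, d). Numerically equal to full softmax
    attention, but only one (n_q, block) score tile exists at a time."""
    d = q.shape[-1]
    scale = d ** -0.5
    out = torch.zeros_like(q)                          # unnormalized output
    m = torch.full((q.shape[0], 1), float("-inf"))     # running row max
    l = torch.zeros(q.shape[0], 1)                     # running denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                         # (n_q, block) tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        alpha = torch.exp(m - m_new)                   # rescale old statistics
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        out = out * alpha + p @ vb
        m = m_new
    return out / l
```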
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)