AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
- URL: http://arxiv.org/abs/2301.09262v2
- Date: Mon, 17 Apr 2023 20:06:38 GMT
- Title: AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
- Authors: Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, and
Dong Li
- Abstract summary: We introduce a novel embedding technique to find semantically similar inputs to identify computation similarity.
We enable 22% inference-latency reduction on average (up to 68%) with negligible loss in inference accuracy.
- Score: 10.585040856070941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have gained popularity because of their superior inference accuracy and throughput. However, the transformer is computation-intensive, causing long inference times. Existing work on transformer inference acceleration is limited either by modifications to the transformer architecture or by the need for specialized hardware. In this paper, we identify opportunities to use memoization to accelerate the self-attention mechanism in transformers without those limitations. Building on the observation that there is rich similarity in attention computation across inference sequences, we build a memoization database that leverages emerging big memory systems. We introduce a novel embedding technique that finds semantically similar inputs in order to identify computation similarity. We also introduce a series of techniques, such as memory mapping and selective memoization, to avoid memory copies and unnecessary overhead. We achieve a 22% inference-latency reduction on average (up to 68%) with negligible loss in inference accuracy.
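The core idea lends itself to a compact illustration: embed each incoming sequence, search a memoization table for a near neighbor, and reuse the cached attention output on a hit. The sketch below is only a toy rendition of that idea, not the paper's system; the mean-pooled embedding, the Euclidean threshold, and the linear scan over cached entries are simplifying assumptions (the paper's embedding technique, big-memory database, memory mapping, and selective memoization are not modeled here).

```python
# Illustrative sketch of embedding-based memoization for self-attention.
# NOT the AttMEMO implementation; names and thresholds are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

class AttentionMemoizer:
    """Caches attention outputs keyed by a low-dimensional input embedding."""
    def __init__(self, threshold=0.05):
        self.keys = []       # embeddings of previously seen inputs
        self.values = []     # cached attention outputs
        self.threshold = threshold

    def embed(self, x):
        # Hypothetical cheap embedding: mean-pool the token vectors.
        return x.mean(axis=0)

    def lookup_or_compute(self, x, wq, wk, wv):
        e = self.embed(x)
        if self.keys:
            dists = [np.linalg.norm(e - k) for k in self.keys]
            i = int(np.argmin(dists))
            if dists[i] < self.threshold:      # semantically similar input found
                return self.values[i]          # reuse the memoized attention output
        out = self_attention(x @ wq, x @ wk, x @ wv)   # cache miss: compute
        self.keys.append(e)
        self.values.append(out)
        return out
```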
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
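For reference, differential attention can be sketched as the difference of two softmax attention maps applied to a shared value projection. The single head, the fixed lambda, and the weight names below are simplifying assumptions; the paper learns lambda via a re-parameterization and uses a multi-head form.

```python
# Toy single-head differential attention: the difference of two softmax
# attention maps cancels common-mode attention noise. A fixed `lam` is a
# simplification; the paper learns lambda through a re-parameterization.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    d = wq1.shape[1]                                    # per-map head dimension
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    return (a1 - lam * a2) @ (x @ wv)                   # differential attention output
```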
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
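The caching idea itself is easy to picture: across adjacent denoising steps, some layer outputs barely change and can be reused rather than recomputed. The sketch below is a generic, hand-rolled version of that idea; the boolean `cacheable` mask, the residual-style update, and the dictionary cache are assumptions, whereas L2C learns which layers to cache.

```python
# Generic layer-output caching across diffusion denoising steps (not the L2C
# router): layers flagged in `cacheable` reuse the output saved at the previous
# step instead of being re-executed. The residual-style update is an assumption.
def cached_transformer_step(layers, x, cacheable, cache):
    h = x
    for i, layer in enumerate(layers):
        if cacheable[i] and i in cache:
            out = cache[i]            # reuse the previous step's layer output
        else:
            out = layer(h)            # recompute and refresh the cache
            cache[i] = out
        h = h + out                   # residual connection (simplified)
    return h

# Usage sketch: call once per denoising step, keeping `cache` across steps.
```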
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation [25.577132500246886]
We present AtMan, which provides explanations of generative transformer models at almost no extra cost.
AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input.
Our experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics.
arXiv Detail & Related papers (2023-01-19T15:01:00Z)
- Recurrent Memory Transformer [0.3529736140137003]
We study a memory-augmented, segment-level recurrent Transformer (Recurrent Memory Transformer).
We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence.
Our model performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing.
arXiv Detail & Related papers (2022-07-14T13:00:22Z)
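The memory mechanism can be pictured as plain token-level plumbing around an unmodified Transformer: memory tokens are attached to each segment's input, and the positions they occupy in the output are carried into the next segment. The read/write placement, the zero initialization, and the `transformer` callable below are illustrative assumptions.

```python
# Sketch of segment-level recurrence with special memory tokens (RMT-style).
# `transformer` stands in for any unmodified Transformer encoder; only the
# input/output sequences are changed by attaching memory tokens.
import numpy as np

def process_long_sequence(transformer, segments, num_mem, d_model):
    memory = np.zeros((num_mem, d_model))           # initial memory tokens
    outputs = []
    for seg in segments:                            # seg: (seg_len, d_model)
        x = np.concatenate([memory, seg, memory])   # [read mem; segment; write mem]
        y = transformer(x)                          # vanilla Transformer, unchanged
        outputs.append(y[num_mem:num_mem + len(seg)])
        memory = y[-num_mem:]                       # carry updated memory forward
    return np.concatenate(outputs)
```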
- Couplformer: Rethinking Vision Transformer with Coupling Attention Map [7.789667260916264]
The Transformer model has demonstrated outstanding performance in the computer vision domain.
We propose a novel memory economy attention mechanism named Couplformer, which decouples the attention map into two sub-matrices.
Experiments show that the Couplformer can reduce memory consumption by 28% compared with the regular Transformer.
arXiv Detail & Related papers (2021-12-10T10:05:35Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
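The memory-saving trick can be illustrated with a custom autograd function that uses the exact activation in the forward pass but saves only an 8-bit quantized copy for the backward pass. The per-tensor scale, the ReLU example, and the simple rounding below are assumptions; Mesa's actual quantization and supported operators are more elaborate.

```python
# Sketch of memory-saving training: exact forward, low-precision saved activations.
# Simplified per-tensor 8-bit quantization; Mesa's real scheme is more elaborate.
import torch

class LowPrecisionReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.relu(x)                            # exact activation used downstream
        scale = y.abs().max().clamp(min=1e-8) / 127.0
        ctx.scale = scale
        ctx.save_for_backward((y / scale).round().to(torch.int8))  # 8-bit copy
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y_q,) = ctx.saved_tensors
        y = y_q.float() * ctx.scale                  # dequantize for the backward pass
        return grad_out * (y > 0).float()            # ReLU gradient from the approximation

x = torch.randn(4, 8, requires_grad=True)
LowPrecisionReLU.apply(x).sum().backward()           # usage: gradients flow as usual
```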
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
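The conversion hinges on replacing softmax attention with a feature-map (kernel) attention, which can be evaluated as a recurrence with a constant-size state. The ReLU-style feature map below is an assumed stand-in for the small learned feature map used in that line of work, and only the single-head recurrence is shown.

```python
# Linear (kernelized) attention run as an RNN: constant-size state per step
# instead of attending over the full history. `phi` is an assumed feature map.
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1e-6                 # simple positive feature map

def recurrent_linear_attention(queries, keys, values):
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))                         # running sum of phi(k) v^T
    z = np.zeros(d_k)                                # running sum of phi(k)
    outs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        S += np.outer(fk, v)
        z += fk
        fq = phi(q)
        outs.append(fq @ S / (fq @ z))               # autoregressive attention output
    return np.stack(outs)
```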
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
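The cascade can be sketched as interleaving partial transformer stacks with pruning: score all candidates after a few layers, keep only the strongest fraction, and let the survivors continue through the deeper layers. `encode_layers`, `score`, the cut points, and the keep ratio below are illustrative stand-ins, not the paper's API.

```python
# Sketch of a cascade of rankers: candidates are scored after partial stacks of
# transformer layers and weak ones are dropped before deeper layers run.
# `encode_layers(state, lo, hi)` and `score(state)` are hypothetical stand-ins.
def cascade_rank(candidates, encode_layers, score, cut_points=(4, 8, 12), keep_ratio=0.5):
    states = dict(enumerate(candidates))      # candidate id -> encoder state
    lo = 0
    for hi in cut_points:
        states = {i: encode_layers(s, lo, hi) for i, s in states.items()}  # run layers [lo, hi)
        ranked = sorted(states, key=lambda i: score(states[i]), reverse=True)
        keep = max(1, int(len(ranked) * keep_ratio))
        states = {i: states[i] for i in ranked[:keep]}                     # prune weak candidates
        lo = hi
    return sorted(states, key=lambda i: score(states[i]), reverse=True)    # final ranking of survivors
```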