AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
- URL: http://arxiv.org/abs/2301.08110v5
- Date: Sun, 5 Nov 2023 14:16:21 GMT
- Title: AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
- Authors: Bj\"orn Deiseroth, Mayukh Deb, Samuel Weinbach, Manuel Brack, Patrick
Schramowski, Kristian Kersting
- Abstract summary: We present AtMan, which provides explanations of generative transformer models at almost no extra cost.
AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input.
Our experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics.
- Score: 25.577132500246886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative transformer models have become increasingly complex, with large
numbers of parameters and the ability to process multiple input modalities.
Current methods for explaining their predictions are resource-intensive. Most
crucially, they require prohibitively large amounts of extra memory, since they
rely on backpropagation, which allocates almost twice as much GPU memory as the
forward pass. This makes it difficult, if not impossible, to use them in
production. We present AtMan, which provides explanations of generative
transformer models at almost no extra cost. Specifically, AtMan is a
modality-agnostic perturbation method that manipulates the attention mechanisms
of transformers to produce relevance maps for the input with respect to the
output prediction. Instead of using backpropagation, AtMan applies a
parallelizable token-based search method based on cosine similarity
neighborhood in the embedding space. Our exhaustive experiments on text and
image-text benchmarks demonstrate that AtMan outperforms current
state-of-the-art gradient-based methods on several metrics while being
computationally efficient. As such, AtMan is suitable for use in large model
inference deployments.
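For concreteness, here is a minimal, self-contained PyTorch sketch of the perturbation idea described in the abstract: suppress the attention paid to one input token, re-run the forward pass, and read the increase in the target token's cross-entropy as that token's relevance. The single toy attention layer, the dimensions, and the suppression factor are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim, vocab = 6, 16, 32
x = torch.randn(seq_len, dim)       # toy embeddings of the input ("prompt") tokens
lm_head = torch.randn(dim, vocab)   # toy output projection
target = torch.tensor([7])          # toy id of the predicted next token

def forward_loss(suppress_idx=None, factor=0.9):
    # One toy attention layer plus LM head; returns the cross-entropy of `target`.
    # When suppress_idx is set, the pre-softmax score of every query towards that
    # token is scaled by (1 - factor); this single-layer manipulation stands in
    # for applying the suppression across the layers of a real model.
    scores = x @ x.T / dim ** 0.5
    if suppress_idx is not None:
        scores[:, suppress_idx] *= (1.0 - factor)
    h = torch.softmax(scores, dim=-1) @ x     # attended hidden states
    logits = h[-1] @ lm_head                  # predict from the last position
    return F.cross_entropy(logits.unsqueeze(0), target)

base = forward_loss()                         # unperturbed forward pass
relevance = []
for i in range(seq_len):
    # A cosine-similarity neighbourhood search in embedding space would add the
    # tokens most similar to token i to the suppressed set (correlated token
    # suppression); this sketch perturbs one token at a time.
    relevance.append((forward_loss(suppress_idx=i) - base).item())

print(relevance)  # larger loss increase = token more relevant to the prediction

Because every perturbed run is an ordinary forward pass, the per-token evaluations can be batched and parallelized, which is why no backpropagation memory overhead is incurred.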
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise (a minimal sketch of this differential attention appears after this list).
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants [39.00433193973159]
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
arXiv Detail & Related papers (2023-06-14T17:59:02Z)
- Scaling Transformer to 1M tokens and beyond with RMT [5.60052250541419]
A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size.
In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute.
Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy.
arXiv Detail & Related papers (2023-04-19T16:18:54Z)
- AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems [10.585040856070941]
We introduce a novel embedding technique to find semantically similar inputs to identify computation similarity.
We enable 22% inference-latency reduction on average (up to 68%) with negligible loss in inference accuracy.
arXiv Detail & Related papers (2023-01-23T04:24:26Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
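As referenced in the Diff Transformer entry above, the following is a hedged sketch of a differential attention step: two independent softmax attention maps are computed and subtracted so that noise present in both cancels. The shapes, random initialization, and the fixed lambda value are illustrative assumptions; in the paper, lambda is a learnable, reparameterized per-layer scalar.

import torch

torch.manual_seed(0)
seq, d = 8, 32
X = torch.randn(seq, d)                      # toy token representations

# Two independent query/key projections and one value projection.
Wq1, Wq2 = torch.randn(d, d), torch.randn(d, d)
Wk1, Wk2 = torch.randn(d, d), torch.randn(d, d)
Wv = torch.randn(d, d)
lam = 0.5                                    # fixed here for illustration only

def diff_attention(X):
    q1, q2 = X @ Wq1, X @ Wq2
    k1, k2 = X @ Wk1, X @ Wk2
    v = X @ Wv
    a1 = torch.softmax(q1 @ k1.T / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.T / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v               # shared (noisy) attention mass cancels

print(diff_attention(X).shape)               # torch.Size([8, 32])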