Related papers: Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

URL: http://arxiv.org/abs/2012.02030v3
Date: Fri, 17 May 2024 13:30:15 GMT
Title: Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks
Authors: Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljačić,
Abstract summary: We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality.
Score: 33.07113523598028
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.

Related papers

FORCE: Feature-Oriented Representation with Clustering and Explanation [0.0]
We propose a SHAP based supervised deep learning framework FORCE. It relies on two-stage usage of SHAP values in the neural network architecture. We show that FORCE led to dramatic improvements in overall performance as compared to networks that did not incorporate the latent feature and attention framework.
arXiv Detail & Related papers (2025-04-07T22:05:50Z)
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning [48.90556054777393]
Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution.
arXiv Detail & Related papers (2025-04-06T00:37:36Z)
Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information. It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z)
Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
arXiv Detail & Related papers (2024-10-04T05:48:02Z)
Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture. We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks. We show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem. We propose two new attentions: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH)
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
Self-STORM: Deep Unrolled Self-Supervised Learning for Super-Resolution Microscopy [55.2480439325792]
We introduce deep unrolled self-supervised learning, which alleviates the need for such data by training a sequence-specific, model-based autoencoder. Our proposed method exceeds the performance of its supervised counterparts.
arXiv Detail & Related papers (2024-03-25T17:40:32Z)
Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large- Kernel convolution. LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
arXiv Detail & Related papers (2024-01-11T08:40:35Z)
Self-Supervised Implicit Attention: Guided Attention by The Model Itself [1.3406858660972554]
We propose Self-Supervised Implicit Attention (SSIA), a new approach that adaptively guides deep neural network models to gain attention by exploiting the properties of the models themselves. SSIAA is a novel attention mechanism that does not require any extra parameters, computation, or memory access costs during inference. Our implementation will be available on GitHub.
arXiv Detail & Related papers (2022-06-15T10:13:34Z)
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval [51.53892300802014]
We show that supervised neural information retrieval models are prone to learning sparse attention patterns over passage tokens. Using a novel targeted synthetic data generation method, we teach neural IR to attend more uniformly and robustly to all entities in a given passage.
arXiv Detail & Related papers (2022-04-24T22:36:48Z)
Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
Cost-effective Interactive Attention Learning with Neural Attention Processes [79.8115563067513]
We propose a novel interactive learning framework which we refer to as Interactive Attention Learning (IAL) IAL is prone to overfitting due to scarcity of human annotations, and requires costly retraining. We tackle these challenges by proposing a sample-efficient attention mechanism and a cost-effective reranking algorithm for instances and features.
arXiv Detail & Related papers (2020-06-09T17:36:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.