Attention Mechanism with Energy-Friendly Operations
- URL: http://arxiv.org/abs/2204.13353v1
- Date: Thu, 28 Apr 2022 08:50:09 GMT
- Title: Attention Mechanism with Energy-Friendly Operations
- Authors: Yu Wan, Baosong Yang, Dayiheng Liu, Rong Xiao, Derek F. Wong, Haibo
Zhang, Boxing Chen, Lidia S. Chao
- Abstract summary: We rethink the attention mechanism from the perspective of energy consumption.
We build a novel attention model by replacing multiplications with either selective operations or additions.
Empirical results on three machine translation tasks demonstrate that the proposed model achieves comparable accuracy.
- Score: 61.58748425876866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism has become the dominant module in natural language
processing models. It is computationally intensive and depends on massive
power-hungry multiplications. In this paper, we rethink variants of the attention
mechanism from the perspective of energy consumption. After reaching the conclusion
that the energy costs of several energy-friendly operations are far lower than
those of their multiplication counterparts, we build a novel attention model by
replacing multiplications with either selective operations or additions.
Empirical results on three machine translation tasks demonstrate that the
proposed model, compared with the vanilla one, achieves comparable accuracy while
saving 99% and 66% of the energy during alignment calculation and the whole
attention procedure, respectively. Code is available at: https://github.com/NLP2CT/E-Att.
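A minimal sketch of the core idea, assuming an addition-only alignment score (negative L1 distance between queries and keys) in place of the dot product; the exact selective and additive operations used in E-Att are defined in the paper and repository, so the scoring function here is only an illustrative stand-in:

```python
import numpy as np

def additive_alignment_attention(Q, K, V):
    """Attention whose alignment step avoids multiplications.

    Scores are negative L1 distances between queries and keys
    (subtractions, absolute values and additions only); the softmax and
    the value aggregation are kept as in vanilla attention.  Illustrative
    stand-in, not the exact E-Att formulation.
    """
    # scores[i, j] = -sum_d |Q[i, d] - K[j, d]|  (no multiplications)
    scores = -np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # aggregation still uses multiplications

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(additive_alignment_attention(Q, K, V).shape)  # (5, 8)
```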
Related papers
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
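The factorization FAST uses is specific to that paper; as a generic illustration of attention that scales linearly in sequence length, the hedged sketch below uses a simple feature-map (kernel) factorization so the n-by-n attention matrix is never materialized:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """O(n) attention via a feature map phi: softmax(Q K^T) V is replaced by
    phi(Q) (phi(K)^T V) normalized by phi(Q) (phi(K)^T 1).  Generic kernel
    trick for illustration, not the factorization proposed in FAST."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d_v) summary of keys and values
    z = Qp @ Kp.sum(axis=0)          # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (6, 4)
```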
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers [0.0]
We replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only.
This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations.
It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption.
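One way to picture such a mechanism is the hedged sketch below, where pairwise scores are built from additions only and a ReLU replaces the softmax; the Inhibitor's actual formulation and normalization differ, so treat this purely as an illustration:

```python
import numpy as np

def relu_additive_attention(Q, K, V, shift=10.0):
    """Scores from negative Manhattan distances (additions only), ReLU in
    place of softmax, and a cheap row-sum normalization.  `shift` is a
    hypothetical offset controlling how many keys survive the ReLU; the
    exact mechanism is defined in the Inhibitor paper."""
    scores = shift - np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    weights = np.maximum(scores, 0.0)                   # ReLU, no softmax
    norm = weights.sum(axis=-1, keepdims=True) + 1e-9   # avoid divide-by-zero
    return (weights / norm) @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(relu_additive_attention(Q, K, V).shape)  # (5, 8)
```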
arXiv Detail & Related papers (2023-10-03T13:34:21Z)
- On Feature Diversity in Energy-based Models [98.78384185493624]
An energy-based model (EBM) is typically formed of inner-model(s) that learn a combination of the different features to generate an energy mapping for each input configuration.
We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs.
arXiv Detail & Related papers (2023-06-02T12:30:42Z)
- Energy Transformer [64.22957136952725]
Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory.
We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function.
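The ET energy function itself is given in the paper; as a toy illustration of the associative-memory ingredient (a state descending an energy, with an attention-like update appearing as the gradient step), one might sketch:

```python
import numpy as np

def hopfield_energy(x, M, beta=1.0):
    """Toy associative-memory energy, not the ET energy function:
    E(x) = 0.5 * x.x - (1/beta) * logsumexp(beta * M @ x)."""
    z = beta * M @ x
    return 0.5 * x @ x - (np.log(np.exp(z - z.max()).sum()) + z.max()) / beta

def descend(x, M, beta=1.0, steps=20, lr=0.2):
    """Gradient descent on the energy; the gradient step
    x <- x - lr * (x - M.T @ softmax(beta * M @ x)) is attention-like."""
    for _ in range(steps):
        z = beta * M @ x
        p = np.exp(z - z.max())
        p /= p.sum()
        x = x - lr * (x - M.T @ p)
    return x

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 8))            # four stored patterns
x = M[0] + 0.3 * rng.standard_normal(8)    # noisy query near pattern 0
x_star = descend(x, M)
# energy is typically lower after descent
print(hopfield_energy(x, M), hopfield_energy(x_star, M))
```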
arXiv Detail & Related papers (2023-02-14T18:51:22Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
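The probe can be illustrated with a sketch like the one below, which swaps the input-dependent softmax(Q K^T) matrix for a fixed, input-independent one; here the constant matrix defaults to uniform averaging, whereas PAPA derives its constant matrices from the pretrained models, so this is only schematic:

```python
import numpy as np

def attention_with_constant_matrix(V, A_const=None):
    """Replace the input-dependent attention matrix with a constant one.
    A_const defaults to uniform averaging; PAPA instead uses constant
    matrices derived from the pretrained model's attention patterns."""
    n = V.shape[0]
    if A_const is None:
        A_const = np.full((n, n), 1.0 / n)   # fixed, input-independent weights
    return A_const @ V

rng = np.random.default_rng(2)
V = rng.standard_normal((5, 8))
print(attention_with_constant_matrix(V).shape)  # (5, 8)
```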
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- A Transistor Operations Model for Deep Learning Energy Consumption Scaling [14.856688747814912]
Deep Learning (DL) has transformed the automation of a wide range of industries and finds increasing ubiquity in society.
The increasing complexity of DL models and their widespread adoption have led to energy consumption doubling every 3-4 months.
Current FLOPs- and MACs-based methods consider only the linear operations.
We develop a bottom-level Transistor Operations (TOs) method to expose the role of activation functions and neural network structure in energy consumption scaling with DL model configuration.
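A toy version of such bottom-up accounting is sketched below; the per-operation transistor costs are placeholders invented for illustration, whereas the paper derives its coefficients from transistor-level analysis:

```python
# Toy transistor-operations (TO) estimate for one dense layer plus activation.
# The per-operation costs below are illustrative placeholders, not the
# coefficients derived in the paper.
COST = {"mult": 1000, "add": 150, "relu": 50, "exp_approx": 4000}

def dense_layer_tos(n_in, n_out, activation="relu"):
    macs = n_in * n_out                         # multiply-accumulate count
    tos = macs * (COST["mult"] + COST["add"])   # linear part (what FLOPs/MACs capture)
    act_cost = COST["relu"] if activation == "relu" else COST["exp_approx"]
    tos += n_out * act_cost                     # activations are not free
    return tos

print(dense_layer_tos(512, 512, "relu"))
```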
arXiv Detail & Related papers (2022-05-30T12:42:33Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
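As a schematic of the second point, the sketch below contrasts a loss computed only at masked positions with one computed over every output position; the concrete objectives studied in the paper are defined there:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Per-token cross entropy from raw logits (n, vocab) and labels (n,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels]

def masked_positions_loss(logits, labels, mask):
    return cross_entropy(logits, labels)[mask].mean()   # MASK positions only

def whole_output_loss(logits, labels):
    return cross_entropy(logits, labels).mean()         # every position

rng = np.random.default_rng(3)
logits = rng.standard_normal((10, 32))
labels = rng.integers(0, 32, size=10)
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True
print(masked_positions_loss(logits, labels, mask), whole_output_loss(logits, labels))
```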
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
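One plausible instantiation of double normalization is sketched below (normalize the score matrix over queries first, then over keys, so no key is entirely explained away); the paper's exact scheme and its guarantees are given there:

```python
import numpy as np

def doubly_normalized_attention(Q, K, V):
    """Normalize attention scores along both axes: a softmax over queries
    (columns) followed by a renormalization over keys (rows).  A sketch of
    one possible doubly-normalized scheme, not necessarily the paper's."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    col = np.exp(scores - scores.max(axis=0, keepdims=True))
    col /= col.sum(axis=0, keepdims=True)          # normalize over queries
    row = col / col.sum(axis=-1, keepdims=True)    # then over keys
    return row @ V

rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(doubly_normalized_attention(Q, K, V).shape)  # (5, 8)
```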
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention [7.967230034960396]
We evaluate whether various active-memory mechanisms could replace self-attention in a Transformer.
Experiments suggest that active-memory alone achieves comparable results to the self-attention mechanism for language modelling.
For some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two.
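A minimal sketch of the comparison, with a depthwise 1D convolution over the sequence standing in for an active-memory block next to plain single-head self-attention; the paper's active-memory variants are richer than this:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Plain single-head self-attention for reference."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def conv_active_memory(X, kernel):
    """Depthwise 1D convolution over the sequence axis as a stand-in for a
    convolution-based active-memory block (the paper's variants are richer)."""
    n, _ = X.shape
    k = kernel.shape[0]
    pad = np.pad(X, ((k // 2, k // 2), (0, 0)))
    return np.stack([(pad[i:i + k] * kernel[:, None]).sum(axis=0) for i in range(n)])

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
kernel = rng.standard_normal(3)
print(self_attention(X, Wq, Wk, Wv).shape, conv_active_memory(X, kernel).shape)
```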
arXiv Detail & Related papers (2019-12-27T02:01:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.