Attentions Under the Microscope: A Comparative Study of Resource Utilization for Variants of Self-Attention
- URL: http://arxiv.org/abs/2507.07247v1
- Date: Wed, 09 Jul 2025 19:37:23 GMT
- Title: Attentions Under the Microscope: A Comparative Study of Resource Utilization for Variants of Self-Attention
- Authors: Zhengyu Tian, Anantha Padmanaban Krishna Kumar, Hemant Krishnakumar, Reza Rawassizadeh
- Abstract summary: We benchmark eight attention mechanisms when training a GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, achieve the best energy efficiency. Our study highlights the importance of energy-aware benchmarking in attention design and provides practical insights for selecting resource-efficient mechanisms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) and visual language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation of their actual energy usage and hardware resource demands during training. In this work, we benchmark eight attention mechanisms when training a GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Our results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, Locality-Sensitive Hashing (LSH) Attention, and Multi-Head Latent Attention (MLA), achieve the best energy efficiency. We further show that lower GPU power alone does not guarantee reduced energy use, as training time plays an equally important role. Our study highlights the importance of energy-aware benchmarking in attention design and provides practical insights for selecting resource-efficient mechanisms. All our code is available on GitHub.
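The abstract describes measuring training time, GPU memory usage, power, and energy for each attention variant. Below is a minimal sketch of how such per-step measurements are commonly collected in PyTorch; the pynvml-based power sampling, the HF-style `model(**batch).loss` call, and the step count are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' released code): wall-clock time, peak GPU
# memory, and NVML power sampling around a training loop, assuming a PyTorch
# model on an NVIDIA GPU exposed through pynvml.
import time
import torch
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def benchmark_training(model, batch, optimizer, n_steps=100):
    torch.cuda.reset_peak_memory_stats()
    power_samples = []
    start = time.time()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = model(**batch).loss          # assumes an HF-style GPT-2 forward
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        # nvmlDeviceGetPowerUsage returns milliwatts
        power_samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
    elapsed = time.time() - start
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    avg_power_w = sum(power_samples) / len(power_samples)
    # Energy depends on power AND time, which is why low GPU power alone
    # does not guarantee low energy use.
    energy_j = avg_power_w * elapsed
    return {"time_s": elapsed, "peak_mem_GB": peak_mem_gb,
            "avg_power_W": avg_power_w, "energy_J": energy_j}
```

Multiplying average power by wall-clock time gives an energy estimate, illustrating the abstract's point that a low-power but slow attention mechanism can still consume more energy overall than a faster, higher-power one.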
Related papers
- FlashBias: Fast Computation of Attention with Bias [77.39043478894504]
This paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast and exact computation for many widely used attention biases. FlashBias can take full advantage of the highly optimized matrix multiplication operations in modern GPUs, achieving a 1.5$\times$ speedup for AlphaFold and over a 2$\times$ speedup for attention with bias in vision and language models, without loss of accuracy.
arXiv Detail & Related papers (2025-05-17T15:12:50Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - A-VL: Adaptive Attention for Large Vision-Language Models [10.027871150748956]
Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing techniques, offering substantial application potential. Current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and that different modalities have different attention patterns. We develop A-VL, a plug-and-play adaptive attention mechanism tailored for LVLM inference.
arXiv Detail & Related papers (2024-09-23T09:22:59Z) - Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
arXiv Detail & Related papers (2024-01-11T08:40:35Z) - Design Space Exploration of Low-Bit Quantized Neural Networks for Visual
Place Recognition [26.213493552442102]
Visual Place Recognition (VPR) is a critical task for performing global re-localization in visual perception systems.
Recently, new works have focused on the recall@1 metric as a performance measure, with limited focus on resource utilization.
This has resulted in methods that use deep learning models too large to deploy on low-powered edge devices.
We study the impact of compact convolutional network architecture design in combination with full-precision and mixed-precision post-training quantization on VPR performance.
arXiv Detail & Related papers (2023-12-14T15:24:42Z) - Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z) - Trends in Energy Estimates for Computing in AI/Machine Learning
Accelerators, Supercomputers, and Compute-Intensive Applications [3.2634122554914]
We examine the computational energy requirements of different systems driven by the geometrical scaling law.
We show that energy efficiency due to geometrical scaling is slowing down.
At the application level, general-purpose AI-ML methods can be computationally energy intensive.
arXiv Detail & Related papers (2022-10-12T16:14:33Z) - Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
arXiv Detail & Related papers (2021-02-08T17:00:26Z) - Source Code Classification for Energy Efficiency in Parallel Ultra
Low-Power Microcontrollers [5.4352987210173955]
This paper aims to make the software toolchain smarter, so that modern architectures are exploited as effectively as possible.
In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption.
Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.
arXiv Detail & Related papers (2020-12-12T15:12:03Z) - Energy Aware Deep Reinforcement Learning Scheduling for Sensors
Correlated in Time and Space [62.39318039798564]
We propose a scheduling mechanism capable of taking advantage of correlated information.
The proposed mechanism is capable of determining the frequency with which sensors should transmit their updates.
We show that our solution can significantly extend the sensors' lifetime.
arXiv Detail & Related papers (2020-11-19T09:53:27Z) - Resource-Efficient Neural Networks for Embedded Systems [23.532396005466627]
We provide an overview of the current state of the art of machine learning techniques.
We focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade.
We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques.
arXiv Detail & Related papers (2020-01-07T14:17:09Z)
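The last entry above surveys resource-efficient DNN inference via compression techniques. As a concrete illustration of one such technique, here is a minimal post-training dynamic quantization sketch in PyTorch; the toy model and layer selection are assumptions for illustration and are not tied to that survey's experiments.

```python
# Minimal sketch of one common compression technique (post-training dynamic
# quantization in PyTorch); the toy model below is an illustrative assumption.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Quantize Linear weights to int8; activations stay in float and are
# quantized dynamically at inference time, shrinking model size and
# typically reducing CPU inference cost on embedded targets.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```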
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.