UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices
- URL: http://arxiv.org/abs/2412.02344v1
- Date: Tue, 03 Dec 2024 10:04:15 GMT
- Title: UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices
- Authors: Seul-Ki Yeom, Tae-Ho Kim
- Abstract summary: Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging.
We introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization.
- Score: 1.795366746592388
- Abstract: Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes separate attention matrices for each head, Reuse Attention consolidates these computations into a shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet-1K classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves 76.7% Top-1 accuracy on ImageNet-1K with 21.8 ms inference time on edge devices like the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications.
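For concreteness, here is a minimal PyTorch sketch of the mechanism as the abstract describes it: one attention matrix is computed from a single shared query/key pair and applied to every value head. The head count, projection widths, and sharing details are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReuseAttentionSketch(nn.Module):
    """Illustrative sketch of the Reuse Attention idea: one shared
    attention matrix instead of one per head. Dimensions and the exact
    sharing scheme are assumptions, not the paper's specification."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Single-head Q/K projections -> one shared attention matrix.
        self.q = nn.Linear(dim, self.head_dim)
        self.k = nn.Linear(dim, self.head_dim)
        # Values keep the full width, split across heads as usual.
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        B, N, _ = x.shape
        q, k = self.q(x), self.k(x)                       # (B, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)                       # (B, N, N), computed once
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Reuse the same attention matrix for every value head.
        out = attn.unsqueeze(1) @ v                       # (B, H, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

Compared with MHA, the query/key projections and the N x N softmax run once rather than once per head, which is where the claimed memory and compute savings come from.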
Related papers
- MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [24.1144641404561]
We propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators.
We show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario.
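The summary does not spell out the scheduling scheme, but the standard route to exact attention under a tight memory budget is to stream over key/value tiles with an online softmax, in the spirit of FLAT and FlashAttention. A minimal single-head sketch, with tile size and layout chosen arbitrarily:

```python
import torch

def tiled_exact_attention(q, k, v, tile: int = 64):
    """Exact attention streamed over key/value tiles with an online
    softmax, so the full (N x N) score matrix is never materialized.
    A generic sketch, not MAS-Attention's actual scheduler."""
    scale = q.shape[-1] ** -0.5
    n = k.shape[0]
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row maxima
    l = torch.zeros(q.shape[0], 1)                  # running softmax denominators
    acc = torch.zeros_like(q)                       # running weighted sums
    for start in range(0, n, tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                      # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        fix = torch.exp(m - m_new)                  # rescale earlier partial results
        l = l * fix + p.sum(dim=-1, keepdim=True)
        acc = acc * fix + p @ vt
        m = m_new
    return acc / l

# Agrees with the naive quadratic computation up to float error:
q, k, v = torch.randn(3, 128, 32).unbind(0)
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_exact_attention(q, k, v), ref, atol=1e-4)
```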
arXiv Detail & Related papers (2024-11-20T19:44:26Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
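For reference, the trick that gives linear attention its complexity is reordering (QK^T)V into Q(K^T V) after replacing softmax with a feature map. The sketch below uses a plain ReLU feature map as a placeholder; FLatten's focused mapping differs and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Generic kernelized linear attention: softmax(QK^T)V is approximated
    by phi(Q)(phi(K)^T V), reordered so the cost is O(N d^2) rather than
    O(N^2 d) in sequence length N. The ReLU feature map is a stand-in."""
    phi_q, phi_k = F.relu(q), F.relu(k)            # feature maps, (B, N, d)
    kv = phi_k.transpose(-2, -1) @ v               # (B, d, d) summary, linear in N
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (B, N, 1)
    return (phi_q @ kv) / (z + eps)                # normalized output, (B, N, d)
```

How closely phi approximates softmax is exactly where the degradation-versus-overhead trade-off mentioned above arises.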
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" that achieve state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on an iPhone 14, making it more accurate and 2x faster than MobileViT-v2.
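The element-wise replacement can be sketched as follows: a learned vector scores the queries, which are pooled into a single global query that modulates the keys by broadcast multiplication, so no N x N matrix is ever formed. Layer sizes and the residual wiring are assumptions, not SwiftFormer's exact block.

```python
import torch
import torch.nn as nn

class AdditiveAttentionSketch(nn.Module):
    """Sketch of additive attention in the SwiftFormer style: per-token
    scalar scores pool the queries into one global query, which then
    interacts with the keys element-wise -- O(N d) instead of O(N^2 d)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w = nn.Parameter(torch.randn(dim))    # learned scoring vector
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        q, k = self.to_q(x), self.to_k(x)
        alpha = torch.softmax(q @ self.w * self.scale, dim=-1)  # (B, N)
        g = torch.einsum("bn,bnd->bd", alpha, q)                # global query
        out = g.unsqueeze(1) * k             # element-wise, replaces QK^T
        return self.proj(out) + q
```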
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
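A minimal version of the reuse pattern, with a plain linear map standing in for SkipAt's parametric function (that choice, and the caching interface, are assumptions):

```python
import torch
import torch.nn as nn

class SkipAttentionSketch(nn.Module):
    """One block of a SkipAt-style network: layers either compute real
    self-attention and cache its output, or approximate attention by
    applying a cheap learned map to the cached output."""

    def __init__(self, dim: int, reuse: bool, num_heads: int = 4):
        super().__init__()
        self.reuse = reuse
        if reuse:
            self.approx = nn.Linear(dim, dim)   # lightweight stand-in
        else:
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, cache: dict) -> torch.Tensor:
        if self.reuse:
            out = self.approx(cache["attn_out"])      # skip the O(N^2) attention
        else:
            out, _ = self.attn(x, x, x, need_weights=False)
            cache["attn_out"] = out                   # make it reusable downstream
        return x + out

cache: dict = {}
x = torch.randn(2, 16, 64)
full = SkipAttentionSketch(64, reuse=False)
skip = SkipAttentionSketch(64, reuse=True)
y = skip(full(x, cache), cache)   # second layer reuses the first layer's attention
```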
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- Couplformer: Rethinking Vision Transformer with Coupling Attention Map [7.789667260916264]
The Transformer model has demonstrated its outstanding performance in the computer vision domain.
We propose a novel memory-efficient attention mechanism named Couplformer, which decouples the attention map into two sub-matrices.
Experiments show that Couplformer can reduce memory consumption by 28% compared with the regular Transformer.
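One way to realize such a decoupling, shown purely as an illustration (the pooling construction below is an assumption, not Couplformer's exact method): factor the (N x N) map over an h x w token grid into an (h x h) and a (w x w) sub-matrix whose Kronecker product plays the role of the full map, cutting storage from N^2 to h^2 + w^2.

```python
import torch

def coupled_attention_map(q, k, h: int, w: int):
    """Build an (N x N) attention map, N = h * w, as the Kronecker product
    of two small sub-matrices derived from axis-pooled tokens. Illustrative
    reconstruction only; Couplformer's construction differs in detail."""
    d = q.shape[-1]
    q2, k2 = q.view(h, w, d), k.view(h, w, d)
    a_row = torch.softmax(q2.mean(1) @ k2.mean(1).T * d ** -0.5, dim=-1)  # (h, h)
    a_col = torch.softmax(q2.mean(0) @ k2.mean(0).T * d ** -0.5, dim=-1)  # (w, w)
    return torch.kron(a_row, a_col)   # row-stochastic (N x N) map

q, k = torch.randn(2, 64, 32).unbind(0)       # 64 tokens = an 8 x 8 grid
attn = coupled_attention_map(q, k, h=8, w=8)  # only 8x8 sub-matrices are formed
```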
arXiv Detail & Related papers (2021-12-10T10:05:35Z)
- Edge AI without Compromise: Efficient, Versatile and Accurate Neurocomputing in Resistive Random-Access Memory [0.0]
We present NeuRRAM - the first multimodal edge AI chip using RRAM compute-in-memory (CIM).
We show record energy efficiency, 5x to 8x better than prior art, across various computational bit-precisions.
This work paves the way towards building highly efficient and reconfigurable edge AI hardware platforms.
arXiv Detail & Related papers (2021-08-17T21:08:51Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures across a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
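A simple variant of double normalization, written out as a sketch (the paper's exact normalization order and its theoretical guarantees are not reproduced here): scores are first normalized over the query axis, so each key hands out one unit of attention, and then re-normalized over keys so each query's weights sum to one.

```python
import torch

def doubly_normalized_attention(q, k, v, eps: float = 1e-6):
    """Illustrative doubly-normalized attention: column-normalize scores
    over queries, then row-normalize over keys, so no key can be fully
    'explained away' by the others. q: (Nq, d); k, v: (Nk, d)."""
    scores = (q @ k.T) * q.shape[-1] ** -0.5      # (Nq, Nk)
    w = torch.softmax(scores, dim=0)              # over queries: keys give out mass
    w = w / (w.sum(dim=-1, keepdim=True) + eps)   # over keys: rows sum to one
    return w @ v
```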
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.