ATTACC the Quadratic Bottleneck of Attention Layers
- URL: http://arxiv.org/abs/2107.06419v1
- Date: Tue, 13 Jul 2021 22:23:40 GMT
- Title: ATTACC the Quadratic Bottleneck of Attention Layers
- Authors: Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Tushar Krishna
- Abstract summary: This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction compared to state-of-the-art edge and cloud accelerators.
- Score: 3.2741800634280245
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Attention mechanisms form the backbone of state-of-the-art machine learning
models for a variety of tasks. Deploying them on deep neural network (DNN)
accelerators, however, is prohibitively challenging, especially under long
sequences. Operators in attention layers exhibit limited reuse and quadratic
growth in memory footprint, leading to severe memory-boundedness. This paper
introduces a new attention-tailored dataflow, termed FLAT, which leverages
operator fusion, loop-nest optimizations, and interleaved execution. It
increases the effective memory bandwidth by efficiently utilizing the
high-bandwidth, low-capacity on-chip buffer and thus achieves better run time
and compute resource utilization. We term FLAT-compatible accelerators ATTACC.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42%
energy reduction compared to state-of-the-art edge and cloud accelerators.
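FLAT's actual loop-nest transformations and interleaving schedule are hardware-specific and are not spelled out in the abstract. The sketch below is only a minimal NumPy illustration of the idea such a fused, tiled dataflow exploits: the logit (QK^T) and attend (softmax x V) operators are computed one key/value tile at a time, so the quadratic score matrix never has to be staged in full. The function name, tile size, and single-head layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fused_tiled_attention(Q, K, V, tile=128):
    """Single-head attention computed one key/value tile at a time.

    Minimal sketch (not FLAT itself): the N x N score matrix is never
    materialized; only an (N, tile) slice of logits exists at any point,
    and the softmax is accumulated online across tiles.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V, dtype=np.float64)   # unnormalized output accumulator
    row_max = np.full(n, -np.inf)              # running max per query row
    row_sum = np.zeros(n)                      # running softmax denominator

    for start in range(0, n, tile):
        k_blk = K[start:start + tile]          # stream one key tile
        v_blk = V[start:start + tile]          # and its value tile
        s = (Q @ k_blk.T) * scale              # partial logits, shape (n, tile)

        new_max = np.maximum(row_max, s.max(axis=1))
        rescale = np.exp(row_max - new_max)    # correct previously accumulated state
        p = np.exp(s - new_max[:, None])       # numerically stable partial softmax

        row_sum = row_sum * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the unfused reference (illustrative only):
# rng = np.random.default_rng(0)
# Q, K, V = rng.standard_normal((3, 512, 64))
# s = (Q @ K.T) / np.sqrt(64)
# p = np.exp(s - s.max(1, keepdims=True)); p /= p.sum(1, keepdims=True)
# assert np.allclose(fused_tiled_attention(Q, K, V), p @ V)
```

Because the softmax is accumulated online across tiles, the working set per step is O(n * tile) rather than O(n^2), which is the property a small high-bandwidth on-chip buffer can exploit.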
Related papers
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods (a generic top-k selection sketch is given after this list).
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- TSB: Tiny Shared Block for Efficient DNN Deployment on NVCIM Accelerators [11.496631244103773]
"Tiny Shared Block (TSB)" integrates a small shared 1x1 convolution block into the Deep Neural Network architecture.
TSB achieves over a 20x reduction in the inference accuracy gap, over 5x training speedup, and a lower weights-to-device mapping cost.
arXiv Detail & Related papers (2024-05-08T20:53:38Z)
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Self-Attentive Pooling for Efficient Deep Learning [6.822466048176652]
We propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement to the standard pooling layers.
We surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%.
Our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints.
arXiv Detail & Related papers (2022-09-16T00:35:14Z)
- Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z)
- Fast and Memory-Efficient Network Towards Efficient Image Super-Resolution [44.909233016062906]
We build a memory-efficient image super-resolution network (FMEN) for resource-constrained devices.
FMEN runs 33% faster and reduces memory consumption by 74% compared with the state-of-the-art EISR model E-RFDN.
FMEN-S achieves the lowest memory consumption and the second shortest runtime in NTIRE 2022 challenge on efficient super-resolution.
arXiv Detail & Related papers (2022-04-18T16:49:20Z)
- Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation [10.563649948220371]
Deep neural networks (DNN) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions with ultra-fast on-chip computations.
arXiv Detail & Related papers (2021-08-25T18:50:24Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER).
SEER is a simple modification of existing off-policy deep reinforcement learning methods.
We show that SEER does not degrade the performance of RL agents while significantly saving computation and memory.
arXiv Detail & Related papers (2021-03-04T08:14:10Z)
- Improving Memory Utilization in Convolutional Neural Network Accelerators [16.340620299847384]
We propose a mapping method that allows activation layers to overlap and thus utilize the memory more efficiently.
Experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activations memory by up to 32.9%.
For higher resolution de-noising networks, we achieve activation memory savings of 48.8%.
arXiv Detail & Related papers (2020-07-20T09:34:36Z)
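The SPARSEK entry above describes selecting a constant number of KV pairs per query via a learned scoring network and a differentiable top-k mask. The snippet below is not SPARSEK; it is only a generic, non-differentiable illustration of the constant per-query budget idea, reusing the attention logits themselves as scores. The function name and the budget k are assumptions for illustration.

```python
import numpy as np

def hard_topk_attention(Q, K, V, k=8):
    """Each query attends to only k key/value pairs (hard top-k selection).

    Generic sketch of a constant per-query KV budget; SPARSEK itself uses
    a learned scoring network and a differentiable top-k mask, neither of
    which is reproduced here.
    """
    n, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)                      # (n, n) scores; dense here
                                                         # only for illustration
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]    # top-k key indices per query
    sel = np.take_along_axis(scores, idx, axis=1)        # (n, k) kept logits
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                    # softmax over the k survivors
    return np.einsum('nk,nkd->nd', w, V[idx])            # weighted sum of k values
```

With k fixed, the attend step costs O(n * k * d) rather than O(n^2 * d); the dense score computation in this sketch is still quadratic and is included only to keep the example self-contained.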
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.