Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
- URL: http://arxiv.org/abs/2407.12893v1
- Date: Wed, 17 Jul 2024 11:15:16 GMT
- Title: Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
- Authors: Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad,
- Abstract summary: We introduce a novel algorithm-architecture co-design approach that accelerates transformers using head sparsity, block sparsity and approximation opportunities to reduce computations in attention and reduce memory access.
With the observation of the huge redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning to prune unimportant blocks in the attention matrix at run time.
Also propose integer-based head pruning to detect and prune unimportant heads at an early stage at run time.
- Score: 1.0919012968294923
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the world of deep learning, Transformer models have become very significant, leading to improvements in many areas from understanding language to recognizing images, covering a wide range of applications. Despite their success, the deployment of these models in real-time applications, particularly on edge devices, poses significant challenges due to their quadratic computational intensity and memory demands. To overcome these challenges we introduce a novel Hybrid Dynamic Pruning (HDP), an efficient algorithm-architecture co-design approach that accelerates transformers using head sparsity, block sparsity and approximation opportunities to reduce computations in attention and reduce memory access. With the observation of the huge redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning to prune unimportant blocks in the attention matrix at run time, also propose integer-based head pruning to detect and prune unimportant heads at an early stage at run time. Also we propose an approximation method that reduces attention computations. To efficiently support these methods with lower latency and power efficiency, we propose a HDP co-processor architecture.
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Dynamic Range Reduction via Branch-and-Bound [1.533133219129073]
Key strategy to enhance hardware accelerators is the reduction of precision in arithmetic operations.
This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems.
Experiments validate our algorithm's effectiveness on an actual quantum annealer.
arXiv Detail & Related papers (2024-09-17T03:07:56Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z) - EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration [1.741980945827445]
We propose EfficientMorph, a transformer-based architecture for unsupervised 3D image registration.
It optimize the balance between local and global attention through a plane-based attention mechanism.
It reduces computational redundancy via cascaded group attention, and captures fine details without compromising computational efficiency.
arXiv Detail & Related papers (2024-03-16T22:01:55Z) - Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z) - Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA) for short.
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.