ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision
Transformer Acceleration with a Linear Taylor Attention
- URL: http://arxiv.org/abs/2211.05109v1
- Date: Wed, 9 Nov 2022 18:58:21 GMT
- Title: ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision
Transformer Acceleration with a Linear Taylor Attention
- Authors: Jyotikrishna Dass, Shang Wu, Huihong Shi, Chaojian Li, Zhifan Ye,
Zhongfeng Wang and Yingyan Lin
- Abstract summary: Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications.
We propose a first-of-its-kind algorithm-hardware codesigned framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs.
ViTALiTy unifies both low-rank and sparse components of the attention in ViTs.
- Score: 23.874485033096917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer (ViT) has emerged as a competitive alternative to
convolutional neural networks for various computer vision applications.
Specifically, ViT multi-head attention layers make it possible to embed
information globally across the overall image. Nevertheless, computing and
storing such attention matrices incurs a quadratic cost dependency on the
number of patches, limiting its achievable efficiency and scalability and
prohibiting more extensive real-world ViT applications on resource-constrained
devices. Sparse attention has been shown to be a promising direction for
improving hardware acceleration efficiency for NLP models. However, a
systematic counterpart approach is still missing for accelerating ViT models.
To close the above gap, we propose a first-of-its-kind algorithm-hardware
codesigned framework, dubbed ViTALiTy, for boosting the inference efficiency of
ViTs. Unlike sparsity-based Transformer accelerators for NLP, ViTALiTy unifies
both low-rank and sparse components of the attention in ViTs. At the algorithm
level, we approximate the dot-product softmax operation via first-order Taylor
attention with row-mean centering as the low-rank component to linearize the
cost of attention blocks and further boost the accuracy by incorporating a
sparsity-based regularization. At the hardware level, we develop a dedicated
accelerator to better leverage the resulting workload and pipeline of
ViTALiTy's linear Taylor attention, which requires executing only the
low-rank component, to further boost hardware efficiency. Extensive
experiments and ablation studies validate that ViTALiTy offers boosted
end-to-end efficiency (e.g., $3\times$ faster and $3\times$ more energy-efficient)
under comparable accuracy, compared with the state-of-the-art solution.
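
For intuition, below is a minimal sketch of how a first-order Taylor expansion of the softmax linearizes attention, as described in the abstract. This is not the authors' released implementation: the function name, NumPy shapes, and the choice to apply the centering directly to the keys are illustrative assumptions.

```python
import numpy as np

def taylor_linear_attention(Q, K, V):
    """Sketch of first-order Taylor (linear) attention.

    Replaces exp(q . k) in softmax attention with its first-order Taylor
    expansion 1 + q . k, so the key-value product K^T V is computed once
    and shared by every query: O(n * d^2) instead of O(n^2 * d) in the
    number of patches n.  Q, K: (n, d); V: (n, d_v).
    """
    n, _ = Q.shape
    # Centering the keys by their mean is equivalent to subtracting the row
    # mean from the score matrix Q K^T (softmax is invariant to this), which
    # keeps the scores small so the first-order expansion stays accurate.
    Kc = K - K.mean(axis=0, keepdims=True)

    # Global context terms, computed once and reused by all queries.
    kv = Kc.T @ V                      # (d, d_v)
    k_sum = Kc.sum(axis=0)             # (d,), approximately 0 after centering
    v_sum = V.sum(axis=0)              # (d_v,)

    # Numerator of the approximation: (1 1^T + Q Kc^T) V, evaluated without
    # ever materializing the n x n attention matrix.
    numer = v_sum[None, :] + Q @ kv    # (n, d_v)
    # Row-sum normalizer of (1 1^T + Q Kc^T); reduces to roughly n after centering.
    denom = n + Q @ k_sum              # (n,)

    return numer / denom[:, None]
```

Per the abstract, only this low-rank, linear-cost path is executed at inference; the sparse component enters only through a training-time regularization that recovers accuracy.
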
Related papers
- CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference [4.523939613157408]
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision.
This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on the FPGAs.
CHOSEN achieves 1.5x and 1.42x improvements in throughput on the DeiT-S and DeiT-B models, respectively.
arXiv Detail & Related papers (2024-07-17T16:56:06Z) - LPViT: Low-Power Semi-structured Pruning for Vision Transformers [42.91130720962956]
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for image analysis tasks.
One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption.
We introduce a new block-structured pruning method to address the resource-intensive nature of ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
arXiv Detail & Related papers (2024-07-02T08:58:19Z) - You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z) - Accelerating Vision Transformers Based on Heterogeneous Attention
Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD improves run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and
Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)