SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
- URL: http://arxiv.org/abs/2112.13890v1
- Date: Mon, 27 Dec 2021 20:15:25 GMT
- Title: SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
- Authors: Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu
Sun, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang
- Abstract summary: Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be applied to vanilla Transformers of both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
- Score: 38.10083471492964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Vision Transformer (ViT) has continuously established new
milestones in the computer vision field, but its high computation and memory
cost makes it difficult to deploy in industrial production. Pruning, a
traditional model compression paradigm for hardware efficiency, has been widely
applied in various DNN structures. Nevertheless, it remains unclear how to
perform pruning tailored to the ViT structure. Considering three key points:
the structural characteristics, the internal data pattern of ViTs, and the
related edge device deployment, we leverage the input token sparsity and
propose a computation-aware soft pruning framework, which can be applied to
vanilla Transformers of both flattened and CNN-type structures, such as
Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based
multi-head token selector, which is a lightweight module for adaptive
instance-wise token selection. We further introduce a soft pruning technique,
which integrates the less informative tokens generated by the selector module
into a package token that will participate in subsequent calculations rather
than being completely discarded. Through our proposed computation-aware
training strategy, our framework ties the accuracy-computation trade-off to the
constraints of specific edge devices. Experimental results show
that our framework significantly reduces the computation cost of ViTs while
maintaining comparable performance on image classification. Moreover, our
framework can guarantee that the identified model meets the resource
specifications of mobile devices and FPGAs, and can even achieve real-time
execution of DeiT-T on mobile platforms. For example, our method reduces the
latency of DeiT-T to 26 ms (26%~41% better than existing works) on a mobile
device with 0.25%~4% higher top-1 accuracy on ImageNet. Our code will be
released
soon.
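The abstract describes two mechanisms: a lightweight multi-head selector that scores tokens per input instance, and a soft pruning step that merges the less informative tokens into a single package token that stays in subsequent computation instead of being dropped. The following is a minimal PyTorch sketch of that idea written from the abstract alone; the module name SoftTokenPruner, the selector layers, the keep_ratio parameter, and the score-weighted merge are illustrative assumptions, not the authors' released implementation, and the computation-aware training strategy is not shown.

import torch
import torch.nn as nn


class SoftTokenPruner(nn.Module):
    """Hypothetical sketch of attention-based token selection with a package token."""

    def __init__(self, dim: int, num_heads: int = 2, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight selector: each head emits a per-token score; heads are
        # averaged into a single informativeness score per token.
        self.selector = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_heads),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); token 0 is assumed to be the class token
        # and is always kept.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.selector(patches).mean(dim=-1).softmax(dim=-1)  # (B, N-1)

        num_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=-1).indices
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        keep_mask.scatter_(1, keep_idx, True)
        kept = patches[keep_mask].view(x.size(0), num_keep, -1)

        # Soft pruning: fold the remaining tokens into one package token,
        # weighted by their renormalized selector scores, so their information
        # still participates in later layers.
        drop_scores = scores.masked_fill(keep_mask, 0.0)
        drop_scores = drop_scores / drop_scores.sum(-1, keepdim=True).clamp_min(1e-6)
        package = (drop_scores.unsqueeze(-1) * patches).sum(dim=1, keepdim=True)

        return torch.cat([cls_tok, kept, package], dim=1)  # (B, 1 + num_keep + 1, dim)


if __name__ == "__main__":
    pruner = SoftTokenPruner(dim=192, keep_ratio=0.7)  # DeiT-T embedding width
    tokens = torch.randn(2, 197, 192)                  # [CLS] + 196 patch tokens
    print(pruner(tokens).shape)                        # torch.Size([2, 139, 192])

In a full model, a module of this kind would sit between Transformer blocks so that later blocks attend over the shortened token sequence, which is where the computation savings would come from.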
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference [4.523939613157408]
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision.
This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on FPGAs.
CHOSEN achieves 1.5x and 1.42x improvements in throughput on the DeiT-S and DeiT-B models, respectively.
arXiv Detail & Related papers (2024-07-17T16:56:06Z)
- LPViT: Low-Power Semi-structured Pruning for Vision Transformers [42.91130720962956]
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for image analysis tasks.
One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption.
We introduce a new block-structured pruning approach to address the resource-intensiveness of ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
arXiv Detail & Related papers (2024-07-02T08:58:19Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%~8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4% computation reduction for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for the structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular structures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.