SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
- URL: http://arxiv.org/abs/2112.13890v1
- Date: Mon, 27 Dec 2021 20:15:25 GMT
- Title: SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
- Authors: Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu
Sun, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang
- Abstract summary: Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be applied to vanilla Transformers of both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
- Score: 38.10083471492964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Vision Transformer (ViT) has continuously established new
milestones in the computer vision field, but its high computation and memory
cost makes it difficult to deploy in industrial production. Pruning, a
traditional model compression paradigm for hardware efficiency, has been widely
applied in various DNN structures. Nevertheless, it remains unclear how to
perform pruning tailored to the ViT structure. Considering three key points:
the structural characteristics, the internal data pattern of ViTs, and the
related edge device deployment, we leverage the input token sparsity and
propose a computation-aware soft pruning framework, which can be applied to
vanilla Transformers of both flattened and CNN-type structures, such as
Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based
multi-head token selector, which is a lightweight module for adaptive
instance-wise token selection. We further introduce a soft pruning technique,
which integrates the less informative tokens generated by the selector module
into a package token that will participate in subsequent calculations rather
than being completely discarded. Through our proposed computation-aware
training strategy, our framework ties the accuracy-computation trade-off to the
constraints of specific edge devices. Experimental results show
that our framework significantly reduces the computation cost of ViTs while
maintaining comparable performance on image classification. Moreover, our
framework can guarantee that the identified model meets the resource
specifications of mobile devices and FPGAs, and can even achieve real-time
execution of DeiT-T on mobile platforms. For example, our method reduces the
latency of DeiT-T to 26 ms (26%~41% better than existing works) on a mobile
device with 0.25%~4% higher top-1 accuracy on ImageNet. Our code will be
released
soon.
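The abstract describes two mechanisms: a lightweight multi-head selector that scores tokens per input instance, and a soft pruning step that merges the less informative tokens into a single package token that stays in subsequent computation instead of being dropped. The following is a minimal PyTorch sketch of that idea written from the abstract alone; the module name SoftTokenPruner, the selector layers, the keep_ratio parameter, and the score-weighted merge are illustrative assumptions, not the authors' released implementation, and the computation-aware training strategy is not shown.

import torch
import torch.nn as nn


class SoftTokenPruner(nn.Module):
    """Hypothetical sketch of attention-based token selection with a package token."""

    def __init__(self, dim: int, num_heads: int = 2, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight selector: each head emits a per-token score; heads are
        # averaged into a single informativeness score per token.
        self.selector = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_heads),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); token 0 is assumed to be the class token
        # and is always kept.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.selector(patches).mean(dim=-1).softmax(dim=-1)  # (B, N-1)

        num_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=-1).indices
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        keep_mask.scatter_(1, keep_idx, True)
        kept = patches[keep_mask].view(x.size(0), num_keep, -1)

        # Soft pruning: fold the remaining tokens into one package token,
        # weighted by their renormalized selector scores, so their information
        # still participates in later layers.
        drop_scores = scores.masked_fill(keep_mask, 0.0)
        drop_scores = drop_scores / drop_scores.sum(-1, keepdim=True).clamp_min(1e-6)
        package = (drop_scores.unsqueeze(-1) * patches).sum(dim=1, keepdim=True)

        return torch.cat([cls_tok, kept, package], dim=1)  # (B, 1 + num_keep + 1, dim)


if __name__ == "__main__":
    pruner = SoftTokenPruner(dim=192, keep_ratio=0.7)  # DeiT-T embedding width
    tokens = torch.randn(2, 197, 192)                  # [CLS] + 196 patch tokens
    print(pruner(tokens).shape)                        # torch.Size([2, 139, 192])

In a full model, a module of this kind would sit between Transformer blocks so that later blocks attend over the shortened token sequence, which is where the computation savings would come from.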
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference [4.523939613157408]
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision.
This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on FPGAs.
CHOSEN achieves 1.5x and 1.42x improvements in throughput on the DeiT-S and DeiT-B models, respectively.
arXiv Detail & Related papers (2024-07-17T16:56:06Z)
- LPViT: Low-Power Semi-structured Pruning for Vision Transformers [42.91130720962956]
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for image analysis tasks.
One significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, complexity, and power consumption.
We introduce a new block-structured pruning approach to address the resource-intensiveness of ViTs, offering a balanced trade-off between accuracy and hardware acceleration.
arXiv Detail & Related papers (2024-07-02T08:58:19Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%~8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4% computation reduction for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for the structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular structures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.