Revisiting Token Pruning for Object Detection and Instance Segmentation
- URL: http://arxiv.org/abs/2306.07050v3
- Date: Tue, 12 Dec 2023 23:00:25 GMT
- Title: Revisiting Token Pruning for Object Detection and Instance Segmentation
- Authors: Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide
Scaramuzza
- Abstract summary: We investigate token pruning to accelerate inference for object detection and instance segmentation.
We show a reduction in performance decline from ~1.5 mAP to ~0.3 mAP in both boxes and masks, compared to existing token pruning methods.
- Score: 25.3324628669201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have shown impressive performance in computer
vision, but their high computational cost, quadratic in the number of tokens,
limits their adoption in computation-constrained applications. However, this
large number of tokens may not be necessary, as not all tokens are equally
important. In this paper, we investigate token pruning to accelerate inference
for object detection and instance segmentation, extending prior works from
image classification. Through extensive experiments, we offer four insights for
dense tasks: (i) tokens should not be completely pruned and discarded, but
rather preserved in the feature maps for later use; (ii) reactivating
previously pruned tokens can further enhance model performance; (iii) a dynamic,
image-dependent pruning rate is better than a fixed pruning rate; and (iv) a
lightweight, 2-layer MLP can prune tokens effectively, matching the accuracy of
more complex gating networks with a simpler design. We assess the
effects of these design decisions on the COCO dataset and introduce an approach
that incorporates these findings, showing a reduction in performance decline
from ~1.5 mAP to ~0.3 mAP in both boxes and masks, compared to existing token
pruning methods. Relative to the dense counterpart that uses all tokens, our
method increases inference speed by up to 34% for the entire network and 46%
for the backbone.
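Taken together, the four insights admit a compact implementation. The following is a minimal PyTorch-style sketch, not the authors' released code; names such as `TokenPruner`, `keep_threshold`, and `prune_and_restore` are assumptions for illustration. It scores tokens with a 2-layer MLP (iv), derives an image-dependent keep rate from a score threshold (iii), runs the transformer blocks only on kept tokens, and writes the refined tokens back into the full feature map so pruned tokens remain available to the dense heads and for later reactivation (i, ii).

```python
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    """Lightweight 2-layer MLP token scorer (insight iv).

    Minimal sketch with assumed names, not the authors' released code.
    """

    def __init__(self, dim: int, hidden: int = 64, keep_threshold: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )
        # Thresholding the scores yields an image-dependent keep rate (insight iii).
        self.keep_threshold = keep_threshold

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, C) patch tokens
        scores = self.mlp(tokens).squeeze(-1).sigmoid()  # (B, N) keep probability
        keep = scores > self.keep_threshold              # dynamic number of kept tokens
        return scores, keep


def prune_and_restore(tokens, pruner, blocks):
    """Run transformer blocks on kept tokens only, then scatter the refined
    tokens back into the full feature map, so pruned tokens are preserved for
    the dense heads and can be reactivated later (insights i and ii)."""
    scores, keep = pruner(tokens)
    feature_map = tokens.clone()                     # pruned tokens stay available here
    for b in range(tokens.size(0)):
        idx = keep[b].nonzero(as_tuple=True)[0]      # indices of kept tokens
        if idx.numel() == 0:
            continue
        active = tokens[b, idx].unsqueeze(0)         # (1, n_kept, C)
        for blk in blocks:
            active = blk(active)
        feature_map[b, idx] = active.squeeze(0)      # write refined tokens back
    return feature_map, scores
```

In a full detector, reactivation would re-select tokens from `feature_map` at a later stage; here it is only indicated by keeping the pruned tokens in place.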
Related papers
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that identifies tokens that need to be attended as well as those that can skip a transformer layer (a simplified layer-skipping sketch appears after this list).
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens (see the merging sketch after this list).
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
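For the token-selective attention idea in the ToSA entry above, here is a simplified illustration of letting unselected tokens skip a transformer layer; `selective_layer` and `attend_mask` are hypothetical names, and this is not ToSA's actual interface.

```python
import torch
import torch.nn as nn


def selective_layer(tokens: torch.Tensor, block: nn.Module, attend_mask: torch.Tensor):
    """Apply a transformer block only to selected tokens; the rest skip the layer.

    tokens:      (B, N, C) token embeddings
    block:       any module mapping (B, n, C) -> (B, n, C)
    attend_mask: (B, N) boolean, True for tokens that should be attended
    Generic layer-skipping sketch, not the ToSA implementation.
    """
    out = tokens.clone()                             # skipped tokens pass through unchanged
    for b in range(tokens.size(0)):
        idx = attend_mask[b].nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        out[b, idx] = block(tokens[b, idx].unsqueeze(0)).squeeze(0)
    return out
```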
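Several entries above (PiToMe and the multi-scale token-mergence method) keep information from less important tokens by merging them into more important ones rather than discarding them. The snippet below is a generic similarity-based merging sketch under assumed names (`merge_tokens`, `importance`); it does not reproduce any specific paper's algorithm.

```python
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, importance: torch.Tensor, n_keep: int):
    """Merge the least important tokens into their most similar kept token.

    tokens:     (N, C) token embeddings for one image
    importance: (N,) per-token score (e.g., attention mass or a learned score)
    n_keep:     number of tokens to retain
    Returns (n_keep, C) merged tokens. Generic sketch, not a specific method.
    """
    order = importance.argsort(descending=True)
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    kept, dropped = tokens[keep_idx], tokens[drop_idx]
    if drop_idx.numel() == 0:
        return kept

    # Route each dropped token to its most similar kept token (cosine similarity).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).t()  # (n_drop, n_keep)
    target = sim.argmax(dim=-1)

    # Average every kept token with the dropped tokens routed to it.
    merged = kept.clone()
    counts = torch.ones(kept.size(0), device=tokens.device)
    merged.index_add_(0, target, dropped)
    counts.index_add_(0, target, torch.ones_like(target, dtype=counts.dtype))
    return merged / counts.unsqueeze(-1)
```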
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.