Dynamic Token Pruning in Plain Vision Transformers for Semantic
Segmentation
- URL: http://arxiv.org/abs/2308.01045v2
- Date: Thu, 28 Sep 2023 03:33:16 GMT
- Title: Dynamic Token Pruning in Plain Vision Transformers for Semantic
Segmentation
- Authors: Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu
- Abstract summary: This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation.
Experiments suggest that the proposed DToP architecture reduces on average $20\% - 35\%$ of computational cost for current semantic segmentation methods.
- Score: 18.168932826183024
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision transformers have achieved leading performance on various visual tasks
yet still suffer from high computational complexity. The situation deteriorates
in dense prediction tasks like semantic segmentation, as high-resolution inputs
and outputs usually imply more tokens involved in computations. Directly
removing the less attentive tokens has been discussed for the image
classification task but cannot be extended to semantic segmentation since a
dense prediction is required for every patch. To this end, this work introduces
a Dynamic Token Pruning (DToP) method based on the early exit of tokens for
semantic segmentation. Motivated by the coarse-to-fine segmentation process by
humans, we naturally split the widely adopted auxiliary-loss-based network
architecture into several stages, where each auxiliary block grades every
token's difficulty level. We can finalize the prediction of easy tokens in
advance without completing the entire forward pass. Moreover, we keep $k$
highest confidence tokens for each semantic category to uphold the
representative context information. Thus, computational complexity will change
with the difficulty of the input, akin to the way humans do segmentation.
Experiments suggest that the proposed DToP architecture reduces on average
$20\% - 35\%$ of computational cost for current semantic segmentation methods
based on plain vision transformers without accuracy degradation.
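The early-exit mechanism the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name, the confidence threshold, and the use of a plain softmax head are all assumptions made for the example. It shows the two ingredients the abstract names: finalizing high-confidence (easy) tokens early, while retaining the $k$ most confident tokens per predicted class as context for the remaining computation.

```python
import numpy as np

def prune_easy_tokens(tokens, aux_logits, threshold=0.95, k=2):
    """Sketch of a DToP-style early exit (hypothetical names/values).

    tokens:     (N, D) token features at an auxiliary stage
    aux_logits: (N, C) class logits from the auxiliary head
    Returns the tokens kept for further processing, the finalized
    predictions of exited tokens, and the boolean exit mask.
    """
    # Per-token class confidence via softmax over the auxiliary logits.
    e = np.exp(aux_logits - aux_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1)
    pred = probs.argmax(axis=-1)

    # "Easy" tokens are confident enough to be finalized early.
    easy = conf > threshold

    # Keep the k most confident tokens of each predicted class in the
    # computation anyway, to preserve representative context.
    keep = np.zeros_like(easy)
    for c in np.unique(pred):
        idx = np.where(pred == c)[0]
        top = idx[np.argsort(conf[idx])[::-1][:k]]
        keep[top] = True

    exit_mask = easy & ~keep
    return tokens[~exit_mask], pred[exit_mask], exit_mask
```

Downstream blocks then operate only on the surviving tokens, so the cost of the forward pass adapts to how many tokens the auxiliary head finds difficult.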
Related papers
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation [8.46894039954642]
We propose a novel multi-scale token adaptation algorithm for interactive segmentation.
By performing top-k operations across multi-scale tokens, the computational complexity is greatly reduced.
We also propose a token learning algorithm based on contrastive loss.
arXiv Detail & Related papers (2024-01-09T07:59:42Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Dynamic Token-Pass Transformers for Semantic Segmentation [22.673910995773262]
We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria.
Our method reduces FLOPs by roughly $40\% \sim 60\%$, with the drop in mIoU within 0.8% for various segmentation transformers.
arXiv Detail & Related papers (2023-08-03T06:14:24Z) - Token Sparsification for Faster Medical Image Segmentation [37.25161294917211]
We reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline.
STP predicts importance scores with a lightweight sub-network and samples the topK tokens.
MTA restores a full token sequence by assembling both sparse output tokens and pruned multi-layer intermediate ones.
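The STP/MTA pipeline summarized above reduces to two array operations: select the top-K tokens by a predicted importance score, then scatter the processed sparse tokens back into a full-length sequence. The sketch below is an illustration under assumed names; the real method uses a learned sub-network for scoring and multi-layer intermediate features as fill values, neither of which is shown here.

```python
import numpy as np

def sparsify_topk(tokens, scores, K):
    """Keep the K highest-scoring tokens and remember their positions."""
    idx = np.argsort(scores)[::-1][:K]
    return tokens[idx], idx

def restore_dense(sparse_tokens, idx, n_tokens, fill):
    """Scatter processed sparse tokens back into a dense sequence,
    filling pruned positions with a cached feature (stand-in for the
    pruned intermediate tokens MTA would reuse)."""
    out = np.broadcast_to(fill, (n_tokens, sparse_tokens.shape[1])).copy()
    out[idx] = sparse_tokens
    return out
```

The dense decoder can then run on the restored sequence as if no pruning had happened.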
arXiv Detail & Related papers (2023-03-11T23:59:13Z) - Expediting Large-Scale Vision Transformer for Dense Prediction without
Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
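The two non-parametric operators described above can be approximated with plain k-means: a clustering step that shrinks N tokens to a few cluster centers, and a reconstruction step that maps each original position back to its center. This is only a rough sketch of the idea; the paper's actual operators and their placement in the network differ.

```python
import numpy as np

def cluster_tokens(tokens, n_clusters, iters=5):
    """Naive k-means over tokens: reduce (N, D) tokens to cluster centers."""
    centers = tokens[:n_clusters].copy()  # simple init from the first tokens
    for _ in range(iters):
        # Squared distances from every token to every center.
        d = ((tokens[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_clusters):
            members = tokens[assign == c]
            if len(members):
                centers[c] = members.mean(0)
    return centers, assign

def reconstruct_tokens(centers, assign):
    """Restore a full-length sequence by copying each token's center back."""
    return centers[assign]
```

The transformer's expensive middle blocks would run on `centers` only, with `reconstruct_tokens` recovering a dense sequence for the prediction head.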
arXiv Detail & Related papers (2022-10-03T15:49:48Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
This work adopts transformers and incorporates them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z) - A Simple Baseline for Semi-supervised Semantic Segmentation with Strong
Data Augmentation [74.8791451327354]
We propose a simple yet effective semi-supervised learning framework for semantic segmentation.
A set of simple design and training techniques can collectively improve the performance of semi-supervised semantic segmentation significantly.
Our method achieves state-of-the-art results in the semi-supervised settings on the Cityscapes and Pascal VOC datasets.
arXiv Detail & Related papers (2021-04-15T06:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.