Related papers: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

URL: http://arxiv.org/abs/2506.05709v1
Date: Fri, 06 Jun 2025 03:18:11 GMT
Title: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Authors: Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang,
Abstract summary: We propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods.<n>Specifically, we reduce 40% FLOPs and accelerate DeiT-S by $times$1.5 with marginal 0.1% accuracy drop.<n>We extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation.
Score: 8.584066042703972
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision transformers have been widely explored in various vision tasks. Due to heavy computational cost, much interest has aroused for compressing vision transformer dynamically in the aspect of tokens. Current methods mainly pay attention to token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss and therefore post-training is inevitably required to recover the performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods are constructing special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and reserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce 40% FLOPs and accelerate DeiT-S by $\times$1.5 with marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.

Related papers

TP-Spikformer: Token Pruned Spiking Transformer [48.49025085338628]
Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm.<n>We propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance.
arXiv Detail & Related papers (2026-02-28T07:53:26Z)
ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference [12.986605266786839]
We introduce a novel Token Freezing and Reusing framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones.<n>ToFe reduces the computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy.
arXiv Detail & Related papers (2025-07-22T06:17:44Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.<n>We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation. Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens. Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers. We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens. Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
arXiv Detail & Related papers (2022-10-03T15:49:48Z)
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost. Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data. We propose a dynamic token sparsification framework to prune redundant tokens. We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input. By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%37% FLOPs and improves the throughput by over 40%. DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one. With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.