ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
- URL: http://arxiv.org/abs/2507.16260v1
- Date: Tue, 22 Jul 2025 06:17:44 GMT
- Title: ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
- Authors: Haoyue Zhang, Jie Zhang, Song Guo
- Abstract summary: We introduce a novel Token Freezing and Reusing framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones. ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy.
- Score: 12.986605266786839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information across blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models to resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reuse at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovering the frozen tokens. By jointly optimizing these modules with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity than state-of-the-art methods.
Related papers
- Spark Transformer: Reactivating Sparsity in FFN and Attention [63.20677098823873]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z) - Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration [8.584066042703972]
We propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by 1.5$\times$ with a marginal 0.1% accuracy drop. We extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation.
arXiv Detail & Related papers (2025-06-06T03:18:11Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z) - PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
It reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
arXiv Detail & Related papers (2023-10-03T05:55:11Z) - Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens (a minimal sketch of this prune-and-merge pattern appears after this list).
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework at minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training (a minimal sketch of this idea appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output in the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)