Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI
- URL: http://arxiv.org/abs/2507.09702v1
- Date: Sun, 13 Jul 2025 16:26:05 GMT
- Title: Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI
- Authors: Phat Nguyen, Ngai-Man Cheung
- Abstract summary: Token compression techniques have emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. We present the first systematic taxonomy and comparative study of token compression methods. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs.
- Score: 26.45869748408205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these methods aim to remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy-efficiency trade-offs on large-scale ViTs, two critical gaps remain. First, there is a lack of a unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine-tuning vs. plug-in). Second, most benchmarks are limited to standard ViT models (e.g., ViT-B, ViT-L), leaving open the question of whether such methods remain effective when applied to structurally compressed transformers, which are increasingly deployed on resource-constrained edge devices. To address these gaps, we present the first systematic taxonomy and comparative study of token compression methods, and we evaluate representative techniques on both standard and compact ViT architectures. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs. These findings not only provide practical insights but also pave the way for future research on adapting token optimization techniques to compact transformer-based networks for edge AI and AI agent applications.
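To make the pruning / merging / hybrid taxonomy concrete, here is a minimal sketch of the two core strategies. This is illustrative code, not the paper's implementation; the CLS-attention scoring rule and the simplified bipartite matching are assumptions.

```python
import torch
import torch.nn.functional as F

def prune_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Token pruning: keep the top-k patch tokens ranked by CLS attention.
    x: (B, N, D) patch tokens (CLS excluded); cls_attn: (B, N) CLS-to-patch weights."""
    k = max(1, int(x.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k)
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Token merging (simplified bipartite matching): average the r most
    similar (source, target) pairs instead of discarding their information."""
    src, dst = x[:, ::2].clone(), x[:, 1::2].clone()           # split tokens
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(1, 2)
    best_sim, best_dst = sim.max(dim=-1)                       # best target per source
    merged = []
    for b in range(x.size(0)):                                 # explicit loop for clarity
        keep = torch.ones(src.size(1), dtype=torch.bool)
        for s in best_sim[b].topk(r).indices.tolist():
            t = best_dst[b, s]
            dst[b, t] = (dst[b, t] + src[b, s]) / 2            # fold source into target
            keep[s] = False
        merged.append(torch.cat([src[b, keep], dst[b]], dim=0))
    return torch.stack(merged)

# Example: 196 patch tokens of a ViT-B-like model, reduced before attention.
x = torch.randn(2, 196, 768)
cls_attn = torch.rand(2, 196)
print(prune_tokens(x, cls_attn).shape)  # torch.Size([2, 98, 768])
print(merge_tokens(x, 49).shape)        # torch.Size([2, 147, 768])
```

Pruning discards the information carried by dropped tokens, while merging folds it into the survivors, which is one reason the two families trade accuracy and throughput differently.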
Related papers
- Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation [59.1067331268383]
Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViTs) to distribution shifts without additional training data. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs, reducing the total number of processed tokens. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency.
arXiv Detail & Related papers (2025-08-05T12:40:55Z)
- MoR-ViT: Efficient Vision Transformer with Mixture-of-Recursions [1.0411839100853515]
MoR-ViT is a novel vision transformer framework that incorporates a token-level dynamic recursion mechanism. Experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration.
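As a hedged sketch of what a token-level dynamic recursion mechanism can look like (the abstract does not specify MoR-ViT's routing rule; the threshold router and shared block below are assumptions):

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """A shared block applied up to max_depth times; a lightweight router
    decides per token whether to recurse again or exit early."""
    def __init__(self, dim=384, max_depth=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.router = nn.Linear(dim, 1)        # per-token "continue" score
        self.max_depth = max_depth

    def forward(self, x):                      # x: (B, N, D)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            y = self.block(x)                  # same weights on every pass
            x = torch.where(active.unsqueeze(-1), y, x)   # update active tokens only
            active = active & (self.router(x).squeeze(-1).sigmoid() > 0.5)
        return x

# Note: for clarity this sketch still computes the block over all tokens;
# a real implementation would gather the active tokens to actually save FLOPs.
print(RecursiveEncoder()(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 196, 384])
```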
arXiv Detail & Related papers (2025-07-29T12:46:36Z)
- Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies [0.0]
Vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks. Their high computational complexity and memory demands pose challenges for deployment on resource-constrained edge devices.
arXiv Detail & Related papers (2025-02-26T22:34:44Z)
- MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity [32.532780329341186]
Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Several DFQ methods have been proposed for vision transformer (ViT) architectures, but they fall short in low-bit settings. We devise MimiQ, a novel DFQ method designed for ViTs that enhances inter-head attention similarity.
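As a rough illustration of the idea (not the authors' code), a data-free pipeline could synthesize calibration inputs by minimizing a loss that pulls each head's attention map toward the head average:

```python
import torch

def inter_head_similarity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N, N) softmax attention maps of one layer.
    Penalizes disagreement between heads; in DFQ this would be minimized
    with respect to the synthetic images being generated."""
    mean_map = attn.mean(dim=1, keepdim=True)   # (B, 1, N, N) head average
    return ((attn - mean_map) ** 2).mean()      # MSE of each head to the mean

attn = torch.softmax(torch.randn(2, 6, 197, 197), dim=-1)
print(inter_head_similarity_loss(attn))         # scalar alignment loss
```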
arXiv Detail & Related papers (2024-07-29T13:57:40Z)
- Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression [63.23578860867408]
We investigate how to integrate the evaluations of importance and sparsity scores into a single stage.
We present OFB, a cost-efficient approach that simultaneously evaluates both importance and sparsity scores.
Experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods.
arXiv Detail & Related papers (2024-03-23T13:22:36Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party computation protocols, due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
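For intuition, "Taylorizing" means replacing an MPC-unfriendly nonlinearity with a low-degree polynomial; the sketch below substitutes GELU with its second-order Taylor expansion (PriViT additionally learns which activations to replace, which is omitted here):

```python
import math
import torch
import torch.nn as nn

class TaylorGELU(nn.Module):
    """Second-order Taylor expansion of GELU around 0:
    GELU(x) = x * Phi(x) ~= 0.5 * x + x**2 / sqrt(2 * pi).
    Polynomials are cheap under secure multi-party computation."""
    def forward(self, x):
        return 0.5 * x + x.pow(2) / math.sqrt(2 * math.pi)

x = torch.linspace(-1, 1, 5)
print(nn.GELU()(x))      # exact GELU, expensive under MPC
print(TaylorGELU()(x))   # polynomial surrogate, accurate near 0
```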
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
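A minimal sketch of such a budget-constrained objective, assuming sigmoid gates over layers and a hinge penalty on expected FLOPs (illustrative names, not the paper's code):

```python
import torch

def budget_loss(task_loss, gate_logits, gflops_per_layer, budget_gflops, lam=0.1):
    """gate_logits: (L,) pre-sigmoid keep-probabilities, one per block;
    gflops_per_layer: (L,) per-block cost; hinge penalty above the budget."""
    keep_prob = gate_logits.sigmoid()
    expected = (keep_prob * gflops_per_layer).sum()     # expected model cost
    return task_loss + lam * torch.relu(expected - budget_gflops)

gates = torch.zeros(12, requires_grad=True)             # 12 ViT-B-like blocks
cost = torch.full((12,), 1.4)                           # ~1.4 GFLOPs per block
loss = budget_loss(torch.tensor(1.0), gates, cost, budget_gflops=5.0)
loss.backward()
print(loss.item(), gates.grad.abs().sum().item() > 0)   # penalty active, gates trainable
```

The point of the joint formulation is that the gates (skip configurations and pruning masks) receive gradients from the same objective as the model weights, rather than being searched in a separate stage.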
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
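The abstract frames self-attention as a low-pass filter; one illustrative remedy in that spirit (a simplified assumption, not necessarily either of the paper's two techniques) is to split each attention map into its uniform (DC) component and high-frequency residue, then learn to amplify the residue:

```python
import torch

def rescale_attention(attn: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N, N) softmax attention; omega: learnable scalar.
    Uniform averaging is the pure low-pass part of attention; boosting the
    residue preserves high-frequency token differences in deep layers."""
    n = attn.size(-1)
    dc = torch.full_like(attn, 1.0 / n)     # uniform (DC) component
    hf = attn - dc                          # high-frequency residue
    return dc + (1.0 + omega) * hf          # rows still sum to 1

attn = torch.softmax(torch.randn(1, 6, 197, 197), dim=-1)
omega = torch.tensor(0.5, requires_grad=True)
out = rescale_attention(attn, omega)
print(out.sum(-1).allclose(torch.ones(1, 6, 197)))  # True: still a valid mixing
```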
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
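A hedged sketch of what structure-consistent pruning entails (illustrative helper, not UP-ViTs' API): removing a residual-stream channel must be mirrored by every layer that reads from or writes to that dimension, or the shapes no longer line up.

```python
import torch
import torch.nn as nn

def prune_linear(layer: nn.Linear, keep_in=None, keep_out=None) -> nn.Linear:
    """Return a new Linear keeping only the selected input/output channels."""
    w, b = layer.weight.data, layer.bias.data
    if keep_out is not None:
        w, b = w[keep_out], b[keep_out]
    if keep_in is not None:
        w = w[:, keep_in]
    new = nn.Linear(w.size(1), w.size(0))
    new.weight.data, new.bias.data = w, b
    return new

dim, keep = 768, torch.arange(0, 768, 2)      # drop every other residual channel
qkv, proj = nn.Linear(dim, 3 * dim), nn.Linear(dim, dim)
fc1, fc2 = nn.Linear(dim, 4 * dim), nn.Linear(4 * dim, dim)
# Readers of the residual stream lose input channels; writers lose outputs.
# (LayerNorm parameters would need the same index cut; omitted for brevity.)
qkv = prune_linear(qkv, keep_in=keep)
proj = prune_linear(proj, keep_out=keep)
fc1 = prune_linear(fc1, keep_in=keep)
fc2 = prune_linear(fc2, keep_out=keep)
print(qkv.weight.shape, proj.weight.shape)    # (2304, 384), (384, 768)
```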
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance on various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
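A simplified single-path sketch (SPViT actually derives the convolution candidate from shared MSA weights; the independent operators and soft gate below are assumptions for brevity):

```python
import torch
import torch.nn as nn

class SinglePathBlock(nn.Module):
    """A learnable gate softly selects between self-attention and a cheap
    convolution; after training, the preferred operator is kept."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.zeros(1))    # architecture parameter

    def forward(self, x):                            # x: (B, N, D)
        g = self.alpha.sigmoid()
        a, _ = self.attn(x, x, x)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return g * a + (1 - g) * c                   # soft single-path mix

print(SinglePathBlock()(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 196, 384])
```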
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.