Exploring Token Pruning in Vision State Space Models
- URL: http://arxiv.org/abs/2409.18962v1
- Date: Fri, 27 Sep 2024 17:59:50 GMT
- Title: Exploring Token Pruning in Vision State Space Models
- Authors: Zheng Zhan, Zhenglun Kong, Yifan Gong, Yushu Wu, Zichong Meng, Hangyu Zheng, Xuan Shen, Stratis Ioannidis, Wei Niu, Pu Zhao, Yanzhi Wang
- Abstract summary: State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers.
We take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning.
We achieve 81.7% accuracy on ImageNet with a 41.6% reduction in FLOPs for the pruned PlainMamba-L3.
- Score: 38.122017567843905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observation that the final prediction in vision transformers (ViTs) is based on only a subset of the most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, direct application of existing token pruning techniques designed for ViTs fails to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method to stabilize the neighborhood of remaining tokens for performance enhancement. In addition, based on our detailed analysis, we propose a token importance evaluation method adapted to SSM models to guide the token pruning. With efficient implementation and practical acceleration methods, our method brings actual speedup. Extensive experiments demonstrate that our approach can achieve significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7% accuracy on ImageNet with a 41.6% reduction in FLOPs for the pruned PlainMamba-L3. Furthermore, our work provides deeper insights into the behavior of SSM-based vision models for future research.
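A minimal Python/PyTorch sketch of the general idea is given below. It is an illustration, not the authors' implementation: the importance score (a per-token feature norm), the keep ratio, and the toy diagonal recurrence `ssm_scan` are placeholder assumptions, and the paper's pruning-aware hidden state alignment is only approximated here by keeping the retained tokens in their original sequential order before the scan.

```python
# Sketch only: importance-based token pruning in front of an SSM-style scan.
# The paper's importance metric and hidden-state alignment are described at a
# high level in the abstract, so the feature-norm score and the diagonal
# recurrence below are placeholder assumptions.
import torch


def token_importance(x: torch.Tensor) -> torch.Tensor:
    """Placeholder importance score: L2 norm of each token's features.

    x: (batch, seq_len, dim) token features; returns (batch, seq_len) scores.
    """
    return x.norm(dim=-1)


def prune_tokens(x: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k tokens by score, restoring their original order so the
    sequential positions seen by the scan stay consistent."""
    batch, seq_len, dim = x.shape
    k = max(1, int(seq_len * keep_ratio))
    scores = token_importance(x)                        # (batch, seq_len)
    top_idx = scores.topk(k, dim=1).indices             # (batch, k), unordered
    top_idx, _ = top_idx.sort(dim=1)                    # back to sequence order
    return x.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, dim))


def ssm_scan(x: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    """Toy diagonal linear recurrence h_t = decay * h_{t-1} + x_t, standing in
    for a full selective-scan SSM block."""
    h = torch.zeros_like(x[:, 0])
    outs = []
    for t in range(x.shape[1]):
        h = decay * h + x[:, t]
        outs.append(h)
    return torch.stack(outs, dim=1)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)       # e.g. a 14x14 grid of patch tokens
    kept = prune_tokens(tokens, keep_ratio=0.6)
    print(ssm_scan(kept).shape)            # torch.Size([2, 117, 64])
```

The one structural point the sketch does preserve is that, unlike attention, the recurrence is order-dependent, so retained tokens are sorted back into sequence order before the scan rather than fed in importance order.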
Related papers
- Rethinking Token Reduction for State Space Models [47.00760373683448]
We propose a tailored, unified post-training token reduction method for State Space Models (SSMs)
Our approach integrates token importance and similarity, thus taking advantage of both pruning and merging.
Our method improves the average accuracy by 5.7% to 13.1% on six benchmarks with Mamba-2 compared to existing methods.
arXiv Detail & Related papers (2024-10-16T00:06:13Z) - big.LITTLE Vision Transformer for Efficient Visual Recognition [34.015778625984055]
big.LITTLE Vision Transformer is an innovative architecture aimed at achieving efficient visual recognition.
The system is composed of two distinct blocks: the big performance block and the LITTLE efficiency block.
When processing an image, our system determines the importance of each token and allocates them accordingly.
arXiv Detail & Related papers (2024-10-14T08:21:00Z) - Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question if the extremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NNs).
We show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z) - SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models [35.5601603013045]
We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs)
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
arXiv Detail & Related papers (2023-05-24T11:18:00Z) - Depth Estimation with Simplified Transformer [4.565830918989131]
Transformer and its variants have shown state-of-the-art results in many vision tasks recently.
We propose a method for self-supervised monocular Depth Estimation with simplified Transformer (DEST)
Our model leads to significant reductions in model size, complexity, and inference latency, while achieving superior accuracy compared to the state of the art.
arXiv Detail & Related papers (2022-04-28T21:39:00Z) - Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.