Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation
- URL: http://arxiv.org/abs/2506.16369v1
- Date: Thu, 19 Jun 2025 14:45:46 GMT
- Title: Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation
- Authors: Pallabi Dutta, Anubhab Maity, Sushmita Mitra
- Abstract summary: This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The experimental results show a reduction of $\sim$35-55% in tokens, thus reducing the computational costs relative to the baselines.
- Score: 0.06554326244334867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high computational demands of Vision Transformers (ViTs) in processing a huge number of tokens often constrain their practical application in analyzing medical images. This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy facilitates end-to-end training, maintains gradient flow, and improves segmentation accuracy by focusing computational resources on essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of $\sim$35-55\% in tokens, thus reducing the computational costs relative to the baselines. Cost-effective medical image processing using our framework facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.
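
The abstract describes ranking tokens with a prompt-based spatial prior and down-weighting the low-relevance ones rather than discarding them outright. The sketch below illustrates that general idea only; it is not the authors' implementation. The Gaussian prior around a point prompt, the sigmoid gate, and every name and parameter (`prompt_relevance`, `soft_gate`, `sigma`, `threshold`, `temperature`) are assumptions made for illustration, since the abstract does not specify how relevance is scored.

```python
import math


def prompt_relevance(grid_h, grid_w, prompt_rc, sigma=2.0):
    """Score each token of a grid_h x grid_w patch grid by its distance
    to a point prompt (row, col). The Gaussian form is an assumption."""
    pr, pc = prompt_rc
    scores = []
    for r in range(grid_h):
        for c in range(grid_w):
            d2 = (r - pr) ** 2 + (c - pc) ** 2
            scores.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return scores


def soft_gate(scores, threshold=0.3, temperature=0.05):
    """Map relevance scores to gates in (0, 1) with a sigmoid.
    A smooth gate (instead of a hard mask) stays differentiable,
    which is one way to keep gradients flowing during training."""
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature))
            for s in scores]


def down_weight(tokens, gates):
    """Scale every token embedding by its gate value."""
    return [[g * x for x in tok] for tok, g in zip(tokens, gates)]


def kept_fraction(gates, cutoff=0.5):
    """Fraction of tokens whose gate is at or above the cutoff."""
    return sum(g >= cutoff for g in gates) / len(gates)


# Hypothetical usage: an 8x8 token grid with a point prompt at (3, 3).
scores = prompt_relevance(8, 8, (3, 3))
gates = soft_gate(scores)
tokens = [[1.0, 2.0] for _ in range(64)]  # toy 2-dimensional embeddings
weighted = down_weight(tokens, gates)
print(f"tokens kept above cutoff: {kept_fraction(gates):.2%}")
```

At inference time, tokens whose gate falls below the cutoff could be dropped from later stages to save computation. Whether the paper does this, and at which stages, is not stated in the abstract.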
Related papers
- Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning [8.284127681482202]
'LVTP' is a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features with twice clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. As a plug-and-play module, it requires no architectural changes or additional training.
arXiv Detail & Related papers (2025-04-25T00:43:20Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM. Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z) - Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of the Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z) - RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation [12.249546377051438]
Token merging has exhibited remarkable enhancements in inference speed, training efficiency, and memory utilization for image classification tasks.
This paper facilitates the deployment of transformer-based architectures on resource-constrained devices and in real-time applications.
arXiv Detail & Related papers (2024-05-23T11:54:27Z) - PAM-UNet: Shifting Attention on Region of Interest in Medical Images [5.730272874074418]
UNet and its variants face a critical challenge: balancing accuracy with computational efficiency.
We propose a novel Progressive Attention-based Mobile UNet (PAM-UNet) architecture.
Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a Dice score of 82.87.
arXiv Detail & Related papers (2024-05-02T17:33:26Z) - Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation [18.168932826183024]
This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation.
Experiments suggest that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods by 20%-35% on average.
arXiv Detail & Related papers (2023-08-02T09:40:02Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% while incurring only a 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.