Related papers: Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation

Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation

URL: http://arxiv.org/abs/2506.16369v1
Date: Thu, 19 Jun 2025 14:45:46 GMT
Title: Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation
Authors: Pallabi Dutta, Anubhab Maity, Sushmita Mitra,
Abstract summary: This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline.<n>The experimental results show a reduction of $sim$ 35-55% tokens; thus reducing the computational costs relative to the baselines.
Score: 0.06554326244334867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The high computational demands of Vision Transformers (ViTs), in processing a huge number of tokens, often constrain their practical application in analyzing medical images. This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low-relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy facilitates end-to-end training, maintains gradient flow, and improves segmentation accuracy by focusing computational resources on essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens; thereby, enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of $\sim$ 35-55\% tokens; thus reducing the computational costs relative to the baselines. Cost-effective medical image processing, using our framework, facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.

Related papers

Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning [8.284127681482202]
'LVTP' is a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features with twice clustering.<n>It integrates high-level semantics and basic visual attributes for precise segmentation.<n>As a plug-and-play module, it requires no architectural changes or additional training.
arXiv Detail & Related papers (2025-04-25T00:43:20Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.<n>We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce textbfTopV, a compatible textbfTOken textbfPruning with inference Time Optimization for fast and low-memory textbfVLM.<n>Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of the Scaled-product Attention based on the Nystr"om approximation that is suitable for Continual Inference.<n>In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z)
RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.<n>Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.<n>To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.<n>We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation [12.249546377051438]
token merging has exhibited remarkable enhancements in inference speed, training efficiency, and memory utilization for image classification tasks. This paper facilitates the deployment of transformer-based architectures on resource-constrained devices and in real-time applications.
arXiv Detail & Related papers (2024-05-23T11:54:27Z)
PAM-UNet: Shifting Attention on Region of Interest in Medical Images [5.730272874074418]
UNet and its variants face a critical challenge: balancing accuracy with computational efficiency. We propose a novel underlineProgressive underlineAttention based underlineMobile underlineUNet architecture. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a dice score of 82.87.
arXiv Detail & Related papers (2024-05-02T17:33:26Z)
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation [18.168932826183024]
This work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation. Experiments suggest that the proposed DToP architecture reduces on average $20% - 35%$ of computational cost for current semantic segmentation methods.
arXiv Detail & Related papers (2023-08-02T09:40:02Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost. Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.