Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
- URL: http://arxiv.org/abs/2507.09612v1
- Date: Sun, 13 Jul 2025 12:33:37 GMT
- Title: Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
- Authors: You Huang, Lichao Chen, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji,
- Abstract summary: Interactive segmentation improves annotation efficiency by segmenting target regions from user prompts. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy but suffer from prohibitively slow processing on CPU devices. We propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing.
- Score: 58.0729162588429
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N²)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
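To make the routing idea behind DHA concrete, below is a minimal PyTorch-style sketch that sends boundary tokens through exact softmax attention and non-boundary tokens through a kernelized linear attention. The function name, shapes, and the linear-attention variant are illustrative assumptions; this is not the paper's BSQ attention or the authors' implementation.

```python
import torch
import torch.nn.functional as F


def dynamic_hybrid_attention(q, k, v, boundary_mask):
    """Route boundary tokens through full softmax attention (O(N^2))
    and non-boundary tokens through a kernelized linear attention (O(N)).

    q, k, v: (N, D) token features; boundary_mask: (N,) bool tensor,
    e.g. derived from the previous segmentation mask. Hypothetical sketch.
    """
    N, D = q.shape
    out = torch.empty_like(v)

    # Boundary tokens: exact softmax attention over all N keys.
    qb = q[boundary_mask]                                # (Nb, D)
    attn = torch.softmax(qb @ k.T / D ** 0.5, dim=-1)    # (Nb, N)
    out[boundary_mask] = attn @ v                        # (Nb, D)

    # Non-boundary tokens: linear attention via a positive feature map,
    # so all keys/values are summarized once in a (D, D) matrix.
    phi = lambda x: F.elu(x) + 1
    qn, kn = phi(q[~boundary_mask]), phi(k)              # (Nn, D), (N, D)
    kv = kn.T @ v                                        # (D, D)
    z = qn @ kn.sum(dim=0, keepdim=True).T               # (Nn, 1) normalizer
    out[~boundary_mask] = (qn @ kv) / (z + 1e-6)         # (Nn, D)
    return out
```

In the paper the boundary/non-boundary split comes from the previous segmentation mask; here `boundary_mask` could be any boolean tensor (e.g. `torch.rand(N) < 0.2` for testing). Routing only the typically small set of boundary tokens through quadratic attention is what keeps the overall cost close to linear.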
Related papers
- High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction. We propose a training-free adaptive masking module for acceleration. Our method reduces FLOPs by 24–43% for state-of-the-art models.
arXiv Detail & Related papers (2025-05-11T13:18:03Z) - CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection [7.262250906929891]
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection. To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations. First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism. Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery.
arXiv Detail & Related papers (2025-04-02T03:22:36Z) - EDM: Efficient Deep Feature Matching [8.107498154867178]
We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features.
arXiv Detail & Related papers (2025-03-07T03:47:30Z) - FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [11.146015814220858]
We propose FiRST, an algorithm that reduces latency by using layer-specific routers to adaptively select a subset of transformer layers for each input sequence. FiRST preserves compatibility with KV caching, enabling faster inference while remaining quality-aware. Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on the task.
arXiv Detail & Related papers (2024-10-16T12:45:35Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Efficient Time Series Processing for Transformers and State-Space Models through Token Merging [44.27818172708914]
Token merging has emerged as a solution to increase computational efficiency in computer vision architectures. We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood. Our comprehensive empirical evaluation demonstrates that local merging offers substantial efficiency gains with minimal impact on accuracy.
arXiv Detail & Related papers (2024-05-28T08:28:18Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Threshold-adaptive Unsupervised Focal Loss for Domain Adaptation of Semantic Segmentation [25.626882426111198]
Unsupervised domain adaptation (UDA) for semantic segmentation has recently gained increasing research attention.
In this paper, we propose a novel two-stage entropy-based UDA method for semantic segmentation.
Our method achieves state-of-the-art 58.4% and 59.6% mIoUs on SYNTHIA-to-Cityscapes and GTA5-to-Cityscapes using DeepLabV2 and competitive performance using the lightweight BiSeNet.
arXiv Detail & Related papers (2022-08-23T03:48:48Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reach state-of-the-art accuracies on the parameter-limited setting of ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [86.03382625531951]
DANCE is an automated simultaneous data-network co-optimization framework for efficient segmentation model training and inference. It integrates automated data slimming, which adaptively downsamples or drops input images and controls their corresponding contribution to the training loss, guided by the images' spatial complexity. Experiments and ablation studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Boundary-assisted Region Proposal Networks for Nucleus Segmentation [89.69059532088129]
Machine learning models struggle to perform well because of the large number of crowded nuclei.
We devise a Boundary-assisted Region Proposal Network (BRP-Net) that achieves robust instance-level nucleus segmentation.
arXiv Detail & Related papers (2020-06-04T08:26:38Z)