MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
- URL: http://arxiv.org/abs/2401.04403v2
- Date: Sat, 3 Feb 2024 03:50:42 GMT
- Title: MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
- Authors: Long Xu, Shanghong Li, Yongquan Chen, Jun Luo, Shiwu Lai
- Abstract summary: We propose a novel multi-scale token adaptation algorithm for interactive segmentation.
By performing top-k operations across multi-scale tokens, the computational complexity is greatly simplified.
We also propose a token learning algorithm based on contrastive loss.
- Score: 8.46894039954642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive segmentation has gained significant attention for its application
in human-computer interaction and data annotation. To address the target scale
variation issue in interactive segmentation, a novel multi-scale token
adaptation algorithm is proposed. By performing top-k operations across
multi-scale tokens, the computational complexity is greatly simplified while
ensuring performance. To enhance the robustness of multi-scale token selection,
we also propose a token learning algorithm based on contrastive loss. This
algorithm can effectively improve the performance of multi-scale token
adaptation. Extensive benchmarking shows that the algorithm achieves
state-of-the-art (SOTA) performance compared to current methods. An
interactive demo and all code needed to reproduce the results will be released at
https://github.com/hahamyt/mst.
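The two ideas in the abstract, top-k selection across multi-scale tokens and a contrastive loss for token learning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state how tokens are scored or how the loss is formed, so the cosine-similarity scoring, the click-derived `query` vector, the InfoNCE-style loss, and the function names `select_topk_tokens` and `info_nce` are all assumptions made for illustration.

```python
import numpy as np


def select_topk_tokens(token_scales, query, k):
    """Keep the k tokens, pooled over all scales, most similar to `query`.

    token_scales: list of (n_i, d) arrays, one per scale (hypothetical layout).
    query: (d,) vector, e.g. an embedding of the user's clicks (assumption).
    Returns (tokens, scale_ids, scores) for the selected tokens, best first.
    """
    tokens = np.concatenate(token_scales, axis=0)
    scale_ids = np.concatenate(
        [np.full(len(t), i) for i, t in enumerate(token_scales)]
    )
    # Cosine similarity between every token and the query (assumed scoring).
    tokens_n = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    query_n = query / (np.linalg.norm(query) + 1e-8)
    scores = tokens_n @ query_n
    k = max(1, min(k, len(scores)))
    # argpartition finds the top k without a full sort; then order those k.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return tokens[top], scale_ids[top], scores[top]


def info_nce(query, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: pull positives toward `query`, push negatives away.

    Inputs are assumed to be L2-normalised; this loss form is an assumption,
    not the loss defined in the paper.
    """
    pos = positives @ query / temperature          # (P,)
    neg = negatives @ query / temperature          # (N,)
    neg_lse = np.logaddexp.reduce(neg)             # log sum exp over negatives
    # Per positive: -log( e^pos / (e^pos + sum_j e^neg_j) )
    return float(np.mean(np.logaddexp(pos, neg_lse) - pos))


# Toy usage: two scales of 2-D tokens and a query pointing along the x-axis.
coarse = np.array([[1.0, 0.0], [0.0, 1.0]])
fine = np.array([[0.9, 0.1], [-1.0, 0.0], [0.7, 0.7]])
query = np.array([1.0, 0.0])
selected, scales, sims = select_topk_tokens([coarse, fine], query, k=2)
```

Only the k selected tokens would be passed to later, more expensive stages, which is where the cost saving in such a scheme would come from; tokens from different scales compete in a single ranking, so the mix of scales adapts to the target.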
Related papers
- Efficient Human-Object-Interaction (EHOI) Detection via Interaction Label Coding and Conditional Decision [33.59153869330463]
An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency.
Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases.
Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method.
arXiv Detail & Related papers (2024-08-13T16:34:06Z)
- Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation [12.249546377051438]
Token merging has exhibited remarkable enhancements in inference speed, training efficiency, and memory utilization for image classification tasks.
This paper facilitates the deployment of transformer-based architectures on resource-constrained devices and in real-time applications.
arXiv Detail & Related papers (2024-05-23T11:54:27Z)
- Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SecViT, which attains an impressive 84.2% image classification accuracy with only 27M parameters and 4.4G FLOPs.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Multi-level Contrast Network for Wearables-based Joint Activity Segmentation and Recognition [10.828099015828693]
Human activity recognition (HAR) with wearables is promising research that can be widely adopted in many smart healthcare applications.
Most HAR algorithms are susceptible to the multi-class windows problem that is essential yet rarely exploited.
We introduce the segmentation technology into HAR, yielding joint activity segmentation and recognition.
arXiv Detail & Related papers (2022-08-16T05:39:02Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
- Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation [88.49669148290306]
We propose a novel weakly supervised multi-task framework called AuxSegNet to leverage saliency detection and multi-label image classification as auxiliary tasks.
Inspired by their similar structured semantics, we also propose to learn a cross-task global pixel-level affinity map from the saliency and segmentation representations.
The learned cross-task affinity can be used to refine saliency predictions and propagate CAM maps to provide improved pseudo labels for both tasks.
arXiv Detail & Related papers (2021-07-25T11:39:58Z)
- Reviving Iterative Training with Mask Guidance for Interactive Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes.
We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps.
We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2021-02-12T15:44:31Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.