AdaptiveClick: Clicks-aware Transformer with Adaptive Focal Loss for Interactive Image Segmentation
- URL: http://arxiv.org/abs/2305.04276v2
- Date: Thu, 14 Mar 2024 07:40:20 GMT
- Title: AdaptiveClick: Clicks-aware Transformer with Adaptive Focal Loss for Interactive Image Segmentation
- Authors: Jiacheng Lin, Jiajun Chen, Kailun Yang, Alina Roitberg, Siyu Li, Zhiyong Li, Shutao Li
- Abstract summary: We introduce AdaptiveClick -- a transformer-based, mask-adaptive segmentation framework for Interactive Image Segmentation (IIS).
The key ingredient of our method is the Click-Aware Mask-adaptive transformer Decoder (CAMD), which enhances the interaction between click and image features.
With a plain ViT backbone, extensive experimental results on nine datasets demonstrate the superiority of AdaptiveClick compared to state-of-the-art methods.
- Score: 51.82915587228898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive Image Segmentation (IIS) has emerged as a promising technique for decreasing annotation time. Substantial progress has been made in pre- and post-processing for IIS, but the critical issue of interaction ambiguity, notably hindering segmentation quality, has been under-researched. To address this, we introduce AdaptiveClick -- a click-aware transformer incorporating an adaptive focal loss that tackles annotation inconsistencies with tools for mask- and pixel-level ambiguity resolution. To the best of our knowledge, AdaptiveClick is the first transformer-based, mask-adaptive segmentation framework for IIS. The key ingredient of our method is the Click-Aware Mask-adaptive transformer Decoder (CAMD), which enhances the interaction between click and image features. Additionally, AdaptiveClick enables pixel-adaptive differentiation of hard and easy samples in the decision space, independent of their varying distributions. This is primarily achieved by optimizing a generalized Adaptive Focal Loss (AFL) with a theoretical guarantee, where two adaptive coefficients control the ratio of gradient values for hard and easy pixels. Our analysis reveals that the commonly used Focal and BCE losses can be considered special cases of the proposed AFL. With a plain ViT backbone, extensive experimental results on nine datasets demonstrate the superiority of AdaptiveClick compared to state-of-the-art methods. The source code is publicly available at https://github.com/lab206/AdaptiveClick.
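The abstract does not spell out the closed form of AFL, but its special-case claim can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical reading, not the paper's implementation: the coefficient `delta` and the use of the batch-mean confidence to adapt the focal exponent are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def adaptive_focal_loss(logits, targets, gamma=2.0, delta=0.5):
    """Illustrative adaptive focal loss (hypothetical form, not the paper's AFL).

    logits, targets: tensors of matching shape; targets holds 0/1 labels.
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # probability of the true class
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none"
    )
    # Adaptive coefficient (assumed form): when the batch is dominated by
    # easy pixels (high mean p_t), the exponent grows, down-weighting easy
    # pixels further and shifting gradient mass toward the hard ones.
    adaptive_gamma = gamma + delta * p_t.mean().detach()
    return ((1 - p_t) ** adaptive_gamma * bce).mean()
```

Setting `gamma = delta = 0` reduces the modulating factor to 1 and recovers plain BCE, while `delta = 0` recovers the standard Focal Loss with fixed exponent `gamma`, mirroring the abstract's claim that both losses are special cases of AFL.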
Related papers
- Investigating Shift Equivalence of Convolutional Neural Networks in
Industrial Defect Segmentation [3.843350895842836]
In industrial defect segmentation tasks, the output consistency (also referred to as equivalence) of the model is often overlooked.
A novel pair of down/upsampling layers called component attention polyphase sampling (CAPS) is proposed as a replacement for the conventional sampling layers in CNNs.
The experimental results on the micro surface defect (MSD) dataset and four real-world industrial defect datasets demonstrate that the proposed method exhibits higher equivalence and segmentation performance.
arXiv Detail & Related papers (2023-09-29T00:04:47Z) - Condition-Invariant Semantic Segmentation [77.10045325743644]
We implement Condition-Invariant Semantic Segmentation (CISS) on the current state-of-the-art domain adaptation architecture.
Our method achieves the second-best performance on the normal-to-adverse Cityscapes→ACDC benchmark.
CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.
arXiv Detail & Related papers (2023-05-27T03:05:07Z) - TransAdapt: A Transformative Framework for Online Test Time Adaptive
Semantic Segmentation [43.31250660146429]
Test-time adaptive (TTA) semantic segmentation adapts a source pre-trained image semantic segmentation model to unlabeled batches of target domain test images.
We propose TransAdapt, a framework that uses transformer and input transformations to improve segmentation performance.
arXiv Detail & Related papers (2023-02-24T01:45:29Z) - Refign: Align and Refine for Adaptation of Semantic Segmentation to
Adverse Conditions [78.71745819446176]
Refign is a generic extension to self-training-based UDA methods which leverages cross-domain correspondences.
Refign consists of two steps: (1) aligning the normal-condition image to the corresponding adverse-condition image using an uncertainty-aware dense matching network, and (2) refining the adverse prediction with the normal prediction using an adaptive label correction mechanism.
The approach introduces no extra training parameters, minimal computational overhead -- during training only -- and can be used as a drop-in extension to improve any given self-training-based UDA method.
arXiv Detail & Related papers (2022-07-14T11:30:38Z) - UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer
via Hierarchical Mask Calibration [49.16591283724376]
We design UniDAformer, a unified domain adaptive panoptic segmentation transformer that is simple but can achieve domain adaptive instance segmentation and semantic segmentation simultaneously within a single network.
UniDAformer introduces Hierarchical Mask Calibration (HMC) that rectifies inaccurate predictions at the level of regions, superpixels, and annotated pixels via online self-training on the fly.
It has three unique features: 1) it enables unified domain adaptive panoptic adaptation; 2) it mitigates false predictions and improves domain adaptive panoptic segmentation effectively; 3) it is end-to-end trainable with a much simpler training and inference pipeline.
arXiv Detail & Related papers (2022-06-30T07:32:23Z) - The Lighter The Better: Rethinking Transformers in Medical Image
Segmentation Through Adaptive Pruning [26.405243756778606]
We employ adaptive pruning in transformers for medical image segmentation and propose APFormer, a lightweight network.
To the best of our knowledge, this is the first work on transformer pruning for medical image analysis tasks.
We demonstrate, through ablation studies, that adaptive pruning can work as a plug-and-play module for performance improvement on other hybrid- and transformer-based methods.
arXiv Detail & Related papers (2022-06-29T05:49:36Z) - Iwin: Human-Object Interaction Detection via Transformer with Irregular
Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.