Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
- URL: http://arxiv.org/abs/2404.11998v1
- Date: Thu, 18 Apr 2024 08:46:12 GMT
- Title: Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
- Authors: Qiyuan Dai, Sibei Yang
- Abstract summary: Point PrompTing (PPT) is a weakly supervised RIS framework whose core point generator harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability.
PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU.
- Score: 13.924553294859315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet it relies on cost-intensive mask annotations. Weakly supervised RIS therefore learns pixel-level semantics from image-text pairs alone, which makes segmenting fine-grained masks challenging. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression, due to inevitable noise and an excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), combined with a proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also produces negative point prompts to address the noise and excessive-focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually progress from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques, improving mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
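The pipeline above lends itself to a compact illustration. The following is a minimal sketch, not the paper's method: a crude CLIP-scored grid stands in for PPT's learned point generator, the curriculum learning stage is omitted, and the checkpoint path, grid size, and the choice of one positive plus one negative point are placeholder assumptions. Only the `clip` and `segment_anything` package calls are real APIs.

```python
# Illustrative only: a crude CLIP-scored grid stands in for PPT's learned
# point generator, and the curriculum stage is omitted. Package calls
# (clip, segment_anything) are real; paths and hyperparameters are placeholders.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

def clip_grid_scores(pil_img, expression, grid=4):
    """Score each grid cell of the image against the expression with CLIP."""
    w, h = pil_img.size
    text = clip.tokenize([expression]).to(device)
    cells, centers = [], []
    for gy in range(grid):
        for gx in range(grid):
            box = (gx * w // grid, gy * h // grid,
                   (gx + 1) * w // grid, (gy + 1) * h // grid)
            cells.append(clip_preprocess(pil_img.crop(box)))
            centers.append(((box[0] + box[2]) / 2, (box[1] + box[3]) / 2))
    with torch.no_grad():
        img_feats = clip_model.encode_image(torch.stack(cells).to(device))
        txt_feats = clip_model.encode_text(text)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        scores = (img_feats @ txt_feats.T).squeeze(1)
    return scores.float().cpu().numpy(), np.array(centers)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

pil_img = Image.open("example.jpg").convert("RGB")
scores, centers = clip_grid_scores(pil_img, "the dog on the left")

# One positive point at the best-matching cell, one negative at the worst.
# SAM's convention: label 1 = foreground point, label 0 = background point.
points = np.stack([centers[scores.argmax()], centers[scores.argmin()]])
labels = np.array([1, 0])

predictor.set_image(np.array(pil_img))
masks, mask_scores, _ = predictor.predict(
    point_coords=points, point_labels=labels, multimask_output=False)
```

The key detail the sketch preserves is SAM's point-label convention: label 1 marks a foreground (positive) point and label 0 a background (negative) point, which is the hook PPT exploits to suppress noise and excessive focus on object parts.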
Related papers
- Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation [10.958014189747356]
We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervision for referring image segmentation (RIS).
Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets.
It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS.
arXiv Detail & Related papers (2024-07-10T07:14:48Z)
- HARIS: Human-Like Attention for Reference Image Segmentation [5.808325471170541]
We propose a referring image segmentation method called HARIS, which introduces the Human-Like Attention mechanism.
Our method achieves state-of-the-art performance and great zero-shot ability.
arXiv Detail & Related papers (2024-05-17T11:29:23Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context (a toy prompt-boosting sketch follows this entry).
arXiv Detail & Related papers (2023-12-01T09:31:24Z)
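As a toy illustration of the prompt-boosting idea, and only that: the summary does not give the paper's actual prompt, so the template below is a hypothetical stand-in.

```python
# Hypothetical prompt-boosting template; the paper's actual prompt wording
# is not given in this summary, so this exact string is an assumption.
def boost_expression(expression: str) -> str:
    # Wrap the free-form expression in an explicit, task-oriented context
    # so the text encoder always sees the referent framed the same way.
    return f"segment the target object referred to by: {expression}"

print(boost_expression("woman in a red coat holding an umbrella"))
# segment the target object referred to by: woman in a red coat holding an umbrella
```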
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach involves generating fine-grained patch-text pair data by mixing image patches while preserving the correspondence between patches and text (a minimal mixing sketch follows this entry).
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
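A minimal sketch of the cross-image patch mixing described above, written in plain PyTorch; the tensor shapes, the 50/50 mixing rule, and the function name are illustrative assumptions rather than MixReorg's exact recipe.

```python
# Illustrative cross-image patch mixing; shapes and the mixing rule are
# assumptions, not the paper's exact recipe.
import torch

def mix_patches(patches_a, patches_b, mix_ratio=0.5):
    """patches_*: (num_patches, dim) embeddings from two captioned images.

    Returns mixed patches plus an origin mask, so patch-text correspondence
    survives the mixing (origin 0 -> image/caption A, origin 1 -> B).
    """
    num_patches = patches_a.shape[0]
    origin = (torch.rand(num_patches) < mix_ratio).long()
    mixed = torch.where(origin.bool().unsqueeze(-1), patches_b, patches_a)
    return mixed, origin

patches_a = torch.randn(196, 512)  # e.g. 14x14 ViT patch embeddings
patches_b = torch.randn(196, 512)
mixed, origin = mix_patches(patches_a, patches_b)
# A contrastive objective can now pull each mixed patch toward the text
# features of caption A or caption B according to `origin`, yielding the
# fine-grained patch-text pairs the summary describes.
```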
- Extending CLIP's Image-Text Alignment to Referring Image Segmentation [48.26552693472177]
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression.
We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS.
arXiv Detail & Related papers (2023-06-14T13:27:28Z)
- CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner (a toy patch-word alignment sketch follows this entry).
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
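To make "fine-grained patch-word feature alignment" concrete, here is a hedged sketch of the similarity structure such a method could supervise; the shapes, random stand-in features, and pooling choices are purely illustrative.

```python
# Toy patch-word alignment: cosine similarities between visual patches and
# expression tokens; random features stand in for real encoder outputs.
import torch
import torch.nn.functional as F

patch_feats = F.normalize(torch.randn(196, 512), dim=-1)  # image patches
word_feats = F.normalize(torch.randn(12, 512), dim=-1)    # expression tokens

sim = patch_feats @ word_feats.T         # (196, 12) patch-word similarities
word_to_patch = sim.softmax(dim=0)       # where each word attends in the image
patch_relevance = sim.max(dim=1).values  # per-patch relevance to the text
```

A distillation loss could then align these student similarities with ones computed from CLIP features, which is the knowledge transfer the summary describes.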
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining (a schematic sketch follows this entry).
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
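The summary's "masked self-distillation" admits a compact schematic. The sketch below is an assumed, generic formulation (EMA teacher on the full image, student on a masked view), not MaskCLIP's exact objective; the mask ratio, momentum, and tiny encoder are placeholders.

```python
# Generic masked self-distillation sketch; hyperparameters and the toy
# encoder are illustrative, and CLIP's contrastive image-text loss would
# be trained alongside this term in full pretraining.
import copy
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=2)
teacher = copy.deepcopy(student)  # updated by EMA, never by gradients
for p in teacher.parameters():
    p.requires_grad_(False)

patches = torch.randn(4, 196, 512)   # (batch, patches, dim) embeddings
mask = torch.rand(4, 196) < 0.75     # hide 75% of patches from the student
student_out = student(patches.masked_fill(mask.unsqueeze(-1), 0.0))
with torch.no_grad():
    teacher_out = teacher(patches)   # teacher sees the unmasked image

# Distill only at the masked positions.
loss = ((student_out - teacher_out) ** 2)[mask].mean()
loss.backward()

# EMA update of the teacher (momentum 0.996 is an illustrative choice).
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
```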
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.