IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis
- URL: http://arxiv.org/abs/2503.00936v1
- Date: Sun, 02 Mar 2025 15:19:37 GMT
- Title: IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis
- Authors: Yuji Wang, Jingchen Ni, Yong Liu, Chun Yuan, Yansong Tang
- Abstract summary: Zero-shot Referring Image Segmentation identifies the instance mask that best aligns with a referring expression without training or fine-tuning. Previous CLIP-based models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. IteRPrimE outperforms previous state-of-the-art zero-shot methods, particularly excelling in out-of-domain scenarios.
- Score: 46.502962768034166
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Zero-shot Referring Image Segmentation (RIS) identifies the instance mask that best aligns with a specified referring expression without training or fine-tuning, significantly reducing the labor-intensive annotation process. Despite achieving commendable results, previous CLIP-based models have a critical drawback: they exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. This is because they generate all possible masks on an image and evaluate each masked region for similarity to the given expression, which often dulls their sensitivity to direct positional cues in the text input. Moreover, most methods are weak at modeling the relationship between primary words and their context, causing confusion and reduced accuracy in identifying the correct target region. To address these challenges, we propose IteRPrimE (Iterative Grad-CAM Refinement and Primary word Emphasis), which leverages a saliency heatmap obtained through Grad-CAM from a Vision-Language Pre-trained (VLP) model for image-text matching. An iterative Grad-CAM refinement strategy progressively sharpens the model's focus on the target region, overcoming positional insensitivity and creating a self-correcting effect. Additionally, we design a Primary Word Emphasis module that helps the model handle complex semantic relations, enhancing its ability to attend to the intended object. Extensive experiments on the RefCOCO/+/g and PhraseCut benchmarks demonstrate that IteRPrimE outperforms previous state-of-the-art zero-shot methods, particularly excelling in out-of-domain scenarios.
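To make the abstract's pipeline concrete, here is a minimal, illustrative sketch of the two components; it is not the authors' code. `vlp.itm_gradcam` stands in for Grad-CAM extracted from a VLP image-text matching head (assumed to return a heatmap the size of its input), and the crop-and-rescore loop, threshold, and word-repetition heuristic are all simplifying assumptions.

```python
import numpy as np

def iterative_gradcam_refine(image, expression, vlp, n_iters=3, thr=0.5):
    """Iterative Grad-CAM refinement (sketch): at each step, a Grad-CAM
    heatmap from a VLP image-text matching head highlights the region most
    aligned with the expression; the next step re-runs matching on a crop
    around that region, progressively correcting positional insensitivity.
    `vlp.itm_gradcam(img, text)` is a hypothetical API."""
    H, W = image.shape[:2]
    y0 = x0 = 0
    full = np.zeros((H, W))
    region = image
    for _ in range(n_iters):
        cam = vlp.itm_gradcam(region, expression)   # heatmap over current crop
        ys, xs = np.where(cam >= thr * cam.max())
        if xs.size == 0:
            break
        # Paste the crop's heatmap back into full-image coordinates.
        full[:] = 0
        full[y0:y0 + cam.shape[0], x0:x0 + cam.shape[1]] = cam
        # Zoom into the salient box for the next iteration (self-correction).
        y0, x0 = y0 + int(ys.min()), x0 + int(xs.min())
        region = image[y0:y0 + int(ys.max() - ys.min()) + 1,
                       x0:x0 + int(xs.max() - xs.min()) + 1]
    return full

def emphasize_primary(expression, primary_word):
    """Crude stand-in for the Primary Word Emphasis module: repeat the head
    noun so image-text matching attends to the intended object rather than
    its context (the paper's actual module is more involved)."""
    return f"{expression} {primary_word}"

def select_mask(heatmap, candidate_masks):
    """Pick the candidate instance mask that best overlaps the final heatmap."""
    scores = [heatmap[m > 0].mean() for m in candidate_masks]
    return candidate_masks[int(np.argmax(scores))]
```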
Related papers
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description, without requiring in-distribution triplets for training.
One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space.
We propose a Data-efficient Generalization (DeG) framework comprising two novel designs, namely a Textual Supplement (TS) module and a Semantic-Set (S-Set); a minimal sketch of the pseudo-word mapping follows this entry.
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
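The pseudo-word paradigm mentioned above is easy to illustrate. Below is a minimal sketch, not this paper's implementation: the mapping network is a small MLP with illustrative dimensions, projecting a frozen CLIP image embedding into the text token space.

```python
import torch
import torch.nn as nn

class PseudoWordMapper(nn.Module):
    """Sketch of the mapping-network paradigm: project a CLIP image
    embedding into the text token space so it can be spliced into a
    caption as a pseudo-word (dimensions are illustrative)."""

    def __init__(self, img_dim=768, tok_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, tok_dim),
            nn.GELU(),
            nn.Linear(tok_dim, tok_dim),
        )

    def forward(self, img_emb):
        # One pseudo-token embedding per image.
        return self.mlp(img_emb)

# Usage: the pseudo-token replaces a placeholder (e.g. "*") in a template
# such as "a photo of * that is red"; the composed text embedding is then
# compared against gallery image embeddings for retrieval.
mapper = PseudoWordMapper()
pseudo_token = mapper(torch.randn(1, 768))  # shape: (1, 512)
```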
- Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) that leverages target-related textual cues to progressively localize the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z)
- Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [33.336549577936196]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision. We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space. SemPLeS achieves better semantic alignment between object regions and class labels, producing the pseudo masks desired for training segmentation models.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
- Semi-supervised Semantic Segmentation Meets Masked Modeling: Fine-grained Locality Learning Matters in Consistency Regularization [31.333862320143968]
Semi-supervised semantic segmentation aims to utilize limited labeled images and abundant unlabeled images to achieve label-efficient learning.
We propose a novel framework called MaskMatch, which enables fine-grained locality learning to achieve better dense segmentation.
arXiv Detail & Related papers (2023-12-14T03:28:53Z)
- Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach that substantially improves generalization by addressing two key dilemmas.
Specifically, to deal with unconstrained texts, we propose to augment a given expression with an explicit and crucial prompt that complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z)
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of pre-trained visual-semantic models for generating high-quality contextual descriptions; a minimal sketch of the receiver-based evaluation follows this entry.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
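The receiver protocol above lends itself to a short sketch. This is an illustration only, not DisCLIP's implementation: `clip_sim(crop, text)` is a hypothetical image-text similarity callable standing in for a pre-trained comprehension model.

```python
import numpy as np

def receiver_accuracy(descriptions, candidate_crops, target_idx, clip_sim):
    """Receiver-based evaluation (sketch): a receiver scores each candidate
    object crop against the generated description, and generation counts as
    a success when the true target scores highest."""
    hits = 0
    for desc, crops, tgt in zip(descriptions, candidate_crops, target_idx):
        scores = [clip_sim(crop, desc) for crop in crops]
        hits += int(np.argmax(scores) == tgt)
    return hits / len(descriptions)
```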
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining yields representations that transfer competitively to several challenging semantic segmentation benchmarks; a minimal sketch of the pretext task follows this entry.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
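The relative-location pretext task above can be sketched in a few lines. This is a minimal illustration, not the paper's architecture: the grid size, feature dimension, and pooling scheme are assumptions.

```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Sketch of the pretext task: given features of a reference patch and
    a query patch, predict the query's position relative to the reference
    on a coarse grid (dimensions are illustrative)."""

    def __init__(self, dim=384, grid=7):
        super().__init__()
        self.classify = nn.Linear(2 * dim, grid * grid)

    def forward(self, ref_feats, query_feats, mask_ratio=0.5):
        # Mask a random subset of reference features to control difficulty,
        # mirroring the masking knob mentioned in the summary.
        keep = (torch.rand_like(ref_feats[..., :1]) > mask_ratio).float()
        ref = (ref_feats * keep).mean(dim=1)   # pooled visible reference tokens
        qry = query_feats.mean(dim=1)          # pooled query tokens
        return self.classify(torch.cat([ref, qry], dim=-1))  # grid-cell logits

# Usage: logits over 49 grid cells for a batch of 8 patch-feature maps.
logits = RelativeLocationHead()(torch.randn(8, 49, 384), torch.randn(8, 9, 384))
```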
- Efficient Modeling of Future Context for Image Captioning [38.52032153180971]
Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations via a modified mask operation.
Our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2022-07-22T06:21:43Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions, without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
- Causal Intervention for Weakly-Supervised Semantic Segmentation [122.1846968696862]
We aim to generate better pixel-level pseudo-masks by using only image-level labels.
We propose a structural causal model to analyze the causalities among images, contexts, and class labels.
Based on this model, we develop a new method, Context Adjustment (CONTA), to remove the confounding bias in image-level classification; a minimal sketch of the backdoor-adjustment idea follows this entry.
arXiv Detail & Related papers (2020-09-26T09:26:29Z)
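In its simplest form, the causal intervention above reduces to backdoor adjustment: rather than scoring a class from the image alone, predictions are averaged over context strata so contexts stop acting as confounders. The sketch below illustrates that textbook formula, not CONTA's actual implementation; `model`, `contexts`, and `context_priors` are hypothetical.

```python
import numpy as np

def backdoor_adjusted_scores(model, image, contexts, context_priors):
    """Backdoor adjustment (sketch):
        P(Y | do(X)) = sum_c P(Y | X, c) * P(c).
    `model(image, context)` is a hypothetical classifier conditioned on an
    explicit context representation; `context_priors[c]` is the prior of
    context stratum c estimated from the training set."""
    scores = np.zeros(model.num_classes)
    for context, prior in zip(contexts, context_priors):
        scores += model(image, context) * prior  # average over context strata
    return scores
```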