Referring Image Segmentation Using Text Supervision
- URL: http://arxiv.org/abs/2308.14575v1
- Date: Mon, 28 Aug 2023 13:40:47 GMT
- Title: Referring Image Segmentation Using Text Supervision
- Authors: Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin,
Gerhard Hancke, Rynson Lau
- Abstract summary: Existing Referring Image Segmentation (RIS) methods typically require expensive pixel-level or box-level annotations for supervision.
We propose a novel weakly-supervised RIS framework to formulate the target localization problem as a classification process.
Our framework achieves performance comparable to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas.
- Score: 44.27304699305985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Referring Image Segmentation (RIS) methods typically require
expensive pixel-level or box-level annotations for supervision. In this paper,
we observe that the referring texts used in RIS already provide sufficient
information to localize the target object. Hence, we propose a novel
weakly-supervised RIS framework to formulate the target localization problem as
a classification process to differentiate between positive and negative text
expressions. While the referring text expressions for an image are used as
positive expressions, the referring text expressions from other images can be
used as negative expressions for this image. Our framework has three main
novelties. First, we propose a bilateral prompt method to facilitate the
classification process, by harmonizing the domain discrepancy between visual
and linguistic features. Second, we propose a calibration method to reduce
noisy background information and improve the correctness of the response maps
for target object localization. Third, we propose a positive response map
selection strategy to generate high-quality pseudo-labels from the enhanced
response maps, for training a segmentation network for RIS inference. For
evaluation, we propose a new metric to measure localization accuracy.
Experiments on four benchmarks show that our framework achieves performance
comparable to existing fully-supervised RIS methods while outperforming
state-of-the-art weakly-supervised methods adapted from related areas. Code is
available at https://github.com/fawnliu/TRIS.
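The abstract's central idea, treating target localization as classification between positive and negative text expressions, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it uses random vectors in place of a real vision-language encoder, and the function name, dimensions, and temperature value are assumptions. Referring expressions for the image act as the positive class; expressions borrowed from other images act as negatives.

```python
import numpy as np

def classification_loss(image_feat, text_feats, pos_index=0, temperature=0.07):
    """Score an image against positive and negative text expressions.

    image_feat: (D,) image embedding; text_feats: (N, D) text embeddings,
    where row `pos_index` is a referring expression for this image and the
    remaining rows come from other images (negatives).
    Returns (loss, probs): cross-entropy loss and softmax probabilities.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    logits = txt @ img / temperature   # (N,) similarity scores
    logits -= logits.max()             # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[pos_index])   # cross-entropy against the positive
    return loss, probs

# Toy example: the positive text embedding is close to the image embedding,
# the negatives (texts describing other images) are random.
rng = np.random.default_rng(0)
image = rng.normal(size=64)
pos_text = image + 0.1 * rng.normal(size=64)   # aligned with the image
neg_texts = rng.normal(size=(3, 64))           # expressions from other images
texts = np.vstack([pos_text, neg_texts])

loss, probs = classification_loss(image, texts)
assert probs.argmax() == 0   # the positive expression scores highest
```

In the paper itself this classification signal is combined with the bilateral prompt, response-map calibration, and pseudo-label selection steps described above, which the sketch does not attempt to reproduce.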
Related papers
- Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z)
- HARIS: Human-Like Attention for Reference Image Segmentation [5.808325471170541]
We propose a referring image segmentation method called HARIS, which introduces the Human-Like Attention mechanism.
Our method achieves state-of-the-art performance and great zero-shot ability.
arXiv Detail & Related papers (2024-05-17T11:29:23Z)
- Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label [7.400926717561454]
This paper investigates a framework for weakly-supervised object localization.
It aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels.
arXiv Detail & Related papers (2024-04-15T06:02:09Z)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- Bilateral Reference for High-Resolution Dichotomous Image Segmentation [109.35828258964557]
We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS).
It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef).
Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference and gradient maps serve as the target reference.
arXiv Detail & Related papers (2024-01-07T07:56:47Z)
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text-grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image, as referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net)
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.