Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
- URL: http://arxiv.org/abs/2310.18049v1
- Date: Fri, 27 Oct 2023 10:52:50 GMT
- Title: Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
- Authors: Yucheng Suo, Linchao Zhu, Yi Yang
- Abstract summary: We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
- Score: 60.84423786769453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study a challenging task of zero-shot referring image
segmentation. This task aims to identify the instance mask that is most related
to a referring expression without training on pixel-level annotations. Previous
research takes advantage of pre-trained cross-modal models, e.g., CLIP, to
align instance-level masks with referring expressions. Yet, CLIP only considers the global-level
alignment of image-text pairs, neglecting fine-grained matching between the
referring sentence and local image regions. To address this challenge, we
introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image
segmentation framework that is training-free and robust to various visual
encoders. TAS incorporates a mask proposal network for instance-level mask
extraction, a text-augmented visual-text matching score for mining the
image-text correlation, and a spatial rectifier for mask post-processing.
Notably, the text-augmented visual-text matching score leverages a $P$-score
and an $N$-score in addition to the typical visual-text matching score. The
$P$-score is utilized to close the visual-text domain gap through a surrogate
captioning model, where the score is computed between the surrogate
model-generated texts and the referring expression. The $N$-score considers the
fine-grained alignment of region-text pairs via negative phrase mining,
encouraging the masked image to be repelled from the mined distracting phrases.
Extensive experiments are conducted on various datasets, including RefCOCO,
RefCOCO+, and RefCOCOg. The proposed method clearly outperforms
state-of-the-art zero-shot referring image segmentation methods.
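To make the score composition concrete, below is a minimal sketch of how a typical visual-text matching score could be combined with a $P$-score (similarity between captions generated by a surrogate captioning model and the referring expression) and an $N$-score (similarity to mined distracting phrases). The tensor interfaces and the weights `alpha` and `beta` are assumptions for illustration, not the authors' implementation; the paper defines the exact formulation.

```python
# A hypothetical sketch of the text-augmented visual-text matching score.
# Feature tensors are assumed to come from a CLIP-like encoder and a surrogate
# captioning model; the weighting scheme (alpha, beta) is illustrative only.
import torch
import torch.nn.functional as F

def text_augmented_score(
    masked_image_feats: torch.Tensor,   # (M, D) visual features of M masked proposals
    referring_feat: torch.Tensor,       # (D,)   text feature of the referring expression
    caption_feats: torch.Tensor,        # (M, D) text features of captions generated for
                                        #        each masked proposal (surrogate captioner)
    negative_feats: torch.Tensor,       # (K, D) text features of mined distracting phrases
    alpha: float = 1.0,                 # assumed weight of the P-score
    beta: float = 1.0,                  # assumed weight of the N-score
) -> torch.Tensor:
    """Return one score per mask proposal; a higher score means a better match."""
    img = F.normalize(masked_image_feats, dim=-1)
    ref = F.normalize(referring_feat, dim=-1)
    cap = F.normalize(caption_feats, dim=-1)
    neg = F.normalize(negative_feats, dim=-1)

    # Typical visual-text matching score: cosine similarity between each masked
    # image region and the referring expression.
    v_score = img @ ref                           # (M,)

    # P-score: text-text similarity between surrogate captions and the referring
    # expression, bridging the visual-text domain gap.
    p_score = cap @ ref                           # (M,)

    # N-score: penalize proposals attracted to the mined distracting phrases
    # (here, the strongest negative similarity per proposal).
    n_score = (img @ neg.T).max(dim=-1).values    # (M,)

    return v_score + alpha * p_score - beta * n_score
```

Under this sketch, the proposal with the highest combined score (e.g., `scores.argmax()`) would be selected as the predicted mask before spatial rectification.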
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method for semantic segmentation.
We introduce Contrastive Soft Clustering to align masks with the image's structure information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both coarse-grained image-level tasks and fine-grained region-level tasks.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
- Zero-shot Referring Image Segmentation with Global-Local Context Features [8.77461711080319]
Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image.
We propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.
In our experiments, the proposed method outperforms several zero-shot baselines for the task, and even a weakly supervised referring expression segmentation method, by substantial margins.
arXiv Detail & Related papers (2023-03-31T06:00:50Z)
- Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [10.484851004093919]
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images.
Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts.
We propose a novel Text-grounded Contrastive Learning framework that enables a model to directly learn region-text alignment.
arXiv Detail & Related papers (2022-12-01T18:59:03Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
- Detector-Free Weakly Supervised Grounding by Separation [76.65699170882036]
Weakly Supervised phrase-Grounding (WSG) deals with the task of learning to localize arbitrary text phrases in images from weak image-sentence supervision.
We propose Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
We demonstrate a significant accuracy improvement of up to 8.5% over the previous DF-WSG state of the art.
arXiv Detail & Related papers (2021-04-20T08:27:31Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.