Shatter and Gather: Learning Referring Image Segmentation with Text
Supervision
- URL: http://arxiv.org/abs/2308.15512v2
- Date: Tue, 24 Oct 2023 13:57:40 GMT
- Title: Shatter and Gather: Learning Referring Image Segmentation with Text
Supervision
- Authors: Dongwon Kim, Namyup Kim, Cuiling Lan, Suha Kwak
- Abstract summary: We present a new model that discovers semantic entities in an input image and then combines those entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task as well as recent open-vocabulary segmentation models on all the benchmarks.
- Score: 52.46081425504072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring image segmentation, the task of segmenting arbitrary entities
described in free-form text, opens up a variety of vision applications.
However, manual labeling of training data for this task is prohibitively
costly, leading to a lack of labeled data for training. We address this issue
with a weakly supervised learning approach that uses text descriptions of
training images as the only source of supervision. To this end, we first
present a new model that discovers semantic entities in an input image and then
combines those entities relevant to the text query to predict the mask of the
referent. We also present a new loss function that allows the model to be
trained without any further supervision. Our method was evaluated on four
public benchmarks for referring image segmentation, where it clearly
outperformed the existing method for the same task as well as recent
open-vocabulary segmentation models on all the benchmarks.
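The abstract describes a two-stage idea: an entity-discovery stage that "shatters" the image into candidate regions, and a "gathering" stage that combines the regions relevant to the text query into the referent mask. The sketch below only illustrates that idea; the module names, the slot-attention-style discovery step, the dot-product relevance scoring, and all dimensions are assumptions made for illustration and are not the authors' implementation or loss.

```python
# Minimal sketch of a shatter-then-gather mask head (illustrative, not the paper's code).
import torch
import torch.nn as nn


class ShatterGatherSketch(nn.Module):
    def __init__(self, dim=256, num_entities=16):
        super().__init__()
        # "Shatter": learnable entity queries that compete for image regions (assumed design).
        self.entity_queries = nn.Parameter(torch.randn(num_entities, dim))
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # "Gather": project the text embedding to score each discovered entity (assumed design).
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, img_feats, text_emb):
        # img_feats: (B, HW, D) visual features; text_emb: (B, D) sentence embedding.
        B, HW, D = img_feats.shape
        q = self.entity_queries.unsqueeze(0).expand(B, -1, -1)        # (B, N, D)
        k, v = self.to_k(img_feats), self.to_v(img_feats)             # (B, HW, D)

        # Entity discovery: attention over pixels, normalized across entities so each
        # pixel is softly assigned to one entity (a slot-attention-like choice).
        attn = torch.einsum('bnd,bpd->bnp', q, k) / D ** 0.5           # (B, N, HW)
        assign = attn.softmax(dim=1)
        entity_feats = torch.einsum('bnp,bpd->bnd', assign, v)        # (B, N, D)

        # Gathering: relevance of each discovered entity to the text query.
        rel = torch.einsum('bnd,bd->bn', entity_feats,
                           self.text_proj(text_emb)).softmax(dim=-1)  # (B, N)

        # Referent mask = relevance-weighted combination of per-entity soft masks.
        mask = torch.einsum('bn,bnp->bp', rel, assign)                 # (B, HW)
        return mask, entity_feats, rel


# Usage with dummy tensors (batch of 2, 32x32 feature map, 256-dim features).
model = ShatterGatherSketch()
mask, ents, rel = model(torch.randn(2, 32 * 32, 256), torch.randn(2, 256))
print(mask.shape)  # torch.Size([2, 1024])
```

Since only image-text pairs are available in this setting, such a head would have to be trained with a text-only objective, for example contrastive alignment between the gathered features and the text embedding; that is one plausible choice, not necessarily the loss proposed in the paper.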
Related papers
- Language-guided Few-shot Semantic Segmentation [23.46604057006498]
We propose an innovative solution to tackle the challenge of few-shot semantic segmentation using only language information.
Our approach involves a vision-language-driven mask distillation scheme, which generates high-quality pseudo-semantic masks from text prompts.
Experiments on two benchmark datasets demonstrate that our method establishes a new baseline for language-guided few-shot semantic segmentation.
arXiv Detail & Related papers (2023-11-23T09:08:49Z)
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [38.14123683674355]
We propose a method to utilize the attention mechanism in the denoising network of text-to-image diffusion models.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg overcomes the lack of pixel-level supervision by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories.
The model is trained on artificial data, constructed by creating a 2D map of random semantic categories and another map of their corresponding word tokens.
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
arXiv Detail & Related papers (2023-03-25T08:19:31Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose Multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- From colouring-in to pointillism: revisiting semantic segmentation supervision [48.637031591058175]
We propose a pointillist approach for semantic segmentation annotation, where only point-wise yes/no questions are answered.
We collected and released 22.6M point labels over 4,171 classes on the Open Images dataset.
arXiv Detail & Related papers (2022-10-25T16:42:03Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Our Text grounded semantic SEGmentation (TSEG) method learns segmentation masks directly from image-level referring expressions, without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)