Weakly-supervised segmentation of referring expressions
- URL: http://arxiv.org/abs/2205.04725v2
- Date: Thu, 12 May 2022 07:17:56 GMT
- Title: Weakly-supervised segmentation of referring expressions
- Authors: Robin Strudel, Ivan Laptev, Cordelia Schmid
- Abstract summary: Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
- Score: 81.73850439141374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding localizes regions (boxes or segments) in the image
corresponding to given referring expressions. In this work we address image
segmentation from referring expressions, a problem that has so far only been
addressed in a fully-supervised setting. A fully-supervised setup, however,
requires pixel-wise supervision and is hard to scale given the expense of
manual annotation. We therefore introduce a new task of weakly-supervised image
segmentation from referring expressions and propose Text grounded semantic
SEGmentation (TSEG) that learns segmentation masks directly from image-level
referring expressions without pixel-level annotations. Our transformer-based
method computes patch-text similarities and guides the classification objective
during training with a new multi-label patch assignment mechanism. The
resulting visual grounding model segments image regions corresponding to given
natural language expressions. Our approach TSEG demonstrates promising results
for weakly-supervised referring expression segmentation on the challenging
PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when
evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.
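As a rough illustration of the patch-text similarity idea described in the abstract, the following NumPy sketch scores image patches against expression embeddings, pools those scores into one image-level score per expression (the kind of signal image-level supervision could train on), and thresholds the similarity maps into masks. It is not the authors' implementation: the function names, the softmax pooling, the temperature, and the threshold are all assumptions made for illustration, and the paper's multi-label patch assignment mechanism is not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def patch_text_similarity(patches, texts):
    """Cosine similarity between N patch and K text embeddings, shape (N, K)."""
    return l2_normalize(patches) @ l2_normalize(texts).T


def image_level_scores(sim, temperature=0.1):
    """Pool patch similarities into one score per expression, shape (K,).

    Assumption: a softmax over patches weights the patches that best match
    each expression. This pooling is a hypothetical choice, not TSEG's.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * sim).sum(axis=0)


def predict_masks(sim, grid_hw, threshold=0.5):
    """Threshold min-max normalised similarity maps into K boolean masks."""
    h, w = grid_hw
    lo = sim.min(axis=0, keepdims=True)
    hi = sim.max(axis=0, keepdims=True)
    norm = (sim - lo) / (hi - lo + 1e-8)
    return (norm > threshold).T.reshape(-1, h, w)


# Toy inputs: a 4x4 grid of patches and 3 expressions, 8-dim embeddings.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
texts = rng.normal(size=(3, 8))

sim = patch_text_similarity(patches, texts)
scores = image_level_scores(sim)
masks = predict_masks(sim, grid_hw=(4, 4))
```

In a real model the patch and text embeddings would come from trained image and text encoders; random vectors are used here only so that the shapes can be checked.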
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation.
We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [33.336549577936196]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and class labels, resulting in desired pseudo masks for training segmentation models.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation [117.36746226803993]
We introduce self-supervised spatially-consistent grouping with text-supervised semantic segmentation.
Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition.
Our method achieves 59.2% and 32.4% mIoU on the Pascal VOC and Pascal Context benchmarks, respectively.
arXiv Detail & Related papers (2023-04-03T16:24:39Z)
- Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims to localize and segment the target described by a given language expression.
We propose a parallel position-kernel-segmentation pipeline to better isolate and then interact with the localization and segmentation steps.
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)