ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency
- URL: http://arxiv.org/abs/2302.10307v1
- Date: Tue, 31 Jan 2023 01:57:52 GMT
- Title: ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency
- Authors: Pengzhen Ren, Changlin Li, Hang Xu, Yi Zhu, Guangrun Wang, Jianzhuang
Liu, Xiaojun Chang, Xiaodan Liang
- Abstract summary: We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, great progress has been made in learning visual representations from
text supervision, facilitating the emergence of text-supervised semantic
segmentation. However, existing works focus on pixel grouping and cross-modal
semantic alignment, while ignoring the correspondence among multiple augmented
views of the same image. To overcome such limitation, we propose
multi-View Consistent learning (ViewCo) for text-supervised
semantic segmentation. Specifically, we first propose text-to-views consistency
modeling to learn correspondence for multiple views of the same input image.
Additionally, we propose cross-view segmentation consistency modeling to
address the ambiguity issue of text supervision by contrasting the segment
features of Siamese visual encoders. The text-to-views consistency benefits the
dense assignment of the visual features by encouraging different crops to align
with the same text, while the cross-view segmentation consistency modeling
provides additional self-supervision, overcoming the limitation of ambiguous
text supervision for segmentation masks. Trained with large-scale image-text
data, our model can directly segment objects of arbitrary categories in a
zero-shot manner. Extensive experiments show that ViewCo outperforms
state-of-the-art methods on average by up to 2.9%, 1.6%, and 2.4% mIoU on
PASCAL VOC2012, PASCAL Context, and COCO, respectively.
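The two objectives above can be pictured as contrastive losses. What follows is a minimal sketch of that reading, not the authors' code: module names, tensor shapes, and the InfoNCE form are all assumptions.

```python
# Minimal sketch of ViewCo's two objectives, assuming a CLIP-style setup.
# All names and shapes are illustrative; this is not the authors' code.
import torch
import torch.nn.functional as F

def text_to_views_consistency(text_emb, view_embs, temperature=0.07):
    """Contrastive loss encouraging every augmented view of an image
    to align with the same caption embedding.

    text_emb:  (B, D)    one caption embedding per image
    view_embs: (V, B, D) embeddings of V augmented views per image
    """
    text_emb = F.normalize(text_emb, dim=-1)
    loss = 0.0
    for view in view_embs:                          # (B, D)
        view = F.normalize(view, dim=-1)
        logits = view @ text_emb.t() / temperature  # (B, B)
        labels = torch.arange(len(view), device=view.device)
        # symmetric InfoNCE: match view i with caption i and vice versa
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / len(view_embs)

def cross_view_segment_consistency(seg_a, seg_b, temperature=0.07):
    """Contrast segment features produced by Siamese visual encoders
    for two views of the same image; this supplies a training signal
    that does not depend on the ambiguous text supervision.

    seg_a, seg_b: (B, S, D) S segment tokens per image from each branch
    """
    a = F.normalize(seg_a.flatten(0, 1), dim=-1)    # (B*S, D)
    b = F.normalize(seg_b.flatten(0, 1), dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(len(a), device=a.device)
    return F.cross_entropy(logits, labels)
```

Read this way, the first term pulls every augmented crop toward the shared caption (text-to-views consistency), while the second gives the segment features a text-free self-supervisory signal (cross-view segmentation consistency).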
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation (arXiv, 2024-10-15)
InvSeg is a test-time prompt inversion method for semantic segmentation.
We introduce Contrastive Soft Clustering to align masks with the image's structure information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
- Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation (arXiv, 2024-04-05)
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision (arXiv, 2024-02-14)
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
- Segment Everything Everywhere All at Once (arXiv, 2023-04-13)
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
- Weakly-Supervised Text Instance Segmentation (arXiv, 2023-03-20)
We make the first attempt to perform weakly-supervised text instance segmentation by bridging text recognition and text segmentation.
The proposed method significantly outperforms weakly-supervised instance segmentation methods on the ICDAR13-FST (18.95% improvement) and TextSeg (17.80% improvement) benchmarks.
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding (arXiv, 2022-07-18)
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
- CRIS: CLIP-Driven Referring Image Segmentation (arXiv, 2021-11-30)
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
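The text-to-pixel alignment mentioned above can be illustrated with a small sketch; the shapes, names, and binary cross-entropy form below are assumptions rather than the CRIS implementation.

```python
# Hypothetical illustration of text-to-pixel contrastive alignment in
# the spirit of CRIS; all names and shapes are assumptions.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_emb, target_mask):
    """pixel_feats: (B, D, H, W) decoded per-pixel visual features
    text_emb:    (B, D)       sentence embedding of the referring text
    target_mask: (B, H, W)    1 inside the referred object, 0 outside
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    # similarity of every pixel to its sentence: (B, H, W)
    logits = torch.einsum('bdhw,bd->bhw', pixel_feats, text_emb)
    # pull pixels of the referred region toward the text embedding,
    # push background pixels away
    return F.binary_cross_entropy_with_logits(logits, target_mask.float())
```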
- Improving Semantic Segmentation via Decoupled Body and Edge Supervision (arXiv, 2020-07-20)
Existing semantic segmentation approaches either aim to improve objects' inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
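The body/edge decomposition described in this last entry can be pictured with a toy sketch. The paper learns the decoupling; the average-pooling split and the boundary heuristic below are stand-in assumptions, not its method.

```python
# Toy illustration of decoupled body/edge supervision; average pooling
# here is a stand-in for the learned low-frequency "body" part.
import torch
import torch.nn.functional as F

def body_edge_losses(logits, label, ignore_index=255):
    """logits: (B, C, H, W) segmentation logits
    label:  (B, H, W)    ground-truth class indices
    """
    # low-frequency "body": blurred logits; high-frequency "edge": residual
    body = F.avg_pool2d(logits, kernel_size=3, stride=1, padding=1)
    edge = logits - body
    body_loss = F.cross_entropy(body, label, ignore_index=ignore_index)
    # supervise the edge part only near boundaries, found where the label
    # differs from a shifted copy of itself
    boundary = (label != torch.roll(label, shifts=1, dims=-1))
    boundary &= (label != ignore_index)
    edge_loss = F.cross_entropy(
        edge.permute(0, 2, 3, 1)[boundary],   # (N, C) boundary pixels
        label[boundary])
    return body_loss, edge_loss
```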