Exploring Open-Vocabulary Semantic Segmentation without Human Labels
- URL: http://arxiv.org/abs/2306.00450v1
- Date: Thu, 1 Jun 2023 08:47:06 GMT
- Title: Exploring Open-Vocabulary Semantic Segmentation without Human Labels
- Authors: Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan,
Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana
- Abstract summary: We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg works by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
- Score: 76.15862573035565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic segmentation is a crucial task in computer vision that involves
segmenting images into semantically meaningful regions at the pixel level.
However, existing approaches often rely on expensive human annotations as
supervision for model training, limiting their scalability to large, unlabeled
datasets. To address this challenge, we present ZeroSeg, a novel method that
leverages the existing pretrained vision-language (VL) model (e.g. CLIP) to
train open-vocabulary zero-shot semantic segmentation models. Although these VL
models have acquired extensive knowledge of visual concepts, it is non-trivial
to transfer that knowledge to the task of semantic segmentation, as they are
usually trained at the image level. ZeroSeg overcomes this by distilling the visual
concepts learned by VL models into a set of segment tokens, each summarizing a
localized region of the target image. We evaluate ZeroSeg on multiple popular
segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO,
in a zero-shot manner (i.e., no training or adaptation on target segmentation
datasets). Our approach achieves state-of-the-art performance when compared to
other zero-shot segmentation methods under the same training data, while also
performing competitively compared to strongly supervised methods. Finally, we
also demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation
through both human studies and qualitative visualizations.
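The abstract's core mechanism is distilling image-level VL knowledge into segment tokens that each summarize a localized image region. The sketch below illustrates one plausible form of that idea in PyTorch: learnable segment tokens attend over frozen patch features from a pretrained VL vision encoder, and a cosine distillation loss pulls each token toward a VL embedding of its region. All module names, dimensions, and the loss form are assumptions for illustration, not the ZeroSeg implementation.

```python
# Hypothetical sketch of segment-token distillation: learnable segment tokens
# attend over frozen VL patch features, and each token is pulled toward a VL
# embedding of the region it summarizes. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentTokenDistiller(nn.Module):
    def __init__(self, num_tokens=8, dim=512):
        super().__init__()
        # Learnable segment tokens, each intended to summarize a localized region.
        self.segment_tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats):
        # patch_feats: (B, N, dim) frozen patch features from a pretrained VL vision encoder.
        B = patch_feats.size(0)
        queries = self.segment_tokens.unsqueeze(0).expand(B, -1, -1)
        seg_feats, attn = self.cross_attn(queries, patch_feats, patch_feats)
        return seg_feats, attn  # (B, K, dim), (B, K, N)

def distillation_loss(seg_feats, teacher_feats):
    # Cosine-distance stand-in loss aligning each segment token with a teacher
    # embedding of its region (e.g. a VL embedding of an image crop).
    s = F.normalize(seg_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

if __name__ == "__main__":
    B, N, K, D = 2, 196, 8, 512
    patch_feats = torch.randn(B, N, D)    # stand-in for frozen VL patch features
    teacher_feats = torch.randn(B, K, D)  # stand-in for VL region embeddings
    model = SegmentTokenDistiller(num_tokens=K, dim=D)
    seg_feats, _ = model(patch_feats)
    print(distillation_loss(seg_feats, teacher_feats).item())
```

At inference, segment tokens produced this way could be compared against text embeddings of category names to assign open-vocabulary labels to regions; the exact matching procedure is not specified here.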
Related papers
- A Simple Framework for Open-Vocabulary Zero-Shot Segmentation [36.01531912271202]
SimZSS is a framework for open-vocabulary zero-shot segmentation.
It exploits the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions.
SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
arXiv Detail & Related papers (2024-06-23T11:57:08Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward them.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of target semantic categories.
We construct artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens (see the sketch after this list).
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
arXiv Detail & Related papers (2023-03-25T08:19:31Z)
- Learning Hierarchical Image Segmentation For Recognition and By Recognition [39.712584686731574]
We propose to integrate a hierarchical segmenter into the recognition process, and to train and adapt the entire model solely on image-level recognition objectives.
We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition.
Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images and 1B masks) by an absolute 8% in mIoU on PartImageNet object segmentation.
arXiv Detail & Related papers (2022-10-01T16:31:44Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding [0.0]
We propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet).
In experiments, our SM-VCENet outperforms the zero-shot semantic segmentation state of the art by a relative margin.
arXiv Detail & Related papers (2021-11-30T07:39:19Z)
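The IFSeg entry above describes building artificial training data from a random 2D map of semantic categories and a parallel map of their word tokens. The following is a minimal sketch of that construction under assumed category names, tokenizer ids, and grid size; it is not the authors' implementation.

```python
# Hypothetical illustration of image-free training-data construction: a 2D grid
# of random semantic category ids paired with a grid of their word-token ids.
# Categories, token ids, and grid size are made up for illustration.
import numpy as np

categories = ["cat", "dog", "sky"]                 # assumed target categories
token_ids = {"cat": 101, "dog": 202, "sky": 303}   # hypothetical tokenizer ids

H, W = 8, 8
rng = np.random.default_rng(0)

# 2D map of random semantic categories (indices into `categories`).
category_map = rng.integers(0, len(categories), size=(H, W))

# Parallel map of the corresponding word-token ids.
word_token_map = np.vectorize(lambda c: token_ids[categories[c]])(category_map)

print(category_map)
print(word_token_map)
```

A segmentation model can then be trained to predict the category map from the token map (or from features derived from it), which is how such an image-free setup avoids dense human annotations.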
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.