Diffusion Model is Secretly a Training-free Open Vocabulary Semantic
Segmenter
- URL: http://arxiv.org/abs/2309.02773v3
- Date: Mon, 22 Jan 2024 07:18:55 GMT
- Title: Diffusion Model is Secretly a Training-free Open Vocabulary Semantic
Segmenter
- Authors: Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu,
Lu Sheng, Dong Xu
- Abstract summary: Generative text-to-image diffusion models are highly efficient open-vocabulary semantic segmenters.
We introduce a novel training-free approach named DiffSegmenter, built on the insight that to generate realistic objects that are semantically faithful to the input text, diffusion models must implicitly learn complete object shapes and their semantics.
Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
- Score: 47.29967666846132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained text-image discriminative models, such as CLIP, have been
explored for open-vocabulary semantic segmentation with unsatisfactory results
due to the loss of crucial localization information and awareness of object
shapes. Recently, there has been a growing interest in expanding the
application of generative models from generation tasks to semantic
segmentation. These approaches utilize generative models either for generating
annotated data or extracting features to facilitate semantic segmentation. This
typically involves generating a considerable amount of synthetic data or
requiring additional mask annotations. To this end, we uncover the potential of
generative text-to-image diffusion models (e.g., Stable Diffusion) as highly
efficient open-vocabulary semantic segmenters, and introduce a novel
training-free approach named DiffSegmenter. The insight is that to generate
realistic objects that are semantically faithful to the input text, both the
complete object shapes and the corresponding semantics are implicitly learned
by diffusion models. We discover that the object shapes are characterized by
the self-attention maps while the semantics are indicated through the
cross-attention maps produced by the denoising U-Net, forming the basis of our
segmentation results. Additionally, we carefully design effective textual
prompts and a category filtering mechanism to further enhance the segmentation
results. Extensive experiments on three benchmark datasets show that the
proposed DiffSegmenter achieves impressive results for open-vocabulary semantic
segmentation.
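The core mechanism the abstract describes, using cross-attention maps as coarse per-class score maps and self-attention maps to propagate those scores across pixels of the same object, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the tensor shapes and the `refine_segmentation` function are hypothetical stand-ins for attention maps extracted from a denoising U-Net.

```python
import numpy as np

def refine_segmentation(cross_attn, self_attn):
    """Sketch of the DiffSegmenter insight (hypothetical shapes):
    cross_attn: (H*W, K) attention from each spatial location to K text tokens,
                indicating per-pixel class semantics.
    self_attn:  (H*W, H*W) pixel-to-pixel affinities, characterizing object shape.
    Multiplying the two lets each pixel aggregate class scores from pixels
    it is strongly related to, completing the object mask."""
    # Row-normalize the affinities so each pixel takes a weighted average.
    affinity = self_attn / self_attn.sum(axis=1, keepdims=True)
    refined = affinity @ cross_attn          # (H*W, K) smoothed score maps
    labels = refined.argmax(axis=1)          # per-pixel class assignment
    return refined, labels

# Toy example with random maps (8x8 feature grid, 3 candidate categories).
rng = np.random.default_rng(0)
hw, k = 64, 3
cross = rng.random((hw, k))
self_a = rng.random((hw, hw))
scores, labels = refine_segmentation(cross, self_a)
```

In the paper's setting the maps would come from the U-Net's attention layers during a denoising step on the real image, and the resulting score maps would be reshaped to the spatial grid and upsampled to image resolution.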
Related papers
- Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, becoming a crucial element in assessing generalist segmentation models.
Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework.
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
arXiv Detail & Related papers (2024-10-03T10:33:49Z)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pre-training achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Semantic Segmentation by Semantic Proportions [6.171990546748665]
We propose a novel approach for semantic segmentation that requires only rough information about the proportions of individual semantic classes.
This greatly simplifies the data annotation process and thus will significantly reduce the annotation time, cost and storage space.
arXiv Detail & Related papers (2023-05-24T22:51:52Z)
- Open-vocabulary Object Segmentation with Diffusion Models [47.36233857830832]
The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model, in the form of a segmentation map.
We adopt the augmented diffusion model to build a synthetic semantic segmentation dataset, and show that training a standard segmentation model on such a dataset achieves competitive performance on the zero-shot segmentation (ZS3) benchmark.
arXiv Detail & Related papers (2023-01-12T18:59:08Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Label-Efficient Semantic Segmentation with Diffusion Models [27.01899943738203]
We demonstrate that diffusion models can also serve as an instrument for semantic segmentation.
In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process.
We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem.
arXiv Detail & Related papers (2021-12-06T15:55:30Z)
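The recipe summarized in this last entry, treating intermediate U-Net activations as per-pixel representations, hinges on one mechanical step: upsampling activation maps of different resolutions to the image grid and concatenating them into per-pixel feature vectors, on which a lightweight classifier is then trained. A hypothetical NumPy sketch of that step (shapes and the `pixel_features` helper are illustrative, not from the paper; the classifier itself is omitted):

```python
import numpy as np

def pixel_features(activations, out_hw):
    """Nearest-neighbor upsample a stack of activation maps to image
    resolution and concatenate channels into per-pixel feature vectors.
    activations: list of (c_i, h_i, w_i) arrays; out_hw: (H, W).
    Returns an (H*W, sum(c_i)) feature matrix."""
    H, W = out_hw
    feats = []
    for a in activations:
        c, h, w = a.shape
        rows = np.arange(H) * h // H          # source row per output row
        cols = np.arange(W) * w // W          # source col per output col
        up = a[:, rows][:, :, cols]           # (c, H, W) nearest-neighbor upsample
        feats.append(up)
    stacked = np.concatenate(feats, axis=0)   # (sum(c_i), H, W)
    return stacked.reshape(stacked.shape[0], H * W).T

# Toy example: two activation maps at different resolutions, 32x32 output.
rng = np.random.default_rng(1)
acts = [rng.random((4, 8, 8)), rng.random((8, 16, 16))]
X = pixel_features(acts, (32, 32))            # one 12-dim vector per pixel
```

In the paper's setup the activations come from the networks performing the reverse-diffusion Markov step, and a small per-pixel classifier is trained on a handful of labeled images.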
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.