Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
- URL: http://arxiv.org/abs/2211.13224v2
- Date: Wed, 21 Jun 2023 12:35:16 GMT
- Title: Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
- Authors: Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
- Abstract summary: Peekaboo is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique.
We show how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was only trained on RGB images.
- Score: 40.959642112729234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, text-to-image diffusion models have shown remarkable capabilities
in creating realistic images from natural language prompts. However, few works
have explored using these models for semantic localization or grounding. In
this work, we explore how an off-the-shelf text-to-image diffusion model,
trained without exposure to localization information, can ground various
semantic phrases without segmentation-specific re-training. We introduce an
inference time optimization process capable of generating segmentation masks
conditioned on natural language prompts. Our proposal, Peekaboo, is a
first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding
technique leveraging diffusion models without any training. We evaluate
Peekaboo on the Pascal VOC dataset for unsupervised semantic segmentation and
the RefCOCO dataset for referring segmentation, showing promising results. We
also demonstrate how Peekaboo can be used to generate images with
transparency, even though the underlying diffusion model was only trained on
RGB images - which, to our knowledge, is the first such attempt. Please see
our project page, including our code:
https://ryanndagreat.github.io/peekaboo
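To make the inference-time optimization concrete, here is a minimal PyTorch sketch of the idea: a per-pixel alpha mask is optimized so that the alpha-composited image scores well under a loss computed with a frozen, text-conditioned diffusion model. The `diffusion_loss` callable, the grey background, and the hyperparameters are illustrative assumptions for this sketch, not Peekaboo's exact objective or compositing scheme (those are given in the paper and the code linked above).

```python
# Minimal sketch of Peekaboo-style inference-time mask optimization.
# `diffusion_loss` is a hypothetical stand-in for a loss computed with a
# frozen text-to-image diffusion model conditioned on the target phrase.
import torch

def optimize_mask(image, diffusion_loss, steps=200, lr=1e-2):
    """image: (3, H, W) tensor in [0, 1].
    diffusion_loss: callable mapping a composited (3, H, W) image to a scalar."""
    logits = torch.zeros(1, image.shape[1], image.shape[2], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        alpha = torch.sigmoid(logits)                    # soft mask in [0, 1]
        composite = alpha * image + (1.0 - alpha) * 0.5  # composite onto grey
        loss = diffusion_loss(composite)                 # prompt-conditioned loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.sigmoid(logits) > 0.5).float()         # binary segmentation mask
```

The soft alpha produced during optimization can also be attached to the image as an alpha channel, which is roughly how the transparency-generation use case mentioned in the abstract can be realized.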
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Freestyle Layout-to-Image Synthesis [42.64485133926378]
In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics onto a given layout.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
arXiv Detail & Related papers (2023-03-25T09:37:41Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that the proposed VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
arXiv Detail & Related papers (2022-10-10T17:55:41Z)
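As a rough illustration of the cross-attention aggregation idea in the DAAM entry above, the sketch below upscales per-layer/head attention maps for a single text token to a common resolution and averages them into a heat map. Collecting the maps from the denoising network, the bicubic upscaling, and the 0.4 threshold are assumptions of this sketch, not DAAM's exact recipe (which also aggregates over denoising time steps).

```python
# Illustrative aggregation of cross-attention maps for one text token.
# Gathering `attn_maps` from the denoising U-Net is assumed to happen elsewhere.
import torch
import torch.nn.functional as F

def aggregate_token_attention(attn_maps, out_size=(512, 512)):
    """attn_maps: list of (h, w) float tensors (one per layer/head, mixed sizes).
    Returns a normalized (H, W) heat map for the token."""
    upscaled = [
        F.interpolate(m[None, None], size=out_size, mode="bicubic",
                      align_corners=False)[0, 0]
        for m in attn_maps
    ]
    heat = torch.stack(upscaled).mean(dim=0)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]
    return heat

# Usage: threshold the heat map to obtain a word-level segmentation mask.
# mask = (aggregate_token_attention(maps) > 0.4).float()
```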
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.