Panoptic Diffusion Models: co-generation of images and segmentation maps
- URL: http://arxiv.org/abs/2412.02929v1
- Date: Wed, 04 Dec 2024 00:42:15 GMT
- Title: Panoptic Diffusion Models: co-generation of images and segmentation maps
- Authors: Yinghan Long, Kaushik Roy
- Abstract summary: We present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process.
- Score: 7.573297026523597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.
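To make the co-generation idea concrete, the following is a minimal PyTorch sketch (not the authors' released code) of jointly denoising an image latent and a panoptic-map latent in a single sampling loop. `JointDenoiser`, the latent shapes, and the DDIM-style update are illustrative assumptions standing in for PDM's diffusion transformer and fast diffusion solver.

```python
# Hedged sketch of image / panoptic-map co-generation with a shared denoiser.
# `JointDenoiser` is a hypothetical toy network, not PDM's actual architecture.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy stand-in for a unified transformer: one backbone, two output heads."""
    def __init__(self, img_ch=4, map_ch=4, hidden=64):
        super().__init__()
        self.backbone = nn.Conv2d(img_ch + map_ch, hidden, 3, padding=1)
        self.img_head = nn.Conv2d(hidden, img_ch, 3, padding=1)
        self.map_head = nn.Conv2d(hidden, map_ch, 3, padding=1)

    def forward(self, x_img, x_map, t):
        # Timestep conditioning omitted for brevity.
        h = torch.relu(self.backbone(torch.cat([x_img, x_map], dim=1)))
        return self.img_head(h), self.map_head(h)

@torch.no_grad()
def co_generate(model, steps=50, shape=(1, 4, 32, 32)):
    """Both latents start as noise and are denoised together, so the evolving
    map latent acts as built-in layout guidance for the image latent."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    x_img, x_map = torch.randn(shape), torch.randn(shape)
    for t in reversed(range(steps)):
        eps_img, eps_map = model(x_img, x_map, t)
        a_t = alphas_cum[t]
        a_prev = alphas_cum[t - 1] if t > 0 else torch.tensor(1.0)
        # Predict clean latents, then re-noise to the previous timestep (DDIM, eta=0).
        x0_img = (x_img - (1 - a_t).sqrt() * eps_img) / a_t.sqrt()
        x0_map = (x_map - (1 - a_t).sqrt() * eps_map) / a_t.sqrt()
        x_img = a_prev.sqrt() * x0_img + (1 - a_prev).sqrt() * eps_img
        x_map = a_prev.sqrt() * x0_map + (1 - a_prev).sqrt() * eps_map
    # In practice: decode the image latent with a VAE and argmax the map channels.
    return x_img, x_map

img_latent, map_latent = co_generate(JointDenoiser())
```

Because the two latents share one backbone, the intermediate map prediction shapes the image prediction at every step, which is one way to read the "built-in guidance" described in the abstract.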
Related papers
- Generating Intermediate Representations for Compositional Text-To-Image Generation [16.757550214291015]
We propose a compositional approach for text-to-image generation based on two stages.
In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations conditioned on text.
In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model.
arXiv Detail & Related papers (2024-10-13T10:24:55Z) - Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation [41.341693150031546]
We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or map, into a photo-realistic face image.
We present a simple mapping and a style modulation network to link two models and convert meaningful representations in feature maps and attention maps into latent codes.
Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with inputs.
arXiv Detail & Related papers (2024-05-07T14:33:40Z) - ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer [13.956618446530559]
This paper proposes a zero-shot domain adaptation method based on diffusion models, called ZoDi.
First, we utilize an off-the-shelf diffusion model to synthesize target-like images by transferring the domain of source images to the target domain.
Second, we train the model using both source images and synthesized images with the original representations to learn domain-robust representations.
arXiv Detail & Related papers (2024-03-20T14:58:09Z) - On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt.
Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z) - EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z) - EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion [60.30030562932703]
EpiDiff is a localized interactive multiview diffusion model.
It generates 16 multiview images in just 12 seconds.
It surpasses previous methods in quality evaluation metrics.
arXiv Detail & Related papers (2023-12-11T05:20:52Z) - R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
The image prior model is trained separately to map text embeddings to CLIP image embeddings.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis [38.22195812238951]
We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process.
arXiv Detail & Related papers (2023-04-28T00:14:28Z) - MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z) - A Structure-Guided Diffusion Model for Large-Hole Image Completion [85.61681358977266]
We develop a structure-guided diffusion model to fill large holes in images.
Our method achieves a superior or comparable visual quality compared to state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-18T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.