Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
- URL: http://arxiv.org/abs/2501.09194v2
- Date: Mon, 10 Feb 2025 18:54:23 GMT
- Title: Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
- Authors: Ahmad Süleyman, Göksel Biricik
- Abstract summary: Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions.
We propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information.
- Abstract: Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
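The grounding mechanism described in the abstract lends itself to a compact illustration. Below is a minimal, hedged sketch of the GLIGEN-style injection that ObjectDiffusion builds on: each (bounding box, phrase) pair is fused into a grounding token via a Fourier embedding of the box coordinates, and the tokens enter the network through a gated self-attention layer whose gate starts at zero, in the spirit of ControlNet's zero-initialized conditioning. All module names, dimensions, and the fusion MLP are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of GLIGEN-style grounding tokens plus gated self-attention.
# Names, dimensions, and the fusion MLP are assumptions for illustration.
import torch
import torch.nn as nn

def fourier_embed(x: torch.Tensor, n_freqs: int = 8) -> torch.Tensor:
    """Map normalized box coordinates in [0, 1] to sin/cos features."""
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device)
    ang = x.unsqueeze(-1) * freqs * torch.pi           # (..., 4, n_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

class GroundingTokenizer(nn.Module):
    """Fuse a phrase embedding with a Fourier box embedding into one token."""
    def __init__(self, text_dim: int = 768, out_dim: int = 768, n_freqs: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 4 * 2 * n_freqs, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, boxes: torch.Tensor, phrase_emb: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) normalized xyxy; phrase_emb: (B, N, text_dim)
        return self.mlp(torch.cat([phrase_emb, fourier_embed(boxes)], dim=-1))

class GatedSelfAttention(nn.Module):
    """Visual tokens attend over [visual; grounding] tokens; a learned gate,
    initialized at zero, lets the grounding influence grow during fine-tuning."""
    def __init__(self, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # identity mapping at init

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        ctx = torch.cat([visual, grounding], dim=1)
        out, _ = self.attn(visual, ctx, ctx)
        return visual + torch.tanh(self.gate) * out

# Usage: two grounded objects in one image.
tok = GroundingTokenizer()
attn = GatedSelfAttention()
boxes = torch.tensor([[[0.1, 0.2, 0.5, 0.8], [0.6, 0.1, 0.9, 0.4]]])
phrases = torch.randn(1, 2, 768)   # stand-in for CLIP phrase embeddings
visual = torch.randn(1, 64, 768)   # stand-in for UNet feature tokens
out = attn(visual, tok(boxes, phrases))
print(out.shape)                   # torch.Size([1, 64, 768])
```

Because the gate starts at zero, the pretrained T2I backbone is unchanged at the start of fine-tuning, and the grounding signal is blended in only as the gate is learned.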
Related papers
- ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models [21.158266387658905]
This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes.
We first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions.
We then propose to control the diffusion model using synthesized visual robustness prompts with spatial constraints and object-wise textual descriptions.
arXiv Detail & Related papers (2024-05-24T04:10:34Z)
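A hedged sketch of the data-preparation idea in the ODGEN entry above: fine-tuning sees the entire image alongside each foreground object cropped by its bounding box, paired with an object-wise description. The function name and sample-dict layout are assumptions for illustration.

```python
# Hedged sketch: one whole-image sample plus one cropped sample per object,
# in the spirit of ODGEN's summary. Sample layout is an assumption.
from PIL import Image

def odgen_style_samples(image_path: str, annotations: list[dict]) -> list[dict]:
    """annotations: [{"bbox": (x0, y0, x1, y1), "caption": "..."}, ...]"""
    image = Image.open(image_path).convert("RGB")
    samples = [{"image": image,
                "caption": "; ".join(a["caption"] for a in annotations)}]
    for a in annotations:
        samples.append({"image": image.crop(a["bbox"]), "caption": a["caption"]})
    return samples

# Example: a scene with two annotated objects yields three training samples.
anns = [
    {"bbox": (10, 20, 120, 200), "caption": "a red apple"},
    {"bbox": (150, 40, 300, 220), "caption": "a wicker basket"},
]
# samples = odgen_style_samples("scene.jpg", anns)  # 1 full image + 2 crops
```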
- DODA: Diffusion for Object-detection Domain Adaptation in Agriculture [4.549305421261851]
We propose DODA, a data synthesizer that can generate high-quality object detection data for new domains in agriculture.
Specifically, we improve the controllability of layout-to-image generation by encoding the layout as an image, thereby improving the quality of the generated labels.
arXiv Detail & Related papers (2024-03-27T08:16:33Z)
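The layout-as-image idea in the DODA entry above can be sketched in a few lines: each bounding box is rasterized into a class-specific channel so a standard image encoder can consume the layout directly. The one-channel-per-class scheme is an assumption for illustration.

```python
# Hedged sketch: rasterize a box layout into an image-like tensor.
import numpy as np

def layout_to_image(boxes, labels, n_classes: int, size: int = 64) -> np.ndarray:
    """boxes: list of (x0, y0, x1, y1) normalized to [0, 1]."""
    canvas = np.zeros((n_classes, size, size), dtype=np.float32)
    for (x0, y0, x1, y1), cls in zip(boxes, labels):
        r0, r1 = int(y0 * size), max(int(y1 * size), int(y0 * size) + 1)
        c0, c1 = int(x0 * size), max(int(x1 * size), int(x0 * size) + 1)
        canvas[cls, r0:r1, c0:c1] = 1.0  # filled box marks the object region
    return canvas

layout = layout_to_image([(0.1, 0.2, 0.5, 0.8)], labels=[2], n_classes=5)
print(layout.shape, layout.sum() > 0)  # (5, 64, 64) True
```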
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
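A minimal sketch of a spatial-semantic map in the spirit of the SSMG entry above: a feature map derived from the layout in which every pixel inside a box carries that object's semantic embedding. Letting later boxes overwrite earlier ones on overlap is a simplifying assumption.

```python
# Hedged sketch of a layout-derived spatial-semantic feature map.
import torch

def spatial_semantic_map(boxes, embeddings: torch.Tensor, size: int = 32) -> torch.Tensor:
    """boxes: (N, 4) normalized xyxy; embeddings: (N, D) per-object semantics."""
    n, d = embeddings.shape
    fmap = torch.zeros(d, size, size)
    for i in range(n):
        x0, y0, x1, y1 = (boxes[i] * size).long().tolist()
        # Broadcast the object's embedding over its box region.
        fmap[:, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)] = embeddings[i].view(d, 1, 1)
    return fmap  # (D, H, W) map, ready to guide a diffusion UNet

fmap = spatial_semantic_map(torch.tensor([[0.25, 0.25, 0.75, 0.75]]), torch.randn(1, 768))
print(fmap.shape)  # torch.Size([768, 32, 32])
```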
- GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation [91.01581867841894]
We propose GeoDiffusion, a simple framework that can flexibly translate various geometric conditions into text prompts.
GeoDiffusion can encode not only bounding boxes but also extra geometric conditions, such as camera views in self-driving scenes.
arXiv Detail & Related papers (2023-06-07T17:17:58Z)
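The GeoDiffusion entry above hinges on serializing geometry into text; a hedged sketch follows. The `<loc_k>` token format and the bin count are assumptions; the paper defines its own discretization.

```python
# Hedged sketch: translate boxes into discretized location tokens in a prompt.
def boxes_to_prompt(caption: str, objects, n_bins: int = 100) -> str:
    """objects: [(label, (x0, y0, x1, y1) normalized), ...]"""
    parts = [caption]
    for label, box in objects:
        locs = "".join(f"<loc_{min(int(v * n_bins), n_bins - 1)}>" for v in box)
        parts.append(f"{label} {locs}")
    return "; ".join(parts)

prompt = boxes_to_prompt(
    "a street scene",
    [("car", (0.12, 0.50, 0.48, 0.90)), ("person", (0.60, 0.30, 0.72, 0.85))],
)
print(prompt)
# a street scene; car <loc_12><loc_50><loc_48><loc_90>; person <loc_60><loc_30><loc_72><loc_85>
```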
- T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities [69.16656086708291]
Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces.
We propose a new model comprising a view-wise sampling algorithm that focuses on local structure learning.
The model can be scaled to generate high-resolution data while unifying multiple modalities.
arXiv Detail & Related papers (2023-05-24T03:32:03Z)
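A speculative sketch of the field view of data underlying the DPF entry above: an image becomes a set of (coordinate, value) pairs, and view-wise sampling trains on a local coordinate subset rather than on all pixels at once. The window-based sampler is an assumption, not the paper's algorithm.

```python
# Hedged sketch: an image as a field of (coordinate, value) pairs,
# with a random local "view" sampled from it.
import torch

def image_as_field(img: torch.Tensor):
    """img: (C, H, W) -> coords (H*W, 2) in [0, 1], values (H*W, C)."""
    c, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([xs.flatten() / (w - 1), ys.flatten() / (h - 1)], dim=-1)
    return coords, img.flatten(1).t()

def sample_view(coords, values, window: float = 0.25):
    """Keep only pairs whose coordinates fall in a random local window."""
    corner = torch.rand(2) * (1 - window)
    mask = ((coords >= corner) & (coords <= corner + window)).all(dim=-1)
    return coords[mask], values[mask]

coords, values = image_as_field(torch.randn(3, 32, 32))
view_c, view_v = sample_view(coords, values)
print(coords.shape, view_c.shape)  # full field vs. local view
```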
"Mixing and matching scenes" (M&Ms) is an approach that consists of an adversarially trained generative image model conditioned on appearance features and spatial positions of the different elements in a collage.
We show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity.
arXiv Detail & Related papers (2023-04-26T17:58:39Z)
- LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation [46.567682868550285]
We propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works.
Specifically, we construct a structural image patch with region information and transform the patched image into a special layout that is fused with the normal layout in a unified form.
Our experiments show that LayoutDiffusion outperforms the previous SOTA methods, with relative improvements of 46.35% and 26.70% on FID and CAS on COCO-Stuff, and 44.29% and 41.82% on VG.
arXiv Detail & Related papers (2023-03-30T06:56:12Z)
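One hedged reading of the "unified form" in the LayoutDiffusion entry above: image patches are themselves treated as layout elements, each assigned its own region box, so patch tokens and object-layout tokens can be fused into a single sequence. Dimensions and fusion by concatenation are assumptions for illustration.

```python
# Hedged sketch: treat image patches as layout elements with region boxes,
# then fuse patch tokens and object tokens into one unified sequence.
import torch

def patches_as_layout(feat: torch.Tensor, grid: int = 4):
    """feat: (grid*grid, D) patch features -> (boxes, feats) like objects."""
    step = 1.0 / grid
    boxes = torch.tensor([
        [c * step, r * step, (c + 1) * step, (r + 1) * step]
        for r in range(grid) for c in range(grid)
    ])
    return boxes, feat

def unified_layout(obj_boxes, obj_feats, patch_feats, grid: int = 4):
    p_boxes, p_feats = patches_as_layout(patch_feats, grid)
    boxes = torch.cat([obj_boxes, p_boxes])   # (N + grid^2, 4)
    feats = torch.cat([obj_feats, p_feats])   # (N + grid^2, D)
    return torch.cat([boxes, feats], dim=-1)  # one token per element

tokens = unified_layout(
    torch.tensor([[0.1, 0.1, 0.4, 0.6]]), torch.randn(1, 64), torch.randn(16, 64)
)
print(tokens.shape)  # torch.Size([17, 68])
```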
- Multi-Projection Fusion and Refinement Network for Salient Object Detection in 360° Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect salient objects in 360° omnidirectional images.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z)
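A hedged sketch of the preprocessing the MPFR-Net entry above relies on: extracting perspective cube faces from an equirectangular panorama so the network can consume both projections. Nearest-neighbor sampling and the fixed 90-degree field of view are simplifications.

```python
# Hedged sketch: sample one cube face from an equirectangular panorama.
import numpy as np

def cube_face(equi: np.ndarray, yaw_deg: float, face_size: int = 64) -> np.ndarray:
    """equi: (H, W, C) panorama; returns one 90-deg-FOV face at the given yaw."""
    h, w = equi.shape[:2]
    u = np.linspace(-1.0, 1.0, face_size)
    xs, ys = np.meshgrid(u, u)                 # image-plane grid
    yaw = np.radians(yaw_deg)
    # Rays through a unit-distance plane, rotated about the vertical axis.
    x = xs * np.cos(yaw) + np.sin(yaw)
    z = -xs * np.sin(yaw) + np.cos(yaw)
    y = ys
    lon = np.arctan2(x, z)                                # [-pi, pi]
    lat = np.arcsin(y / np.sqrt(x**2 + y**2 + z**2))      # [-pi/2, pi/2]
    cols = ((lon / np.pi + 1.0) / 2.0 * (w - 1)).astype(int)
    rows = ((0.5 - lat / np.pi) * (h - 1)).astype(int)
    return equi[rows, cols]                               # nearest-neighbor

pano = np.random.rand(128, 256, 3)
faces = [cube_face(pano, yaw) for yaw in (0, 90, 180, 270)]  # four side faces
print(faces[0].shape)  # (64, 64, 3)
```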
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.