Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image
Synthesis
- URL: http://arxiv.org/abs/2401.09048v1
- Date: Wed, 17 Jan 2024 08:30:47 GMT
- Title: Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image
Synthesis
- Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun
Jeong
- Abstract summary: We present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics.
Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner.
- Score: 12.490787443456636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing the limitations of text as a source of accurate layout
representation in text-conditional diffusion models, many works incorporate
additional signals to condition certain attributes within a generated image.
Although successful, previous works do not account for the specific
localization of said attributes extended into the three-dimensional plane. In
this context, we present a conditional diffusion model that integrates control
over three-dimensional object placement with disentangled representations of
global stylistic semantics from multiple exemplar images. Specifically, we
first introduce \textit{depth disentanglement training} to leverage the
relative depth of objects as an estimator, allowing the model to identify the
absolute positions of unseen objects through the use of synthetic image
triplets. We also introduce \textit{soft guidance}, a method for imposing
global semantics onto targeted regions without the use of any additional
localization cues. Our integrated framework, \textsc{Compose and Conquer
(CnC)}, unifies these techniques to localize multiple conditions in a
disentangled manner. We demonstrate that our approach allows perception of
objects at varying depths while offering a versatile framework for composing
localized objects with different global semantics. Code:
https://github.com/tomtom1103/compose-and-conquer/
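The soft guidance idea in the abstract lends itself to a small illustration: if an exemplar's global semantics enter the denoising network through cross-attention, confining the attention output to a target region localizes those semantics without any extra localization cues. The sketch below is a hypothetical simplification in plain NumPy (single-head attention, invented shapes, binary mask), not the paper's actual implementation:

```python
import numpy as np

def masked_cross_attention(q, k, v, region_mask):
    """Single-head cross-attention whose output is confined to a region.

    q:           (hw, d) spatial queries from the diffusion U-Net (assumed)
    k, v:        (t, d)  key/value tokens encoding an exemplar's semantics
    region_mask: (hw,)   binary mask, 1 where the semantics should apply
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (hw, t) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over tokens
    out = weights @ v                                  # (hw, d) attended values
    return out * region_mask[:, None]                  # zero outside the region
```

Outside the masked region the exemplar contributes nothing, so a different exemplar (or the plain text condition) can govern those positions; this mirrors, in spirit, how CnC composes multiple localized conditions.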
Related papers
- Rethinking Referring Object Removal [9.906943507715779]
We construct a dataset consisting of 136,495 referring expressions for 34,615 objects in 23,951 image pairs.
Each pair contains an image with referring expressions and the ground truth after elimination.
We propose an end-to-end syntax-aware hybrid mapping network with an encoding-decoding structure.
arXiv Detail & Related papers (2024-03-14T06:26:34Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - TopNet: Transformer-based Object Placement Network for Image Compositing [43.14411954867784]
Local clues in background images are important to determine the compatibility of placing objects with certain locations/scales.
We propose to learn the correlation between object features and all local background features with a transformer module.
Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass.
arXiv Detail & Related papers (2023-04-06T20:58:49Z) - Compositional 3D Scene Generation using Locally Conditioned Diffusion [49.5784841881488]
We introduce locally conditioned diffusion as an approach to compositional scene diffusion.
We demonstrate a score distillation sampling-based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
arXiv Detail & Related papers (2023-03-21T22:37:16Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - i3dLoc: Image-to-range Cross-domain Localization Robust to Inconsistent
Environmental Conditions [9.982307144353713]
We present a method for localizing a single camera with respect to a point cloud map in indoor and outdoor scenes.
Our method can match equirectangular images to the 3D range projections by extracting cross-domain symmetric place descriptors.
With a single trained model, i3dLoc can demonstrate reliable visual localization in random conditions.
arXiv Detail & Related papers (2021-05-27T00:13:11Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the low and high frequency of the image, respectively.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z) - Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis [194.1452124186117]
We propose a novel ECGAN for the challenging semantic image synthesis task.
Our ECGAN achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2020-03-31T01:23:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.