Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image
Synthesis
- URL: http://arxiv.org/abs/2401.09048v1
- Date: Wed, 17 Jan 2024 08:30:47 GMT
- Title: Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image
Synthesis
- Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun
Jeong
- Abstract summary: We present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics.
Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner.
- Score: 12.490787443456636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing the limitations of text as a source of accurate layout
representation in text-conditional diffusion models, many works incorporate
additional signals to condition certain attributes within a generated image.
Although successful, previous works do not account for the specific
localization of said attributes extended into the three dimensional plane. In
this context, we present a conditional diffusion model that integrates control
over three-dimensional object placement with disentangled representations of
global stylistic semantics from multiple exemplar images. Specifically, we
first introduce \textit{depth disentanglement training} to leverage the
relative depth of objects as an estimator, allowing the model to identify the
absolute positions of unseen objects through the use of synthetic image
triplets. We also introduce \textit{soft guidance}, a method for imposing
global semantics onto targeted regions without the use of any additional
localization cues. Our integrated framework, \textsc{Compose and Conquer
(CnC)}, unifies these techniques to localize multiple conditions in a
disentangled manner. We demonstrate that our approach allows perception of
objects at varying depths while offering a versatile framework for composing
localized objects with different global semantics. Code:
https://github.com/tomtom1103/compose-and-conquer/
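To make the soft-guidance idea above concrete, here is a minimal PyTorch sketch: global semantics from an exemplar are injected through cross-attention, and a spatial mask confines their influence to a target region without any additional localization cue. All names, tensor shapes, and the masking scheme are illustrative assumptions rather than the authors' implementation (see the repository above for the actual code).

```python
import torch

def masked_cross_attention(latent_tokens, exemplar_tokens, region_mask, num_heads=8):
    """Cross-attention in which exemplar (global-semantic) tokens only
    influence latent positions inside a target region.

    latent_tokens:   (B, HW, C) flattened U-Net feature map (queries)
    exemplar_tokens: (B, N, C)  embedding tokens of an exemplar image (keys/values)
    region_mask:     (B, HW)    1.0 inside the target region, 0.0 elsewhere
    """
    B, HW, C = latent_tokens.shape
    d = C // num_heads

    q = latent_tokens.view(B, HW, num_heads, d).transpose(1, 2)    # (B, h, HW, d)
    k = exemplar_tokens.view(B, -1, num_heads, d).transpose(1, 2)  # (B, h, N, d)
    v = k                                                          # toy: values = keys

    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, h, HW, N)
    out = (attn @ v).transpose(1, 2).reshape(B, HW, C)                # (B, HW, C)

    # Outside the region the exemplar contributes nothing, so its global
    # semantics are localized without any extra spatial conditioning signal.
    return out * region_mask.unsqueeze(-1)


latent = torch.randn(1, 64 * 64, 256)   # 64x64 latent grid, 256 channels
exemplar = torch.randn(1, 77, 256)      # exemplar embedding tokens
mask = torch.zeros(1, 64 * 64)
mask[:, : 32 * 64] = 1.0                # target region: top half of the image
print(masked_cross_attention(latent, exemplar, mask).shape)  # torch.Size([1, 4096, 256])
```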
Related papers
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
- Bi-directional Contextual Attention for 3D Dense Captioning [38.022425401910894]
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene.
Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object.
We introduce BiCA, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with Bi-directional Contextual Attention.
arXiv Detail & Related papers (2024-08-13T06:25:54Z)
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
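As a rough illustration of the adapter idea in this entry, the sketch below fits a linear map between two embedding spaces from paired embeddings of the same inputs; the dimensions, the toy data, and the least-squares fit are assumptions made for illustration, not the FUSE method itself.

```python
import torch

# Toy stand-ins for embeddings of the same strings under two different models /
# tokenizers; in practice these would come from the models themselves.
emb_model_a = torch.randn(1000, 512)   # source embedding space
emb_model_b = torch.randn(1000, 768)   # target embedding space

# Fit a linear adapter W so that emb_model_a @ W approximates emb_model_b.
W = torch.linalg.lstsq(emb_model_a, emb_model_b).solution   # (512, 768)

# Map a new source-space embedding into the target space.
new_a = torch.randn(1, 512)
mapped = new_a @ W
print(mapped.shape)   # torch.Size([1, 768])
```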
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- TopNet: Transformer-based Object Placement Network for Image Compositing [43.14411954867784]
Local clues in background images are important to determine the compatibility of placing objects with certain locations/scales.
We propose to learn the correlation between object features and all local background features with a transformer module.
Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass.
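A toy sketch of that one-pass formulation follows, with a plain dot-product correlation standing in for the paper's transformer module; all class names, shapes, and the per-scale projection are assumptions.

```python
import torch
import torch.nn as nn

class PlacementHeatmap(nn.Module):
    """Score every (scale, y, x) placement of an object in one forward pass
    by correlating the object feature with dense background features."""

    def __init__(self, dim=256, num_scales=8):
        super().__init__()
        # One projection of the object feature per candidate scale (assumption).
        self.scale_proj = nn.Linear(dim, dim * num_scales)
        self.num_scales = num_scales

    def forward(self, bg_feats, obj_feat):
        # bg_feats: (B, C, H, W) local background features
        # obj_feat: (B, C)       global object feature
        B, C, H, W = bg_feats.shape
        queries = self.scale_proj(obj_feat).view(B, self.num_scales, C)  # (B, S, C)
        bg = bg_feats.flatten(2)                                         # (B, C, HW)
        heat = torch.einsum("bsc,bch->bsh", queries, bg)                 # (B, S, HW)
        return heat.view(B, self.num_scales, H, W)                       # 3D heatmap


bg = torch.randn(2, 256, 32, 32)
obj = torch.randn(2, 256)
print(PlacementHeatmap()(bg, obj).shape)   # torch.Size([2, 8, 32, 32])
```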
arXiv Detail & Related papers (2023-04-06T20:58:49Z)
- Compositional 3D Scene Generation using Locally Conditioned Diffusion [49.5784841881488]
We introduce locally conditioned diffusion as an approach to compositional scene diffusion.
We demonstrate a score distillation sampling-based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
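A rough sketch of the per-region conditioning idea, stitching together noise predictions from different prompts with spatial masks; the `unet` callable, argument order, and mask format are assumptions, and the paper additionally couples this with score distillation sampling for 3D synthesis.

```python
import torch

def locally_conditioned_eps(unet, x_t, t, prompt_embs, masks):
    """Stitch together noise predictions so each regional prompt only
    governs its own part of the latent.

    prompt_embs: list of per-region condition tensors
    masks:       list of (B, 1, H, W) binary masks that partition the latent
    """
    eps = torch.zeros_like(x_t)
    for cond, mask in zip(prompt_embs, masks):
        eps = eps + mask * unet(x_t, t, cond)   # each prompt only affects its region
    return eps


# Usage with a dummy denoiser standing in for a real text-conditioned U-Net.
dummy_unet = lambda x, t, c: torch.randn_like(x)
x = torch.randn(1, 4, 64, 64)
m_left = torch.zeros(1, 1, 64, 64)
m_left[..., :32] = 1.0                 # left half of the latent
m_right = 1.0 - m_left                 # right half of the latent
print(locally_conditioned_eps(dummy_unet, x, 10, [None, None], [m_left, m_right]).shape)
```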
arXiv Detail & Related papers (2023-03-21T22:37:16Z)
- Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
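A minimal sketch of the body/edge decoupling idea, using an average-pool blur as a stand-in for the paper's learned decoupling module; the function name, kernel size, and blur choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def body_edge_decouple(feat, kernel_size=7):
    """Split a feature map into a smooth 'body' part (its low-frequency
    content) and a residual 'edge' part (its high-frequency content) so the
    two can be supervised separately."""
    body = F.avg_pool2d(feat, kernel_size, stride=1, padding=kernel_size // 2)
    edge = feat - body
    return body, edge


feat = torch.randn(2, 19, 128, 128)    # e.g. 19-class segmentation logits
body, edge = body_edge_decouple(feat)
print(body.shape, edge.shape)          # both torch.Size([2, 19, 128, 128])
```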
arXiv Detail & Related papers (2020-07-20T12:11:22Z)