BIFRÖST: 3D-Aware Image Compositing with Language Instructions
- URL: http://arxiv.org/abs/2410.19079v2
- Date: Mon, 28 Oct 2024 18:40:19 GMT
- Title: BIFRÖST: 3D-Aware Image Compositing with Language Instructions
- Authors: Lingxiao Li, Kaixiong Gong, Weihong Li, Xili Dai, Tao Chen, Xiaojun Yuan, Xiangyu Yue
- Abstract summary: Bifröst is a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition.
Bifröst addresses the limitations of purely 2D compositing by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
- Score: 27.484947109237964
- Abstract: This paper introduces Bifröst, a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level and fall short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning an MLLM on a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then uniquely designed to process multiple types of input features, enabling it to perform high-fidelity compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.
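As a concrete reading of this two-stage design, a minimal sketch is given below; it assumes a generic MLLM and diffusion interface, and every identifier in it is our own placeholder rather than the authors' code.

```python
# Hypothetical sketch of Bifröst's two-stage pipeline as described in the
# abstract; all names (Location25D, predict_location, composite, the stub
# model interfaces, the reply format) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Location25D:
    """A 2.5D location: a 2D bounding box plus a relative depth value."""
    bbox: tuple   # (x1, y1, x2, y2), normalized image coordinates
    depth: float  # relative depth in [0, 1]; smaller = closer to camera

def predict_location(mllm, background, instruction):
    """Stage 1: a fine-tuned MLLM maps a background image and a language
    instruction to a 2.5D location (where, and at what depth, to place
    the object). We assume the MLLM answers with a parseable string like
    'bbox=0.2,0.5,0.4,0.9 depth=0.7'."""
    reply = mllm.generate(image=background, prompt=instruction)
    fields = dict(kv.split("=") for kv in reply.split())
    x1, y1, x2, y2 = map(float, fields["bbox"].split(","))
    return Location25D(bbox=(x1, y1, x2, y2), depth=float(fields["depth"]))

def composite(diffusion, depth_model, background, object_img, loc):
    """Stage 2: a diffusion compositing model conditioned on the object,
    the background, and a depth map, so occlusion and depth blur can be
    resolved during generation rather than by 2D pasting."""
    depth_map = depth_model(background)  # assumed monocular depth estimator
    return diffusion.sample(background=background, reference=object_img,
                            region=loc.bbox, object_depth=loc.depth,
                            depth=depth_map)
```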
Related papers
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- Semantic Score Distillation Sampling for Compositional Text-to-3D Generation [28.88237230872795]
Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research.
We introduce a novel SDS approach designed to improve the expressiveness and accuracy of compositional text-to-3D generation.
Our approach integrates new semantic embeddings that maintain consistency across different rendering views.
By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models.
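For reference, methods in this family build on the standard Score Distillation Sampling gradient from DreamFusion; the paper's semantic variant additionally injects view-consistent semantic embeddings into the conditioning, which is not reproduced here:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\, \frac{\partial x}{\partial \theta} \right]
```

where $\theta$ parameterizes the 3D representation, $x$ is a rendered view, $x_t$ its noised version at timestep $t$, $y$ the text condition, and $\hat{\epsilon}_\phi$ the frozen diffusion model's noise prediction.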
arXiv Detail & Related papers (2024-10-11T17:26:00Z)
- Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
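One plausible implementation of such a mechanism, sketched below under our own assumptions (the paper's exact formulation is not reproduced), is to additively boost the pre-softmax cross-attention scores from pixels inside a concept's layout region toward that concept's text tokens:

```python
# Illustrative sketch only: all names are ours, not the paper's API.
import torch

def refocus_attention(attn_logits: torch.Tensor, token_ids: list,
                      region_mask: torch.Tensor, boost: float = 2.0) -> torch.Tensor:
    """attn_logits: (num_pixels, num_tokens) pre-softmax cross-attention
    scores; region_mask: (num_pixels,) bool mask of the concept's layout
    region; token_ids: indices of the text tokens naming that concept."""
    logits = attn_logits.clone()
    rows = region_mask.nonzero(as_tuple=True)[0]            # pixels in region
    cols = torch.as_tensor(token_ids, device=logits.device)
    logits[rows[:, None], cols[None, :]] += boost           # up-weight concept tokens
    return logits.softmax(dim=-1)                           # refocused attention
```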
arXiv Detail & Related papers (2024-04-28T04:05:10Z)
- X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation [61.48050470095969]
X-Dreamer is a novel approach for high-quality text-to-3D content creation.
It bridges the gap between text-to-2D and text-to-3D synthesis.
arXiv Detail & Related papers (2023-11-30T07:23:00Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address the quality issues of score-distillation-based text-to-3D generation.
Our approach uses image-to-image pipelines, powered by latent diffusion models (LDMs), to generate posed, high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
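Read literally, that setup corresponds to an ordinary GAN objective with the real/fake roles filled in from the 3D pipeline. A minimal sketch follows, with our own naming and a plain non-saturating loss assumed (the paper may use a different variant):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, synthesized, renderings):
    """Synthesized multi-view images are treated as real data; renderings
    of the 3D model under optimization are treated as fake data."""
    real_logits = disc(synthesized)
    fake_logits = disc(renderings.detach())  # no gradient into the 3D model here
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(disc, renderings):
    """The 3D model is updated so its renderings are classified as real."""
    logits = disc(renderings)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```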
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z)
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method that represents the self-occlusions of a 3D foreground object as a 2D self-occlusion map.
We show that our representation map allows us not only to enhance image quality but also to model temporally coherent, complex shadow effects.
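As a rough illustration of the idea only (our reading; the paper's exact formulation and camera model are not reproduced), a per-pixel self-occlusion value can be obtained by marching camera-space rays toward each visible surface point and accumulating the occupancy encountered on the way:

```python
import numpy as np

def self_occlusion_map(occupancy, rays, depth, num_steps=64):
    """occupancy: callable mapping (N, 3) camera-space points to densities
    in [0, 1]; rays: (H, W, 3) per-pixel unit ray directions; depth: (H, W)
    depth of the visible surface along each ray. Returns an (H, W) map of
    how strongly each visible surface point is occluded by the object itself."""
    H, W, _ = rays.shape
    ts = np.linspace(0.0, 1.0, num_steps, endpoint=False)   # (S,)
    # Sample points along each ray, stopping just short of the surface.
    dists = ts[:, None] * (depth[..., None, None] - 1e-3)   # (H, W, S, 1)
    pts = rays[..., None, :] * dists                        # (H, W, S, 3)
    occ = occupancy(pts.reshape(-1, 3)).reshape(H, W, num_steps)
    # Crude transmittance-style accumulation over thin slabs.
    alpha = np.clip(occ * (depth[..., None] / num_steps), 0.0, 1.0)
    return 1.0 - np.prod(1.0 - alpha, axis=-1)
```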
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
- Translational Symmetry-Aware Facade Parsing for 3D Building Reconstruction [11.263458202880038]
In this paper, we present a novel translational-symmetry-based approach to improving deep neural networks for facade parsing.
We propose a novel scheme that fuses anchor-free detection into a single-stage network, enabling efficient training and better convergence.
We employ an off-the-shelf rendering engine such as Blender to reconstruct realistic, high-quality 3D models via procedural modeling.
arXiv Detail & Related papers (2021-06-02T03:10:51Z)