ControlDreamer: Blending Geometry and Style in Text-to-3D
- URL: http://arxiv.org/abs/2312.01129v3
- Date: Fri, 23 Aug 2024 01:05:36 GMT
- Title: ControlDreamer: Blending Geometry and Style in Text-to-3D
- Authors: Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon
- Abstract summary: We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on datasets from a carefully curated text corpus.
Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models.
- Score: 34.92628800597151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in blending geometries and styles in text-to-3D generation. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate research on diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by human evaluations and CLIP score metrics. Project page: https://controldreamer.github.io
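The abstract outlines a two-stage pipeline: a geometry stage followed by a style stage in which the depth-aware multi-view ControlNet steers the edit toward a new style prompt while depth conditioning preserves the source geometry. The sketch below illustrates only that control flow; the renderer, the score-distillation-style gradient, the prompts, and all names are placeholder assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the two-stage flow described in the abstract:
# stage 1 optimizes geometry from a source prompt, stage 2 re-styles it while a
# depth input (standing in for the depth-aware multi-view ControlNet) keeps the
# geometry fixed. Renderer, gradients, prompts, and names are placeholders only.
import torch

def render_views(scene_params, n_views=4, res=64):
    """Placeholder renderer: fake RGB and depth for n_views camera poses."""
    rgb = torch.sigmoid(scene_params).expand(n_views, 3, res, res)
    depth = rgb.mean(dim=1, keepdim=True)  # stand-in depth maps
    return rgb, depth

def guidance_grad(images, prompt, depth=None):
    """Placeholder for a score-distillation-style gradient from a multi-view
    diffusion prior; `depth` stands in for depth-aware ControlNet conditioning."""
    return torch.randn_like(images) * 1e-3  # fake guidance signal

scene = torch.zeros(1, 3, 64, 64, requires_grad=True)  # stand-in scene parameters
opt = torch.optim.Adam([scene], lr=1e-2)

# Stage 1: optimize geometry against the source prompt.
for _ in range(10):
    rgb, _ = render_views(scene)
    rgb.backward(guidance_grad(rgb, "a knight in full armor"))
    opt.step()
    opt.zero_grad()

# Stage 2: re-style with an edit prompt; detached depth maps carry the geometry.
for _ in range(10):
    rgb, depth = render_views(scene)
    rgb.backward(guidance_grad(rgb, "a knight made of stained glass",
                               depth=depth.detach()))
    opt.step()
    opt.zero_grad()
```

In the described pipeline, the second stage's depth conditioning is what lets the style prompt change appearance without drifting away from the geometry produced in the first stage.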
Related papers
- SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z) - Text-to-3D Shape Generation [18.76771062964711]
Computational systems that can perform text-to-3D shape generation have captivated the popular imagination.
We provide a survey of the underlying technology and methods enabling text-to-3D shape generation to summarize the background literature.
We then derive a systematic categorization of recent work on text-to-3D shape generation based on the type of supervision data required.
arXiv Detail & Related papers (2024-03-20T04:03:44Z) - 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior of the 2D diffusion model together with the global 3D information of the current scene.
Our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
A diffusion model with a 3D U-Net is then trained on these feature volumes for text-to-3D generation.
arXiv Detail & Related papers (2023-12-18T18:59:05Z) - Mono3DVG: 3D Visual Grounding in Monocular Images [12.191320182791483]
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information.
We build a large-scale dataset, Mono3DRefer, which contains 3D object targets with corresponding geometric text descriptions.
We propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings.
arXiv Detail & Related papers (2023-12-13T09:49:59Z) - LucidDreaming: Controllable Object-Centric 3D Generation [10.646855651524387]
We present a pipeline capable of spatial and numerical control over 3D generation from only textual prompt commands or 3D bounding boxes.
LucidDreaming achieves superior results in object placement precision and generation fidelity compared to current approaches.
arXiv Detail & Related papers (2023-11-30T18:55:23Z) - Control3D: Towards Controllable Text-to-3D Generation [107.81136630589263]
We present Control3D, a text-to-3D generation approach conditioned on an additional hand-drawn sketch.
A 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF.
We exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of images rendered from the synthetic 3D scene.
arXiv Detail & Related papers (2023-11-09T15:50:32Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize a NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark; a generic CLIP-score sketch for this kind of quantitative evaluation follows this list.
arXiv Detail & Related papers (2023-10-04T17:12:18Z)
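The ControlDreamer abstract above reports comparisons via human evaluations and CLIP score metrics, and the T$^3$Bench entry targets the same gap in quantitative evaluation. The snippet below is a generic, hedged sketch of a CLIP score averaged over rendered views, using the Hugging Face transformers CLIP model; the checkpoint name, the dummy renders, and the simple averaging are assumptions, not the protocol of either paper.

```python
# Generic CLIP-score sketch for rendered views of a 3D asset: average image-text
# cosine similarity over viewpoints. Not the exact protocol of ControlDreamer or
# T^3Bench; the checkpoint and inputs below are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(views, prompt):
    """Mean cosine similarity between the prompt and each rendered view."""
    inputs = processor(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Dummy gray images stand in for actual multi-view renders of a generated model.
views = [Image.new("RGB", (224, 224), (127, 127, 127)) for _ in range(4)]
print(clip_score(views, "a knight made of stained glass"))
```

In practice, evaluations like those cited above render many calibrated viewpoints and pair such scores with human preference studies.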