Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
- URL: http://arxiv.org/abs/2305.11588v2
- Date: Wed, 31 Jan 2024 10:15:08 GMT
- Title: Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
- Authors: Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao
- Abstract summary: We present Text2NeRF, which is able to generate a wide range of 3D scenes purely from a text prompt.
Our method requires no additional training data but only a natural language description of the scene as the input.
- Score: 29.907615852310204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven 3D scene generation is widely applicable to video gaming, film
industry, and metaverse applications that have a large demand for 3D scenes.
However, existing text-to-3D generation methods are limited to producing 3D
objects with simple geometries and dreamlike styles that lack realism. In this
work, we present Text2NeRF, which is able to generate a wide range of 3D scenes
with complicated geometric structures and high-fidelity textures purely from a
text prompt. To this end, we adopt NeRF as the 3D representation and leverage a
pre-trained text-to-image diffusion model to constrain the 3D reconstruction of
the NeRF to reflect the scene description. Specifically, we employ the
diffusion model to infer the text-related image as the content prior and use a
monocular depth estimation method to offer the geometric prior. Both content
and geometric priors are utilized to update the NeRF model. To guarantee
texture and geometry consistency between different views, we introduce a
progressive scene inpainting and updating strategy for novel view synthesis of
the scene. Our method requires no additional training data but only a natural
language description of the scene as the input. Extensive experiments
demonstrate that our Text2NeRF outperforms existing methods in producing
photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of
natural language prompts. Our code is available at
https://github.com/eckertzhang/Text2NeRF.
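The abstract describes a concrete loop: a diffusion model supplies the content prior (a text-related seed image), a monocular depth estimator supplies the geometric prior, and novel views are progressively inpainted and fed back into the NeRF. The Python sketch below illustrates only that control flow; all helpers (`t2i`, `estimate_depth`, `render`, `inpaint`, `update_nerf`) are hypothetical stand-ins, not the released Text2NeRF API.

```python
# Hedged sketch of the progressive inpaint-and-update loop described in the abstract.
# Every helper passed in here is a hypothetical stand-in for a component of the method.
from typing import Any, Callable, Sequence

import numpy as np


def progressive_scene_generation(
    prompt: str,
    poses: Sequence[np.ndarray],                  # camera poses; the first one is the seed view
    t2i: Callable[[str], np.ndarray],             # content prior: text -> RGB image (diffusion model)
    estimate_depth: Callable[[np.ndarray], np.ndarray],   # geometric prior: RGB -> per-pixel depth
    render: Callable[[Any, np.ndarray], tuple],   # NeRF + pose -> (rgb, mask of unobserved pixels)
    inpaint: Callable[[np.ndarray, np.ndarray, str], np.ndarray],  # image, mask, prompt -> completed image
    update_nerf: Callable[[Any, np.ndarray, np.ndarray, np.ndarray], Any],  # fit NeRF to (rgb, depth, pose)
    nerf: Any,
) -> Any:
    """Progressively inpaint and update a NeRF so novel views stay consistent."""
    # 1. Seed view: infer a text-related image and its depth, then fit the initial NeRF.
    seed_rgb = t2i(prompt)
    seed_depth = estimate_depth(seed_rgb)
    nerf = update_nerf(nerf, seed_rgb, seed_depth, poses[0])

    # 2. Progressive scene inpainting and updating for the remaining views.
    for pose in poses[1:]:
        rendered_rgb, missing_mask = render(nerf, pose)              # warp the current scene to the new view
        completed_rgb = inpaint(rendered_rgb, missing_mask, prompt)  # fill disoccluded regions
        completed_depth = estimate_depth(completed_rgb)              # geometric prior for the new content
        nerf = update_nerf(nerf, completed_rgb, completed_depth, pose)  # supervise with both priors
    return nerf
```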
Related papers
- SceneCraft: Layout-Guided 3D Scene Generation [29.713491313796084]
SceneCraft is a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences.
Our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality.
arXiv Detail & Related papers (2024-10-11T17:59:58Z)
- RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion [39.03289977892935]
RealmDreamer is a technique for generating general forward-facing 3D scenes from text descriptions.
Our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles.
arXiv Detail & Related papers (2024-04-10T17:57:41Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new content with higher quality by exploiting the natural image prior of 2D diffusion models and the global 3D information of the current scene.
Our approach supports a wide variety of scene types and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
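One way to picture "exploiting node and edge information" is to decompose the graph into per-object and per-relationship text prompts that a pretrained text-to-image diffusion model can score separately. The toy decomposition below is purely illustrative; the `SceneGraph` class and `graph_to_prompts` function are assumptions, not GraphDreamer's code.

```python
# Illustrative-only scene-graph decomposition: nodes become object prompts,
# edges become relationship prompts. Not GraphDreamer's actual implementation.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    nodes: dict[str, str] = field(default_factory=dict)              # node id -> object description
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (subject id, relation, object id)


def graph_to_prompts(graph: SceneGraph) -> tuple[list[str], list[str]]:
    """Turn a scene graph into per-object and per-relationship text prompts."""
    object_prompts = [f"a {desc}" for desc in graph.nodes.values()]
    relation_prompts = [
        f"a {graph.nodes[s]} {rel} a {graph.nodes[o]}" for s, rel, o in graph.edges
    ]
    return object_prompts, relation_prompts


if __name__ == "__main__":
    g = SceneGraph(
        nodes={"n1": "wooden desk", "n2": "laptop", "n3": "mug of coffee"},
        edges=[("n2", "on top of", "n1"), ("n3", "next to", "n2")],
    )
    objects, relations = graph_to_prompts(g)
    print(objects)    # ['a wooden desk', 'a laptop', 'a mug of coffee']
    print(relations)  # ['a laptop on top of a wooden desk', 'a mug of coffee next to a laptop']
```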
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- Control3D: Towards Controllable Text-to-3D Generation [107.81136630589263]
We present Control3D, a text-to-3D generation approach conditioned on an additional hand-drawn sketch.
A 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF.
We exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of images rendered from the synthetic 3D scene.
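A hedged illustration of that sketch-consistency idea: render the current scene, run the differentiable photo-to-sketch model on the rendering, and penalize disagreement with the hand-drawn sketch. `render_nerf` and `photo_to_sketch` are hypothetical callables, and the ControlNet-guided diffusion loss used alongside this term is omitted.

```python
# Hedged illustration of a sketch-consistency term in the spirit of Control3D.
# `render_nerf` and `photo_to_sketch` are hypothetical stand-ins for the paper's components.
import torch
import torch.nn.functional as F


def sketch_consistency_loss(
    render_nerf,                 # callable: camera pose -> (3, H, W) rendered image, differentiable
    photo_to_sketch,             # callable: (3, H, W) image -> (1, H, W) estimated sketch, differentiable
    user_sketch: torch.Tensor,   # (1, H, W) hand-drawn sketch aligned with `pose`
    pose: torch.Tensor,
) -> torch.Tensor:
    rendered = render_nerf(pose)                  # render the current 3D scene
    estimated_sketch = photo_to_sketch(rendered)  # differentiable photo-to-sketch estimate
    # Penalize disagreement between the estimated and the hand-drawn sketch.
    return F.mse_loss(estimated_sketch, user_sketch)
```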
arXiv Detail & Related papers (2023-11-09T15:50:32Z)
- Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images [105.92311979305065]
TG-3DFace creates more realistic and aesthetically pleasing 3D faces, boosting multi-view consistency (MVIC) by 9% over Latent3D.
The rendered face images generated by TG-3DFace achieve higher FID and CLIP scores than text-to-2D face/image generation models.
arXiv Detail & Related papers (2023-08-31T14:26:33Z)
- Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation [45.69270771487455]
We propose Fantasia3D, a new method for high-quality text-to-3D content creation.
Key to Fantasia3D is the disentangled modeling and learning of geometry and appearance.
Our framework is more compatible with popular graphics engines, supporting relighting, editing, and physical simulation of the generated 3D assets.
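One way to picture the disentanglement is as two independent networks, one for geometry and one for appearance, which can be optimized in separate stages. The generic PyTorch sketch below is an assumption for illustration only and does not reproduce Fantasia3D's DMTet geometry or BRDF appearance model.

```python
# Generic sketch of geometry/appearance disentanglement, not Fantasia3D's exact formulation:
# two independent MLPs that can be trained and edited separately.
import torch
import torch.nn as nn


class GeometryNet(nn.Module):
    """Maps a 3D point to a signed-distance-like scalar."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(xyz)


class AppearanceNet(nn.Module):
    """Maps a 3D point and a view direction to an RGB color."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([xyz, view_dir], dim=-1)))
```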
arXiv Detail & Related papers (2023-03-24T09:30:09Z)
- Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models [21.622420436349245]
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input.
We leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses.
In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model.
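Lifting the inpainted views into a shared 3D scene relies on unprojecting each completed image with its estimated depth. A small self-contained sketch of pinhole unprojection is shown below; the function name and camera conventions are illustrative assumptions, not Text2Room's code.

```python
# Toy depth unprojection used when lifting completed views into a shared 3D scene.
# Pinhole-camera assumption; `intrinsics` is a 3x3 K matrix, `cam_to_world` is 4x4.
import numpy as np


def unproject_depth(depth: np.ndarray, intrinsics: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Turn an (H, W) depth map into an (H*W, 3) point cloud in world coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))              # pixel grid
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pixels @ np.linalg.inv(intrinsics).T                 # camera-space ray directions (z = 1)
    points_cam = rays * depth.reshape(-1, 1)                    # scale rays by depth
    points_hom = np.concatenate([points_cam, np.ones((points_cam.shape[0], 1))], axis=1)
    return (points_hom @ cam_to_world.T)[:, :3]                 # transform to world coordinates
```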
arXiv Detail & Related papers (2023-03-21T16:21:02Z)
- Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
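Compared with a static NeRF, the main change is the query signature: density and color become functions of position and time. The minimal module below illustrates that (x, y, z, t) query only and is not MAV3D's architecture.

```python
# Minimal illustrative 4D (space + time) radiance field query; not MAV3D's architecture.
import torch
import torch.nn as nn


class DynamicRadianceField(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),    # input is (x, y, z, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),               # output is (density, r, g, b)
        )

    def forward(self, xyzt: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        out = self.mlp(xyzt)
        density = torch.relu(out[..., :1])      # non-negative volume density
        rgb = torch.sigmoid(out[..., 1:])       # colors in [0, 1]
        return density, rgb
```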
arXiv Detail & Related papers (2023-01-26T18:14:32Z)
- Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models [44.34479731617561]
We introduce explicit 3D shape priors into the CLIP-guided 3D optimization process.
We present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model.
Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy.
arXiv Detail & Related papers (2022-12-28T18:23:47Z)
- Zero-Shot Text-Guided Object Generation with Dream Fields [111.06026544180398]
We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects.
Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision.
In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
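The loop this describes optimizes a radiance field so that renders from sampled viewpoints score highly under a joint image-text model such as CLIP. The skeleton below is a hedged sketch; `render_view`, `image_encoder`, and the precomputed `text_embedding` are placeholders for whichever renderer and CLIP wrapper one uses.

```python
# Hedged skeleton of CLIP-style guidance for a text-conditioned radiance field.
# `render_view`, `image_encoder`, and `text_embedding` are placeholders, not a specific library's API.
import torch
import torch.nn.functional as F


def clip_guidance_step(
    optimizer: torch.optim.Optimizer,
    render_view,                   # callable: camera pose -> (1, 3, H, W) differentiable render
    image_encoder,                 # callable: images -> (1, D) embeddings (e.g., a CLIP image tower)
    text_embedding: torch.Tensor,  # (1, D) embedding of the caption, precomputed
    pose: torch.Tensor,
) -> float:
    optimizer.zero_grad()
    image = render_view(pose)          # render the field from a sampled viewpoint
    image_emb = image_encoder(image)   # embed the render
    # Maximize cosine similarity between the render and the caption.
    loss = -F.cosine_similarity(image_emb, text_embedding, dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```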
arXiv Detail & Related papers (2021-12-02T17:53:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.