Static and Animated 3D Scene Generation from Free-form Text Descriptions
- URL: http://arxiv.org/abs/2010.01549v2
- Date: Sat, 28 Nov 2020 19:28:30 GMT
- Title: Static and Animated 3D Scene Generation from Free-form Text Descriptions
- Authors: Faria Huq, Nafees Ahmed, Anindya Iqbal
- Abstract summary: We study a new pipeline that aims to generate static as well as animated 3D scenes from different types of free-form textual scene description.
In the first stage, we encode the free-form text using an encoder-decoder neural architecture.
In the second stage, we generate a 3D scene based on the generated encoding.
- Score: 1.102914654802229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating coherent and useful image/video scenes from a free-form textual
description is a technically difficult problem. Textual
description of the same scene can vary greatly from person to person, or
sometimes even for the same person from time to time. Because word choice and
syntax vary across descriptions, it is challenging for a system to reliably
produce a consistent, desirable output from different forms of language input.
Prior work on scene generation has mostly been confined to rigid sentence
structures of text input, which restricts users' freedom in writing
descriptions. In our work, we study a new pipeline
that aims to generate static as well as animated 3D scenes from different types
of free-form textual scene description without any major restriction. In
particular, to keep our study practical and tractable, we focus on a small
subspace of all possible 3D scenes, containing various combinations of cube,
cylinder and sphere. We design a two-stage pipeline. In the first stage, we
encode the free-form text using an encoder-decoder neural architecture. In the
second stage, we generate a 3D scene based on the generated encoding. Our
neural architecture uses a state-of-the-art language model as the encoder to
leverage rich contextual encoding and a new multi-head decoder to predict
multiple features of an object in the scene simultaneously. For our
experiments, we generate a large synthetic dataset containing 1,300,000 and
1,400,000 unique static and animated scene descriptions, respectively. We
achieve 98.427% accuracy on the test set in detecting 3D object features. Our
work shows a proof of concept of one approach to solving the problem, and we
believe that, with enough training data, the same pipeline can be extended to
an even broader set of 3D scene generation problems.
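To make the two-stage design more concrete, below is a minimal sketch of what the first stage could look like: a pretrained transformer encoder followed by several parallel heads, each predicting one attribute for each object slot. The abstract does not name the language model or the exact feature set, so BERT, the `shape`/`position`/`size` heads, and the fixed number of object slots are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): stage 1 of a text-to-scene pipeline.
# A pretrained language model encodes the free-form description, and parallel
# heads predict one attribute per object slot. BERT and the specific heads
# below are assumptions made for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadSceneDecoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", hidden=768, max_objects=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.max_objects = max_objects
        # One head per object attribute; names are hypothetical placeholders.
        self.shape_head = nn.Linear(hidden, max_objects * 4)      # cube / cylinder / sphere / none
        self.position_head = nn.Linear(hidden, max_objects * 3)   # x, y, z
        self.size_head = nn.Linear(hidden, max_objects * 1)       # uniform scale

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token embedding as a summary of the whole description.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        b = cls.size(0)
        return {
            "shape_logits": self.shape_head(cls).view(b, self.max_objects, 4),
            "position": self.position_head(cls).view(b, self.max_objects, 3),
            "size": self.size_head(cls).view(b, self.max_objects, 1),
        }

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiHeadSceneDecoder()
batch = tokenizer(["a red cube sits on top of a large sphere"],
                  return_tensors="pt", padding=True, truncation=True)
preds = model(batch["input_ids"], batch["attention_mask"])
print({k: v.shape for k, v in preds.items()})
```

In the second stage, the predicted attributes would be handed to a 3D engine (for example, Blender's `bpy` API) to instantiate the corresponding cube, cylinder, or sphere primitives at the predicted positions; the abstract does not specify which renderer the authors use.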
Related papers
- SceneCraft: Layout-Guided 3D Scene Generation [29.713491313796084]
SceneCraft is a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences.
Our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality.
arXiv Detail & Related papers (2024-10-11T17:59:58Z)
- 3D Vision and Language Pretraining with Large-Scale Synthetic Data [28.45763758308814]
3D Vision-Language Pre-training aims to provide a pre-trained model that bridges 3D scenes with natural language.
We construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels.
We propose synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift.
arXiv Detail & Related papers (2024-07-08T16:26:52Z)
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts [48.28000728061778]
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields [29.907615852310204]
We present Text2NeRF, which is able to generate a wide range of 3D scenes purely from a text prompt.
Our method requires no additional training data but only a natural language description of the scene as the input.
arXiv Detail & Related papers (2023-05-19T10:58:04Z)
- Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z)
- DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3D-aware generative model for high-quality and controllable scene synthesis.
It disentangles the whole scene into object-centric generative fields by learning on only 2D images with global-local discrimination.
We demonstrate state-of-the-art performance on many scene datasets, including the challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
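As a rough illustration of the open-vocabulary querying idea summarized above (not OpenScene's actual implementation), per-point features that live in CLIP space can be compared against CLIP text embeddings of arbitrary query labels, and each point assigned the label with the highest cosine similarity. The query labels and the random point features below are placeholders; a real system would produce the point features with a 3D network distilled into CLIP space.

```python
# Illustrative sketch only (not the OpenScene codebase): open-vocabulary
# labeling of 3D points whose features are assumed to already live in CLIP
# feature space. Query labels are free-form text.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a sofa", "a wooden table", "a window", "floor"]   # arbitrary queries
tokens = tokenizer(labels, padding=True, return_tensors="pt")
with torch.no_grad():
    text_feats = model.get_text_features(**tokens)           # (num_labels, 512)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Placeholder point features: random vectors stand in for the CLIP-aligned
# per-point features a trained 3D network would output.
point_feats = torch.randn(10_000, text_feats.shape[-1])
point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)

similarity = point_feats @ text_feats.T      # cosine similarity, points x labels
point_labels = similarity.argmax(dim=-1)     # best-matching label per point
print(point_labels.bincount(minlength=len(labels)))
```

Because the labels are plain text, the same trained point features can be queried with new categories at test time, which is what makes the approach zero-shot and task-agnostic.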