Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions
- URL: http://arxiv.org/abs/2306.06212v1
- Date: Fri, 9 Jun 2023 19:24:39 GMT
- Title: Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions
- Authors: Ian Huang, Vrishab Krishna, Omoruyi Atekha, Leonidas Guibas
- Abstract summary: We present a system to generate stylized assets for 3D scenes described by a short phrase.
It is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist.
- Score: 0.19116784879310023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What constitutes the "vibe" of a particular scene? What should one find in "a
busy, dirty city street", "an idyllic countryside", or "a crime scene in an
abandoned living room"? The translation from abstract scene descriptions to
stylized scene elements cannot be done with any generality by extant systems
trained on rigid and limited indoor datasets. In this paper, we propose to
leverage the knowledge captured by foundation models to accomplish this
translation. We present a system that can serve as a tool to generate stylized
assets for 3D scenes described by a short phrase, without the need to enumerate
the objects to be found within the scene or give instructions on their
appearance. Additionally, it is robust to open-world concepts in a way that
traditional methods trained on limited data are not, affording more creative
freedom to the 3D artist. Our system demonstrates this using a foundation model
"team" composed of a large language model, a vision-language model and several
image diffusion models, which communicate using an interpretable and
user-editable intermediate representation, thus allowing for more versatile and
controllable stylized asset generation for 3D artists. We introduce novel
metrics for this task, and show through human evaluations that in 91% of the
cases, our system outputs are judged more faithful to the semantics of the
input scene description than the baseline, thus highlighting the potential of
this approach to radically accelerate the 3D content creation process for 3D
artists.
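
The abstract describes the architecture only at a high level: a "team" of foundation models communicating through an interpretable, user-editable intermediate representation. As a rough illustration of that pattern only, the Python sketch below has an LLM stage propose assets as plain data, lets an artist edit that data, and then composes per-asset prompts for a diffusion stage. The schema and all names (AssetSpec, propose_assets, build_texture_prompts) are hypothetical stand-ins, not the paper's actual interfaces.

```python
# Hypothetical sketch only: the IR schema and function names below are
# illustrative stand-ins, not the paper's actual interfaces.
from dataclasses import dataclass
from typing import List

@dataclass
class AssetSpec:
    """User-editable intermediate representation for one scene asset."""
    name: str                 # e.g. "overturned armchair"
    style: str                # appearance notes, e.g. "dusty, torn upholstery"
    texture_prompt: str = ""  # prompt later sent to an image diffusion model

def propose_assets(scene_phrase: str) -> List[AssetSpec]:
    """Stand-in for the LLM stage that expands a short scene phrase
    into concrete, stylized asset descriptions."""
    return [AssetSpec(name="sofa", style="moth-eaten, faded floral print")]

def build_texture_prompts(assets: List[AssetSpec], scene_phrase: str) -> None:
    """Compose per-asset prompts for the diffusion stage. Because the IR
    is plain data, an artist can inspect and edit it before this step."""
    for a in assets:
        a.texture_prompt = f"{a.style} {a.name}, in {scene_phrase}"

phrase = "a crime scene in an abandoned living room"
ir = propose_assets(phrase)                # LLM stage (stubbed)
ir[0].style = "blood-stained, moth-eaten"  # the artist edits the IR
build_texture_prompts(ir, phrase)          # prompts for the diffusion stage
print(ir[0].texture_prompt)
```

Exposing the intermediate representation as plain, editable data is what makes the pipeline interpretable and controllable: the artist can intervene between the language and image stages rather than re-prompting the whole system.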
Related papers
- 3D Vision and Language Pretraining with Large-Scale Synthetic Data [28.45763758308814]
3D Vision-Language Pre-training aims to provide a pre-trained model that can bridge 3D scenes with natural language.
We construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels.
We propose synthetic-to-real domain adaptation during downstream fine-tuning to address the domain shift.
arXiv Detail & Related papers (2024-07-08T16:26:52Z)
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework for 3D scene understanding.
We propose a novel way to use a Large Visual Language Model (VLM) by actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is its novel visual prompts, which significantly unleash the VLM's ability to identify the most informative viewpoints.
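
To make the active-selection idea concrete, here is a minimal, hypothetical sketch: candidate viewpoints are rendered, a VLM scores how informative each view is for a question, and the top-scoring views are kept. render_view and vlm_informativeness are stand-ins; Agent3D-Zero's actual visual prompts and scoring are not reproduced here.

```python
# Hypothetical sketch of active viewpoint selection with a VLM; render_view
# and vlm_informativeness are stand-ins, not Agent3D-Zero's actual interface.
import random
from typing import List, Tuple

Viewpoint = Tuple[float, float]  # (azimuth, elevation) in degrees

def render_view(scene: object, vp: Viewpoint) -> bytes:
    """Stand-in for rendering the scene from a candidate viewpoint."""
    return b""  # a real system would return an image here

def vlm_informativeness(image: bytes, question: str) -> float:
    """Stand-in for asking a vision-language model how useful this
    view is for answering the question about the scene."""
    return random.random()

def select_views(scene: object, question: str, k: int = 3) -> List[Viewpoint]:
    """Score a grid of candidate viewpoints and keep the k best."""
    candidates = [(float(az), el)
                  for az in range(0, 360, 45) for el in (15.0, 45.0)]
    scored = [(vlm_informativeness(render_view(scene, vp), question), vp)
              for vp in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [vp for _, vp in scored[:k]]

print(select_views(scene=None, question="Where is the sofa?"))
```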
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
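
As a loose illustration of how node and edge information might be exploited, the sketch below flattens a scene graph into per-object and per-relation text prompts that could condition a pretrained text-to-image diffusion model. The graph schema and prompt templates are assumptions, not GraphDreamer's actual formulation.

```python
# Hypothetical sketch: the graph schema and prompt templates are assumptions,
# not GraphDreamer's actual formulation.
from typing import Dict, List, Tuple

def graph_to_prompts(nodes: Dict[str, str],
                     edges: List[Tuple[str, str, str]]) -> List[str]:
    """nodes: object id -> description; edges: (subject, relation, object)."""
    prompts = [f"a {desc}" for desc in nodes.values()]    # per-object prompts
    prompts += [f"a {nodes[s]} {rel} a {nodes[o]}"        # per-relation prompts
                for s, rel, o in edges]
    return prompts

nodes = {"cat": "ginger cat", "sofa": "worn leather sofa"}
edges = [("cat", "sleeping on", "sofa")]
for p in graph_to_prompts(nodes, edges):
    print(p)  # these prompts would condition the text-to-image diffusion model
```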
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3D-aware generative model for high-quality and controllable scene synthesis.
It disentangles the whole scene into object-centric generative fields by learning on only 2D images with global-local discrimination.
We demonstrate state-of-the-art performance on many scene datasets, including a challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
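
As a minimal sketch of what open-vocabulary querying in a joint feature space looks like, the snippet below compares per-point features against a text embedding by cosine similarity and keeps the matching points. The random features and the 0.1 threshold are placeholders; OpenScene's real per-point features come from a trained 3D model co-embedded with CLIP image and text features.

```python
# Placeholder features and threshold; OpenScene's real per-point features come
# from a trained 3D backbone co-embedded with CLIP image/text features.
import numpy as np

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(10_000, 512))  # one feature per 3D scene point
text_feat = rng.normal(size=512)              # stand-in for a CLIP text embedding

# Cosine similarity between every point feature and the text query.
point_feats /= np.linalg.norm(point_feats, axis=1, keepdims=True)
text_feat /= np.linalg.norm(text_feat)
sim = point_feats @ text_feat

mask = sim > 0.1  # points judged to match the open-vocabulary query
print(f"{mask.sum()} of {len(sim)} points match the query")
```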
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Static and Animated 3D Scene Generation from Free-form Text Descriptions [1.102914654802229]
We study a new pipeline that aims to generate static as well as animated 3D scenes from different types of free-form textual scene descriptions.
In the first stage, we encode the free-form text using an encoder-decoder neural architecture.
In the second stage, we generate a 3D scene based on the generated encoding.
arXiv Detail & Related papers (2020-10-04T11:31:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.