Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- URL: http://arxiv.org/abs/2305.15808v1
- Date: Thu, 25 May 2023 07:43:39 GMT
- Title: Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- Authors: Yiqi Lin, Hao Wu, Ruichen Wang, Haonan Lu, Xiaodong Lin, Hui Xiong,
Lin Wang
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
- Score: 20.151147653552155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating and editing a 3D scene guided by natural language poses a
challenge, primarily due to the complexity of specifying the positional
relations and volumetric changes within the 3D space. Recent advancements in
Large Language Models (LLMs) have demonstrated impressive reasoning,
conversational, and zero-shot generation abilities across various domains.
Surprisingly, these models also show great potential in realizing and
interpreting the 3D space. In light of this, we propose a novel language-guided
interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D
layout interpreter into the off-the-shelf layout-to-3D generative models,
allowing users to flexibly and interactively generate visual content.
Specifically, we design a versatile layout structure based on bounding boxes
and semantics to prompt the LLMs to model spatial generation and reasoning
from language. Our system also incorporates LLaVA, a large language and vision
assistant, to provide generative feedback from the visual aspect for improving
the visual quality of generated content. We validate the effectiveness of LI3D,
primarily in 3D generation and editing through multi-round interactions, which
can be flexibly extended to 2D generation and editing. Various experiments
demonstrate the potential benefits of incorporating LLMs in generative AI for
applications, e.g., metaverse. Moreover, we benchmark the layout reasoning
performance of LLMs with neural visual artist tasks, revealing their emergent
ability in the spatial layout domain.
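The abstract stays at a high level, so a minimal Python sketch of the described loop may help make it concrete: an LLM emits a bounding-box-plus-semantics layout, an off-the-shelf layout-to-3D model generates the scene, and LLaVA's critique of the result is fed back for revision. Every name below (LayoutBox, llm_propose_layout, layout_to_3d, llava_critique) is a hypothetical placeholder, not the paper's actual API.
```python
from dataclasses import dataclass

# Hypothetical layout element. The paper prompts the LLM to emit bounding
# boxes plus semantics; the exact fields here are an assumption.
@dataclass
class LayoutBox:
    label: str                           # semantic category, e.g. "sofa"
    center: tuple[float, float, float]   # box center (x, y, z)
    size: tuple[float, float, float]     # box extent (width, height, depth)

def llm_propose_layout(prompt: str, history: list[str]) -> list[LayoutBox]:
    """Placeholder: query an LLM with the user instruction plus dialogue
    history and parse a list of boxes from its reply."""
    raise NotImplementedError

def layout_to_3d(layout: list[LayoutBox]):
    """Placeholder: any off-the-shelf layout-to-3D generative model."""
    raise NotImplementedError

def llava_critique(rendering) -> str:
    """Placeholder: ask LLaVA to describe visual flaws in a rendering."""
    raise NotImplementedError

def li3d_turn(instruction: str, history: list[str], feedback_rounds: int = 1):
    """One interactive turn: propose a layout, generate a scene, then
    refine the layout using LLaVA's feedback on the rendered result."""
    layout = llm_propose_layout(instruction, history)
    scene = layout_to_3d(layout)
    for _ in range(feedback_rounds):
        feedback = llava_critique(scene)
        layout = llm_propose_layout(f"{instruction}\nFeedback: {feedback}", history)
        scene = layout_to_3d(layout)
    return scene
```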
Related papers
- SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
- Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models [57.30913211264333]
We present Story3D-Agent, a pioneering approach that transforms provided narratives into 3D-rendered visualizations.
By integrating procedural modeling, our approach enables precise control over multi-character actions and motions, as well as diverse decorative elements.
We have thoroughly evaluated our Story3D-Agent to validate its effectiveness, offering a basic framework to advance 3D story representation.
arXiv Detail & Related papers (2024-08-21T17:43:15Z)
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
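The VP-LLM summary above names patchification without detailing it. As a generic illustration only (the patch size and token ordering are assumptions, not VP-LLM's configuration), the sketch below shows the usual idea: split a voxel grid into fixed-size cubes and flatten each cube into one token that a linear layer could project into the LLM's embedding space.
```python
import numpy as np

def patchify_volume(volume: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split a cubic voxel grid (R, R, R) into non-overlapping patch**3
    cubes and flatten each into a vector, yielding one row per token.
    Returns an array of shape (num_patches, patch**3)."""
    r = volume.shape[0]
    assert volume.shape == (r, r, r) and r % patch == 0
    n = r // patch
    blocks = volume.reshape(n, patch, n, patch, n, patch)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)   # group the three cube axes
    return blocks.reshape(n ** 3, patch ** 3)     # tokens x flattened voxels

# Example: a 16^3 occupancy grid becomes 64 tokens of dimension 64.
tokens = patchify_volume(np.zeros((16, 16, 16)), patch=4)
print(tokens.shape)  # (64, 64)
```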
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM (Llama3).
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can also produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D vision-language pre-training method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
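The Multi-CLIP entry above names contrastive vision-language pre-training but not the objective. The sketch below shows the standard symmetric CLIP-style loss that such methods typically adopt, pairing 3D scene embeddings with their text embeddings; it is a generic illustration, not Multi-CLIP's exact loss.
```python
import numpy as np

def clip_style_contrastive_loss(feats_3d, feats_text, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between a batch of 3D scene
    embeddings and their paired text embeddings, both of shape (B, D).
    Matching pairs sit on the diagonal of the similarity matrix."""
    a = feats_3d / np.linalg.norm(feats_3d, axis=1, keepdims=True)
    b = feats_text / np.linalg.norm(feats_text, axis=1, keepdims=True)
    logits = a @ b.T / temperature   # (B, B) scaled cosine similarities
    labels = np.arange(len(a))
    # Cross-entropy in both directions (3D -> text and text -> 3D).
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_3d_to_text = -log_p[labels, labels].mean()
    loss_text_to_3d = -log_p_t[labels, labels].mean()
    return 0.5 * (loss_3d_to_text + loss_text_to_3d)
```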
- Visualization in the Era of Artificial Intelligence: Experiments for Creating Structural Visualizations by Prompting Large Language Models [0.0]
Large Language Models (LLMs) have revolutionized natural language processing by generating human-like text and images from textual input.
We report initial experiments showing that LLMs can generate 2D/3D visualizations that may be used for legal visualization.
arXiv Detail & Related papers (2023-05-05T09:16:59Z)
- LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
arXiv Detail & Related papers (2023-03-16T17:59:20Z)
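The LERF summary compresses one concrete mechanism: language embeddings are composited along rays with the standard NeRF volume-rendering weights. A minimal NumPy sketch of that accumulation follows, assuming per-sample densities and CLIP-like embeddings are already available; the multi-scale field and the embedding networks themselves are omitted.
```python
import numpy as np

def render_language_embedding(ts, densities, embeds):
    """Volume-render per-sample language embeddings along one ray.

    ts:        (N,) sample depths along the ray
    densities: (N,) volume densities sigma(t) at those samples
    embeds:    (N, D) language (e.g. CLIP-like) embeddings per sample
    Returns the (D,) rendered embedding using the standard NeRF weights
    w_i = T_i * (1 - exp(-sigma_i * delta_i)).
    """
    deltas = np.diff(ts, append=ts[-1] + 1e10)   # spacing between samples
    alphas = 1.0 - np.exp(-densities * deltas)   # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance
    weights = trans * alphas
    return (weights[:, None] * embeds).sum(axis=0)
```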