Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- URL: http://arxiv.org/abs/2305.15808v1
- Date: Thu, 25 May 2023 07:43:39 GMT
- Title: Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- Authors: Yiqi Lin, Hao Wu, Ruichen Wang, Haonan Lu, Xiaodong Lin, Hui Xiong,
Lin Wang
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
- Score: 20.151147653552155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating and editing a 3D scene guided by natural language poses a
challenge, primarily due to the complexity of specifying the positional
relations and volumetric changes within the 3D space. Recent advancements in
Large Language Models (LLMs) have demonstrated impressive reasoning,
conversational, and zero-shot generation abilities across various domains.
Surprisingly, these models also show great potential in understanding and
interpreting 3D space. In light of this, we propose a novel language-guided
interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D
layout interpreter into the off-the-shelf layout-to-3D generative models,
allowing users to flexibly and interactively generate visual content.
Specifically, we design a versatile layout structure based on bounding boxes
and semantics to prompt the LLMs to model the spatial generation and reasoning
from language. Our system also incorporates LLaVA, a large language and vision
assistant, to provide generative feedback from the visual aspect for improving
the visual quality of generated content. We validate the effectiveness of LI3D,
primarily in 3D generation and editing through multi-round interactions, which
can be flexibly extended to 2D generation and editing. Various experiments
demonstrate the potential benefits of incorporating LLMs in generative AI for
applications such as the metaverse. Moreover, we benchmark the layout reasoning
performance of LLMs with neural visual artist tasks, revealing their emergent
ability in the spatial layout domain.
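As a concrete illustration of the bounding-box-and-semantics layout structure described above, here is a minimal sketch in Python; the schema, field names, and prompt wording are illustrative assumptions, not the paper's actual specification.

```python
from dataclasses import dataclass, field
import json

@dataclass
class LayoutObject:
    """One scene element: a semantic label plus an axis-aligned bounding box."""
    label: str                            # e.g. "wooden chair"
    center: tuple[float, float, float]    # (x, y, z) box center in scene units
    size: tuple[float, float, float]      # (width, height, depth)

@dataclass
class Layout:
    """A scene layout the LLM is prompted to produce or edit as JSON."""
    objects: list[LayoutObject] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(
            [{"label": o.label, "center": o.center, "size": o.size}
             for o in self.objects],
            indent=2,
        )

def layout_prompt(instruction: str, current: Layout | None = None) -> str:
    """Wrap a user instruction in a prompt that asks the LLM for a layout."""
    prompt = (
        "You are a 3D layout interpreter. Reply only with a JSON list of "
        "objects, each with 'label', 'center' [x, y, z], and 'size' [w, h, d].\n"
        f"Instruction: {instruction}\n"
    )
    if current is not None:
        prompt += f"Current layout:\n{current.to_json()}\n"
    return prompt
```

For editing rather than generation, `layout_prompt("move the sofa against the north wall", current=scene)` would include the existing layout so the LLM can return a modified version of it.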
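The multi-round interaction with generative feedback could be organized along these lines; `llm_layout`, `llm_revise`, `layout_to_image`, and `vlm_critique` are hypothetical stand-ins for the LLM layout interpreter, the off-the-shelf layout-to-3D generator plus renderer, and the LLaVA-style critic.

```python
from typing import Callable

def li3d_style_loop(
    instruction: str,
    llm_layout: Callable[[str], str],           # language -> layout JSON
    llm_revise: Callable[[str, str], str],      # (layout, feedback) -> layout
    layout_to_image: Callable[[str], bytes],    # layout -> rendered view
    vlm_critique: Callable[[bytes, str], str],  # (image, request) -> feedback
    max_rounds: int = 3,
) -> str:
    """Iteratively refine a layout using visual feedback from a VLM critic."""
    layout = llm_layout(instruction)
    for _ in range(max_rounds):
        # Run the off-the-shelf layout-to-3D generator and render a view.
        image = layout_to_image(layout)
        # Ask the vision-language critic whether the render matches the request.
        feedback = vlm_critique(image, instruction)
        if feedback.strip().upper() == "OK":  # critic is satisfied; stop early
            break
        # Otherwise, feed the critique back to the LLM to repair the layout.
        layout = llm_revise(layout, feedback)
    return layout
```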
Related papers
- g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z)
- LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models [62.85566496673856]
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model.
A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format (a plain-text mesh encoding of this kind is sketched after this list).
arXiv Detail & Related papers (2024-11-14T17:08:23Z)
- SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single-forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on the lightweight, fine-tuned open-source LLM Llama3.
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
- Visualization in the Era of Artificial Intelligence: Experiments for Creating Structural Visualizations by Prompting Large Language Models [0.0]
Large Language Models (LLMs) have revolutionized natural language processing by generating human-like text and images from textual input.
We report initial experiments showing that LLMs can generate 2D/3D visualizations that may be used for legal visualization.
arXiv Detail & Related papers (2023-05-05T09:16:59Z)
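As an aside on the text-based format mentioned in the LLaMA-Mesh entry above, serializing a mesh into LLM-readable plain text can be sketched as follows; the OBJ-style encoding and the 64-bin coordinate quantization are assumptions for illustration and may not match the paper's exact tokenization.

```python
def mesh_to_text(vertices: list[tuple[float, float, float]],
                 faces: list[tuple[int, int, int]],
                 bins: int = 64) -> str:
    """Serialize a triangle mesh as OBJ-style plain text with quantized
    integer coordinates, so an LLM can read and emit it as ordinary tokens."""
    # Normalize coordinates into [0, 1] per axis, then snap to integer bins.
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    lines = []
    for v in vertices:
        q = [round((v[i] - mins[i]) / max(maxs[i] - mins[i], 1e-9) * (bins - 1))
             for i in range(3)]
        lines.append(f"v {q[0]} {q[1]} {q[2]}")
    for f in faces:
        # OBJ face indices are 1-based.
        lines.append(f"f {f[0] + 1} {f[1] + 1} {f[2] + 1}")
    return "\n".join(lines)

# Example: a single triangle becomes three "v" lines and one "f" line.
print(mesh_to_text([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                   [(0, 1, 2)]))
```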