Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- URL: http://arxiv.org/abs/2305.15808v1
- Date: Thu, 25 May 2023 07:43:39 GMT
- Title: Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback
- Authors: Yiqi Lin, Hao Wu, Ruichen Wang, Haonan Lu, Xiaodong Lin, Hui Xiong,
Lin Wang
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
- Score: 20.151147653552155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating and editing a 3D scene guided by natural language poses a
challenge, primarily due to the complexity of specifying the positional
relations and volumetric changes within the 3D space. Recent advancements in
Large Language Models (LLMs) have demonstrated impressive reasoning,
conversational, and zero-shot generation abilities across various domains.
Surprisingly, these models also show great potential in realizing and
interpreting the 3D space. In light of this, we propose a novel language-guided
interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D
layout interpreter into the off-the-shelf layout-to-3D generative models,
allowing users to flexibly and interactively generate visual content.
Specifically, we design a versatile layout structure based on the bounding boxes
and semantics to prompt the LLMs to model the spatial generation and reasoning
from language. Our system also incorporates LLaVA, a large language and vision
assistant, to provide generative feedback from the visual aspect for improving
the visual quality of generated content. We validate the effectiveness of LI3D,
primarily in 3D generation and editing through multi-round interactions, which
can be flexibly extended to 2D generation and editing. Various experiments
demonstrate the potential benefits of incorporating LLMs in generative AI for
applications, e.g., metaverse. Moreover, we benchmark the layout reasoning
performance of LLMs with neural visual artist tasks, revealing their emergent
ability in the spatial layout domain.
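The abstract does not give the actual layout schema or prompts used by LI3D. As a rough illustration of what "a layout structure based on bounding boxes and semantics" could look like, here is a minimal sketch. It assumes axis-aligned 3D boxes with a text label, serialized as JSON for the LLM. All names (`Box3D`, `overlaps`, `layout_to_prompt`, `parse_layout`) and the prompt wording are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box3D:
    """Hypothetical layout element: a semantic label plus an axis-aligned box."""
    label: str    # semantics, e.g. "table"
    center: Vec3  # (x, y, z) in scene units
    size: Vec3    # extent along x, y, z

    def bounds(self) -> Tuple[Vec3, Vec3]:
        """Return the (min_corner, max_corner) of the box."""
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi


def overlaps(a: Box3D, b: Box3D) -> bool:
    """True if the two boxes intersect with positive volume (touching faces do not count)."""
    (alo, ahi), (blo, bhi) = a.bounds(), b.bounds()
    return all(alo[i] < bhi[i] and blo[i] < ahi[i] for i in range(3))


def layout_to_prompt(boxes: List[Box3D], instruction: str) -> str:
    """Serialize the layout as JSON and wrap it in an assumed prompt template."""
    layout = json.dumps([asdict(b) for b in boxes])
    return (
        f"Current layout: {layout}\n"
        f"Instruction: {instruction}\n"
        "Return the updated layout as a JSON list."
    )


def parse_layout(reply: str) -> List[Box3D]:
    """Parse a JSON list of boxes (the format an LLM would be asked to return)."""
    return [
        Box3D(d["label"], tuple(d["center"]), tuple(d["size"]))
        for d in json.loads(reply)
    ]


if __name__ == "__main__":
    table = Box3D("table", (0.0, 0.0, 0.5), (2.0, 1.0, 1.0))
    lamp = Box3D("lamp", (0.0, 0.0, 1.25), (0.5, 0.5, 0.5))
    print(layout_to_prompt([table, lamp], "Move the lamp to the floor."))
    print("lamp intersects table:", overlaps(table, lamp))
```

A check such as `overlaps` is one simple way to reject physically implausible layouts in a reply before passing them to a layout-to-3D generator. The system described in the abstract instead relies on visual feedback from LLaVA, which this sketch does not model.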
Related papers
- VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification [56.211321810408194]
Large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks.
We present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single-forward pass.
Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
arXiv Detail & Related papers (2024-06-08T18:17:09Z)
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on Llama3, a lightweight, fine-tuned open-source LLM.
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning [24.162598399141785]
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments.
Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning.
arXiv Detail & Related papers (2024-03-18T01:18:48Z)
- Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models [71.2931570433261]
We introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate tasks of 3D perception, generation, and editing within point cloud scenes.
Uni3D-LLM harnesses the expressive power of natural language to allow for precise command over the generation and editing of 3D objects.
arXiv Detail & Related papers (2024-01-09T06:20:23Z)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- Visualization in the Era of Artificial Intelligence: Experiments for Creating Structural Visualizations by Prompting Large Language Models [0.0]
Large Language Models (LLMs) have revolutionized natural language processing by generating human-like text and images from textual input.
We report initial experiments showing that LLMs can generate 2D/3D visualizations that may be used for legal visualization.
arXiv Detail & Related papers (2023-05-05T09:16:59Z)
- LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can interactively extract 3D relevancy maps for a broad range of language prompts in real time.
arXiv Detail & Related papers (2023-03-16T17:59:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.