LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
- URL: http://arxiv.org/abs/2412.02193v1
- Date: Tue, 03 Dec 2024 06:15:04 GMT
- Title: LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
- Authors: Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu
- Abstract summary: Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instructions.
Large language models (LLMs) struggle to generate physically plausible 3D scenes and to adhere to input instructions.
We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models.
- Score: 57.92316645992816
- Abstract: Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instructions. Large language models (LLMs) struggle to generate physically plausible 3D scenes and to adhere to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and uses a self-consistent decoding process to improve the VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM- and constraint-based approaches, producing physically plausible 3D layouts that are better aligned with the semantic intent of the input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation, extracted from existing scene datasets, can improve performance.
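To make the "differentiable optimization" idea concrete, here is a minimal sketch assuming a PyTorch-style setup: object positions are free parameters, and a hypothetical pairwise-overlap penalty plus a VLM-proposed distance relation are minimized by gradient descent. The loss terms, the `relations` format, and all names are illustrative assumptions, not LayoutVLM's actual objectives.

```python
import torch

# Hypothetical sketch in the spirit of LayoutVLM's differentiable layout
# optimization: object poses are free parameters, optimized under
# physical-plausibility losses. The loss terms below are assumptions.

def overlap_loss(pos, radii):
    """Penalize pairwise bounding-circle overlap on the floor plane."""
    n = pos.shape[0]
    loss = pos.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            dist = torch.norm(pos[i] - pos[j])
            min_dist = radii[i] + radii[j]           # sum of radii
            loss = loss + torch.relu(min_dist - dist) ** 2
    return loss

def relation_loss(pos, relations):
    """VLM-proposed constraints, e.g. ('near', i, j, target_distance)."""
    loss = pos.new_zeros(())
    for _, i, j, target in relations:
        loss = loss + (torch.norm(pos[i] - pos[j]) - target) ** 2
    return loss

# Toy scene: three assets with radii, one 'near' relation proposed by the VLM.
pos = torch.randn(3, 2, requires_grad=True)          # (x, y) per asset
radii = torch.tensor([0.5, 0.4, 0.6])
relations = [("near", 0, 1, 0.2)]

opt = torch.optim.Adam([pos], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = overlap_loss(pos, radii) + relation_loss(pos, relations)
    loss.backward()
    opt.step()
```

Because every loss term is a differentiable function of the poses, gradients flow directly to the layout, which is what lets VLM-proposed constraints and physical plausibility be reconciled in one optimization.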
Related papers
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
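The one-line mechanism in that summary can be sketched as follows; the dimensions and the single linear projection are assumptions of this sketch, not AlignVLM's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the mechanism summarized above: each visual feature
# is mapped to a softmax-weighted average of the LLM's text embeddings, so
# visual tokens land inside the text latent space. Toy dimensions assumed.

vocab_size, text_dim, vis_dim = 1000, 256, 128
text_embeddings = torch.randn(vocab_size, text_dim)  # stands in for the LLM embedding table
proj = nn.Linear(vis_dim, vocab_size)                # visual feature -> vocabulary logits

visual_feats = torch.randn(16, vis_dim)              # 16 visual tokens from a vision encoder
weights = torch.softmax(proj(visual_feats), dim=-1)  # convex weights over the vocabulary
aligned = weights @ text_embeddings                  # (16, text_dim), inside the text space
```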
- g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z)
- EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing [114.14164860467227]
We propose EditRoom, a framework capable of executing a variety of layout edits through natural language commands.
Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes.
We have developed an automatic pipeline to augment existing 3D scene datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs.
arXiv Detail & Related papers (2024-10-03T17:42:24Z)
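One plausible shape for such an editing pair, sketched with hypothetical field names (the released EditRoom-DB schema may differ):

```python
# Hypothetical record structure for an EditRoom-DB-style editing pair; the
# field names here are assumptions for illustration, not the released schema.
editing_pair = {
    "source_scene": [{"asset": "sofa_01", "position": [1.0, 0.0, 2.5], "rotation": 90.0}],
    "command": "move the sofa next to the window",
    "target_scene": [{"asset": "sofa_01", "position": [0.2, 0.0, 3.8], "rotation": 90.0}],
}
```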
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM (Llama3).
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively, outperforming existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning [24.162598399141785]
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments.
Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning.
arXiv Detail & Related papers (2024-03-18T01:18:48Z)
- Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback [20.151147653552155]
Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language-and-vision assistant, to provide generative feedback from the visual side, improving the visual quality of the generated content.
arXiv Detail & Related papers (2023-05-25T07:43:39Z)
- LayoutGPT: Compositional Visual Planning and Generation with Large Language Models [98.81962282674151]
Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions.
We propose LayoutGPT, a method to compose in-context visual demonstrations in a style-sheet language.
arXiv Detail & Related papers (2023-05-24T17:56:16Z)
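The "style-sheet language" framing can be made concrete with a small sketch: a layout record serialized into CSS-like text for use as an in-context demonstration. The serialization format below is an assumption loosely modeled on that description, not LayoutGPT's exact format.

```python
# Sketch of serializing a layout into CSS-like text for in-context prompting,
# in the spirit of LayoutGPT. The exact format is an assumption of this sketch.

def layout_to_css(elements):
    lines = []
    for e in elements:
        lines.append(f"{e['category']} {{")
        for prop in ("left", "top", "width", "height"):
            lines.append(f"    {prop}: {e[prop]}px;")
        lines.append("}")
    return "\n".join(lines)

demo = [
    {"category": "bed", "left": 10, "top": 40, "width": 180, "height": 120},
    {"category": "nightstand", "left": 200, "top": 40, "width": 40, "height": 40},
]
print(layout_to_css(demo))
```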
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.