SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
- URL: http://arxiv.org/abs/2403.01248v1
- Date: Sat, 2 Mar 2024 16:16:26 GMT
- Title: SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
- Authors: Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi
- Abstract summary: SceneCraft is a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts.
SceneCraft renders complex scenes with up to a hundred 3D assets.
The spatial planning and arrangement challenges are tackled through a combination of advanced abstraction, strategic planning, and library learning.
- Score: 76.22337677728109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces SceneCraft, a Large Language Model (LLM) Agent
converting text descriptions into Blender-executable Python scripts which
render complex scenes with up to a hundred 3D assets. This process requires
complex spatial planning and arrangement. We tackle these challenges through a
combination of advanced abstraction, strategic planning, and library learning.
SceneCraft first models a scene graph as a blueprint, detailing the spatial
relationships among assets in the scene. SceneCraft then writes Python scripts
based on this graph, translating relationships into numerical constraints for
asset layout. Next, SceneCraft leverages the perceptual strengths of
vision-language foundation models like GPT-V to analyze rendered images and
iteratively refine the scene. On top of this process, SceneCraft features a
library learning mechanism that compiles common script functions into a
reusable library, facilitating continuous self-improvement without expensive
LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses
existing LLM-based agents in rendering complex scenes, as shown by its
adherence to constraints and favorable human assessments. We also showcase the
broader application potential of SceneCraft by reconstructing detailed 3D
scenes from the Sintel movie and guiding a video generative model with
generated scenes as an intermediary control signal.
Related papers
- SceneCraft: Layout-Guided 3D Scene Generation [29.713491313796084]
SceneCraft is a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences.
Our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality.
arXiv Detail & Related papers (2024-10-11T17:59:58Z)
- MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling [21.1274747033854]
Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes.
MIMO is a novel framework which can synthesize character videos with controllable attributes.
MIMO achieves advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes.
arXiv Detail & Related papers (2024-09-24T15:00:07Z)
- 3D scene generation from scene graphs and self-attention [51.49886604454926]
We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans.
We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene.
arXiv Detail & Related papers (2024-04-02T12:26:17Z)
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [7.707324214953882]
We introduce SceneScript, a method that produces full scene models as a sequence of structured language commands.
Our method infers the set of structured language commands directly from encoded visual data.
Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection.
arXiv Detail & Related papers (2024-03-19T18:01:29Z)
- Disentangled 3D Scene Generation with Layout Learning [109.03233745767062]
We introduce a method to generate 3D scenes that are disentangled into their component objects.
Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene.
We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects.
arXiv Detail & Related papers (2024-02-26T18:54:15Z)
- Blocks2World: Controlling Realistic Scenes with Editable Primitives [5.541644538483947]
We present Blocks2World, a novel method for 3D scene rendering and editing.
Our technique begins by extracting 3D parallelepipeds from various objects in a given scene using convex decomposition.
The next stage involves training a conditioned model that learns to generate images from the 2D-rendered convex primitives.
arXiv Detail & Related papers (2023-07-07T21:38:50Z)
- DORSal: Diffusion for Object-centric Representations of Scenes et al [28.181157214966493]
Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes.
We propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes.
arXiv Detail & Related papers (2023-06-13T18:32:35Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes [68.14127205949073]
We propose a novel Global-Local training framework for synthesizing a 3D scene using object proxies.
We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object.
Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation.
arXiv Detail & Related papers (2023-03-23T17:17:29Z)
- Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation [58.16911861917018]
We present a novel method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis.
Our model couples learnt scene-specific feature volumes with a scene agnostic neural rendering network.
We demonstrate various scene manipulations, including mixing scenes, deforming objects and inserting objects into scenes, while still producing photo-realistic results.
arXiv Detail & Related papers (2022-04-22T17:57:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.