Text-to-Scene with Large Reasoning Models
- URL: http://arxiv.org/abs/2509.26091v1
- Date: Tue, 30 Sep 2025 11:08:11 GMT
- Title: Text-to-Scene with Large Reasoning Models
- Authors: Frédéric Berdoz, Luca A. Lanzendörfer, Nick Tuninga, Roger Wattenhofer
- Abstract summary: Reason-3D is a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. It significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
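The pipeline the abstract describes, caption-based asset retrieval, constraint-based placement, and collision-aware refinement, can be sketched as follows. This is a minimal illustration only, not the released Reason-3D code: the `Asset` class, the keyword-overlap retrieval score, and the axis-aligned nudging heuristic are all hypothetical stand-ins for the LRM-driven components the paper actually uses.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    caption: str            # physical, functional, and contextual attributes
    size: tuple             # (width, depth) footprint
    pos: tuple = (0.0, 0.0)  # (x, y) centre position

def retrieve(assets, query_terms):
    """Pick the asset whose caption mentions the most query terms
    (a crude stand-in for LRM-based caption matching)."""
    return max(assets, key=lambda a: sum(t in a.caption for t in query_terms))

def overlaps(a, b):
    """Axis-aligned footprint overlap test between two placed assets."""
    ax, ay = a.pos
    bx, by = b.pos
    return (abs(ax - bx) < (a.size[0] + b.size[0]) / 2 and
            abs(ay - by) < (a.size[1] + b.size[1]) / 2)

def resolve_collisions(placed, step=0.1, max_iters=100):
    """Collision-aware refinement: nudge overlapping objects apart
    along the x axis until no footprints intersect."""
    for _ in range(max_iters):
        clash = False
        for i, a in enumerate(placed):
            for b in placed[i + 1:]:
                if overlaps(a, b):
                    b.pos = (b.pos[0] + step, b.pos[1])
                    clash = True
        if not clash:
            break
    return placed
```

In the actual system, the retrieval and placement steps would be driven by an LRM reasoning over captions and layout constraints rather than by keyword counting and a fixed nudge direction; the sketch only shows how the three stages compose.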
Related papers
- PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement [89.35154754765502]
PhyScensis is an agent-based framework powered by a physics engine to produce physically plausible scene configurations. Our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy.
arXiv Detail & Related papers (2026-02-16T17:55:25Z)
- Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training. OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z)
- Causal Reasoning Elicits Controllable 3D Scene Generation [35.22855710229319]
CausalStruct is a novel framework that embeds causal reasoning into 3D scene generation. We construct causal graphs where nodes represent objects and attributes, while edges encode causal dependencies and physical constraints. Our method uses text or images to guide object placement and layout in 3D scenes, with 3D Gaussian Splatting and Score Distillation Sampling improving shape accuracy and rendering stability.
arXiv Detail & Related papers (2025-09-18T01:03:21Z)
- SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state of the art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z)
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions [28.185661905201222]
Descrip3D is a novel framework that explicitly encodes the relationships between objects using natural language. It allows for unified reasoning across various tasks such as grounding, captioning, and question answering.
arXiv Detail & Related papers (2025-07-19T09:19:16Z)
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
- ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment [8.954070942391603]
ReSpace is a generative framework for text-driven 3D indoor scene synthesis and editing. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition.
arXiv Detail & Related papers (2025-06-03T05:22:04Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation [36.44409268300039]
Scenethesis is a framework that integrates text-based scene planning with vision-guided layout refinement. It generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
arXiv Detail & Related papers (2025-05-05T17:59:58Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.