FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors
- URL: http://arxiv.org/abs/2505.01322v2
- Date: Sun, 01 Jun 2025 07:44:11 GMT
- Title: FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors
- Authors: Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie
- Abstract summary: FreeInsert is a novel framework that disentangles object generation from spatial placement. It achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors.
- Score: 67.26107732326948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.
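The abstract outlines a multi-stage pipeline: an MLLM parses the instruction into structured semantics, foundation models generate a 3D-consistent object, MLLM spatial reasoning initializes pose and scale, a hierarchical refinement stage improves placement, and a final stage refines appearance. The sketch below illustrates how such disentangled stages might be wired together; every class and function name is a hypothetical placeholder inferred from the abstract, not the authors' released code.

```python
# Minimal, hypothetical sketch of the FreeInsert pipeline stages described in the
# abstract. All names below are illustrative placeholders; the actual system
# combines MLLMs, an LGM, and diffusion models.
from dataclasses import dataclass


@dataclass
class InsertionSemantics:
    object_type: str        # e.g. "table lamp"
    spatial_relation: str   # e.g. "on top of"
    attachment_region: str  # e.g. "nightstand surface"


@dataclass
class Pose:
    position: tuple   # (x, y, z) in scene coordinates
    rotation: tuple   # Euler angles
    scale: float


def parse_instruction(instruction: str) -> InsertionSemantics:
    """Stage 1 (assumed): an MLLM extracts structured semantics from the instruction."""
    return InsertionSemantics("table lamp", "on top of", "nightstand surface")


def generate_object(semantics: InsertionSemantics) -> dict:
    """Stage 2 (assumed): an LGM + diffusion model reconstruct a 3D-consistent object."""
    return {"gaussians": [], "type": semantics.object_type}  # placeholder 3D Gaussians


def initialize_pose(semantics: InsertionSemantics, scene) -> Pose:
    """Stage 3 (assumed): MLLM spatial reasoning proposes an initial pose and scale."""
    return Pose(position=(0.0, 0.0, 0.0), rotation=(0.0, 0.0, 0.0), scale=1.0)


def refine_placement(pose: Pose, semantics: InsertionSemantics, scene) -> Pose:
    """Stage 4 (assumed): hierarchical, spatially aware refinement of the placement."""
    return pose


def refine_appearance(obj: dict, scene) -> dict:
    """Stage 5 (assumed): appearance is improved using the inserted-object image."""
    return obj


def free_insert(instruction: str, scene):
    """End-to-end flow: object generation is disentangled from spatial placement,
    with no 2D masks or 3D bounding boxes required as spatial priors."""
    semantics = parse_instruction(instruction)
    obj = generate_object(semantics)
    pose = initialize_pose(semantics, scene)
    pose = refine_placement(pose, semantics, scene)
    obj = refine_appearance(obj, scene)
    return obj, pose


if __name__ == "__main__":
    print(free_insert("Put a table lamp on the nightstand", scene=None))
```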
Related papers
- Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning [27.872834485482276]
3D indoor scene synthesis is vital for embodied AI and digital content creation. Existing methods fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce Direct, a framework that directly generates numerical 3D layouts from text descriptions.
arXiv Detail & Related papers (2025-06-05T17:59:42Z)
- ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment [1.0918065824771606]
ReSpace is a generative framework for text-driven 3D indoor scene synthesis and editing. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition.
arXiv Detail & Related papers (2025-06-03T05:22:04Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors [52.63385546943866]
We present a text-to-scene generation method (namely, Layout2Scene) that uses an additional semantic layout as the prompt to inject precise control of 3D object positions. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model. Our method can generate more plausible and realistic scenes than state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-05T12:20:13Z)
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning. Voxelization infers per-object occupancy probabilities at individual spatial locations. Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM (Llama3).
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling [23.06464506261766]
We present DreamScape, a method for generating 3D scenes from text. We use a 3D Gaussian Guide that encodes semantic primitives, spatial transformations, and relationships from text using LLMs. DreamScape achieves state-of-the-art performance, enabling high-fidelity, controllable 3D scene generation.
arXiv Detail & Related papers (2024-04-14T12:13:07Z)
- 3DLatNav: Navigating Generative Latent Spaces for Semantic-Aware 3D Object Manipulation [2.8661021832561757]
3D generative models have been recently successful in generating realistic 3D objects in the form of point clouds.
Most models do not offer controllability to manipulate the shape semantics of component object parts without extensive semantic labels or other reference point clouds.
We propose 3DLatNav, a novel approach to navigating pretrained generative latent spaces that enables controlled part-level semantic manipulation of 3D objects.
arXiv Detail & Related papers (2022-11-17T18:47:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.