POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion
- URL: http://arxiv.org/abs/2601.14056v1
- Date: Tue, 20 Jan 2026 15:13:43 GMT
- Title: POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion
- Authors: Andrea Rigo, Luca Stornaiuolo, Weijie Wang, Mauro Martino, Bruno Lepri, Nicu Sebe
- Abstract summary: We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. We introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff). Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes.
- Score: 46.97254555348757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.
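The abstract describes compositing per-object latents, each denoised under its own text prompt, into the scene latent at masked regions derived from 3D bounding boxes (via Blended Latent Diffusion). The sketch below illustrates only that blending step in numpy; the function names, shapes, and the toy box-to-mask projection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Conceptual sketch of the mask-based blending step in Blended Latent
# Diffusion: at each denoising step, latents predicted under per-object
# prompts are composited into the scene latent inside their masks.
# The masks stand in for projections of 3D bounding boxes onto the
# latent grid; all names and shapes here are toy assumptions.

H = W = 8   # latent spatial resolution (toy size)
C = 4       # latent channels

rng = np.random.default_rng(0)

def project_box_to_mask(box, h=H, w=W):
    """Rasterize an axis-aligned footprint (y0, y1, x0, x1) of a
    projected 3D bounding box into a binary latent-space mask."""
    y0, y1, x0, x1 = box
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def blend_latents(background, per_object):
    """Composite per-object denoised latents into the background latent.
    `per_object` is a list of (latent, mask) pairs; later objects win
    where masks overlap."""
    out = background.copy()
    for latent, mask in per_object:
        m = mask[None, :, :]              # broadcast mask over channels
        out = m * latent + (1.0 - m) * out
    return out

# One toy step: a scene latent plus two "objects", each denoised under
# its own prompt (simulated here by random latents).
scene = rng.standard_normal((C, H, W)).astype(np.float32)
obj_a = rng.standard_normal((C, H, W)).astype(np.float32)
obj_b = rng.standard_normal((C, H, W)).astype(np.float32)

mask_a = project_box_to_mask((1, 4, 1, 4))
mask_b = project_box_to_mask((4, 7, 4, 7))

blended = blend_latents(scene, [(obj_a, mask_a), (obj_b, mask_b)])

# Inside each mask the object latent replaces the scene latent;
# outside all masks the scene latent is untouched.
assert np.allclose(blended[:, 2, 2], obj_a[:, 2, 2])
assert np.allclose(blended[:, 5, 5], obj_b[:, 5, 5])
assert np.allclose(blended[:, 0, 0], scene[:, 0, 0])
```

Repeating this blend at every denoising step is what lets each 3D box carry its own text description while the background is generated under the global scene prompt.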
Related papers
- Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing [63.141976759536625]
We propose Interp3D, a training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. For comprehensive evaluation, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results on fidelity, transition smoothness, and plausibility.
arXiv Detail & Related papers (2026-01-20T16:03:22Z)
- Dragging with Geometry: From Pixels to Geometry-Guided Image Editing [42.176957681367185]
We propose GeoDrag, a novel geometry-guided drag-based image editing method. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing.
arXiv Detail & Related papers (2025-09-30T03:53:11Z)
- FreeInsert: Personalized Object Insertion with Geometric and Style Control [26.088650452374726]
We propose a training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters.
arXiv Detail & Related papers (2025-09-25T05:26:10Z)
- Training-free Geometric Image Editing on Diffusion Models [53.38549950608886]
We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped. We propose a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine.
arXiv Detail & Related papers (2025-07-31T07:36:00Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- 3DOT: Texture Transfer for 3DGS Objects from a Single Reference Image [31.972069558992946]
3D texture swapping allows for the customization of 3D object textures. No dedicated method exists, but adapted 2D editing and text-driven 3D editing approaches can serve this purpose. We introduce 3DSwapping, a 3D texture swapping method that integrates progressive generation, view-consistency gradient guidance, and prompt-tuned gradient guidance.
arXiv Detail & Related papers (2025-03-24T16:31:52Z)
- Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information [4.956066467858058]
We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Our method demonstrates superior performance in rendering quality and view consistency compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-14T17:15:26Z)
- Diffusion-Based Attention Warping for Consistent 3D Scene Editing [55.2480439325792]
We present a novel method for 3D scene editing using diffusion models. Our approach leverages attention features extracted from a single reference image to define the intended edits. Injecting these warped features into other viewpoints enables coherent propagation of edits.
arXiv Detail & Related papers (2024-12-10T23:57:18Z)
- SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures.
arXiv Detail & Related papers (2024-06-25T09:17:35Z)
- Vox-E: Text-guided Voxel Editing of 3D Objects [14.88446525549421]
Large scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images.
We present a technique that harnesses the power of latent diffusion models for editing existing 3D objects.
arXiv Detail & Related papers (2023-03-21T17:36:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.