Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine
- URL: http://arxiv.org/abs/2511.13713v1
- Date: Mon, 17 Nov 2025 18:57:39 GMT
- Title: Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine
- Authors: Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao
- Abstract summary: We present FFSE, a 3D-aware framework designed to enable intuitive, physically-consistent object editing on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor.
- Score: 83.0145525456509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
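As a rough illustration of what "editing as a sequence of 3D transformations" can mean across several rounds, the sketch below composes explicit translation, scaling, and rotation matrices in homogeneous coordinates. This is not FFSE's method: the abstract describes learned transformations inside an autoregressive model and gives no implementation details. All function names here (`translation`, `scaling`, `rotation_y`, `compose`, `apply`) are hypothetical and chosen only for this example.

```python
import math

def identity():
    """Return a 4x4 identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous matrix that moves a point by (tx, ty, tz)."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m

def scaling(s):
    """Homogeneous matrix that scales uniformly by s about the origin."""
    m = identity()
    m[0][0] = m[1][1] = m[2][2] = s
    return m

def rotation_y(degrees):
    """Homogeneous matrix that rotates about the y-axis (right-handed)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    m = identity()
    m[0][0], m[0][2] = c, s
    m[2][0], m[2][2] = -s, c
    return m

def matmul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def compose(rounds):
    """Accumulate per-round edits; each later round is applied after the earlier ones."""
    total = identity()
    for m in rounds:
        total = matmul(m, total)
    return total

def apply(m, point):
    """Apply a 4x4 homogeneous matrix to a 3D point."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))

# Three hypothetical editing rounds: move, then enlarge, then rotate.
rounds = [translation(1.0, 0.0, 0.0), scaling(2.0), rotation_y(90.0)]
pose = compose(rounds)
point = apply(pose, (0.0, 0.0, 0.0))
```

Because the rounds are folded into a single matrix, applying `pose` to a point gives the same result as applying each round's matrix in order, which is the sense in which an object's state stays consistent from one editing round to the next in this simplified picture.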
Related papers
- ShapeUP: Scalable Image-Conditioned 3D Editing [44.63222737714384]
ShapeUP is a scalable, image-conditioned 3D editing framework. It formulates editing as a supervised latent-to-latent translation within a native 3D representation. Our evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity.
arXiv Detail & Related papers (2026-02-05T13:59:16Z)
- Edit3r: Instant 3D Scene Editing from Sparse Unposed Images [40.421700685587346]
We present Edit3r, a framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. We show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines.
arXiv Detail & Related papers (2025-12-31T18:59:53Z)
- 3D-LATTE: Latent Space 3D Editing from Textual Instructions [64.77718887666312]
We propose a training-free editing method that operates within the latent space of a native 3D diffusion model. We guide the edit synthesis by blending 3D attention maps from the generation with the source object.
arXiv Detail & Related papers (2025-08-29T22:51:59Z)
- 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting [31.98493679748211]
We propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits.
arXiv Detail & Related papers (2024-12-02T15:03:55Z)
- 3DEgo: 3D Editing on the Go! [6.072473323242202]
We introduce 3DEgo to address a novel problem of directly synthesizing 3D scenes from monocular videos guided by textual prompts.
Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow.
3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources.
arXiv Detail & Related papers (2024-07-14T07:03:50Z)
- Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts [76.73043724587679]
We propose a dialogue-based 3D scene editing approach, termed CE3D.
Hash-Atlas represents 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images.
Results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects.
arXiv Detail & Related papers (2024-07-09T13:24:42Z)
- DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation [57.406031264184584]
DragGaussian is a 3D object drag-editing framework based on 3D Gaussian Splatting.
Our contributions include the introduction of a new task, the development of DragGaussian for interactive point-based 3D editing, and comprehensive validation of its effectiveness through qualitative and quantitative experiments.
arXiv Detail & Related papers (2024-05-09T14:34:05Z)
- DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing [72.54566271694654]
We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process. This process is often inefficient due to the need for iterative updates of costly 3D representations.
arXiv Detail & Related papers (2024-04-29T17:59:30Z)
- View-Consistent 3D Editing with Gaussian Splatting [50.6460814430094]
View-consistent Editing (VcEdit) is a novel framework that seamlessly incorporates 3DGS into image editing processes. By incorporating consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency.
arXiv Detail & Related papers (2024-03-18T15:22:09Z)
- Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors [24.478875248825563]
We propose a novel image editing technique that enables 3D manipulations on single images.
Our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs.
Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image.
arXiv Detail & Related papers (2024-03-18T06:18:59Z)
- Plasticine3D: 3D Non-Rigid Editing with Text Guidance by Multi-View Embedding Optimization [21.8454418337306]
We propose Plasticine3D, a novel text-guided controlled 3D editing pipeline that can perform 3D non-rigid editing.
Our work divides the editing process into a geometry editing stage and a texture editing stage to achieve separate control of structure and appearance.
For the purpose of fine-grained control, we propose Embedding-Fusion (EF) to blend the original characteristics with the editing objectives in the embedding space.
arXiv Detail & Related papers (2023-12-15T09:01:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.