Fast Multi-view Consistent 3D Editing with Video Priors
- URL: http://arxiv.org/abs/2511.23172v2
- Date: Mon, 01 Dec 2025 12:29:25 GMT
- Title: Fast Multi-view Consistent 3D Editing with Video Priors
- Authors: Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang
- Abstract summary: We propose generative Video Prior based 3D Editing (ViP3DE). Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly. Our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
- Score: 19.790628738739354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
Related papers
- Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing [106.07976338405793]
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm.
We propose RL3DEdit, a single-pass framework driven by reinforcement learning with novel rewards derived from the 3D foundation model, VGGT.
Experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency.
arXiv Detail & Related papers (2026-03-03T16:31:10Z)
- Edit3r: Instant 3D Scene Editing from Sparse Unposed Images [40.421700685587346]
We present Edit3r, a framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images.
We show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines.
arXiv Detail & Related papers (2025-12-31T18:59:53Z)
- C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing [37.439731931558036]
C3Editor is a controllable and consistent 2D-lifting-based 3D editing framework.
Our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results.
Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods.
arXiv Detail & Related papers (2025-10-06T07:07:14Z)
- 3D-LATTE: Latent Space 3D Editing from Textual Instructions [64.77718887666312]
We propose a training-free editing method that operates within the latent space of a native 3D diffusion model.
We guide the edit synthesis by blending 3D attention maps from the generation with the source object.
arXiv Detail & Related papers (2025-08-29T22:51:59Z)
- Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy [48.72918598961146]
We present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing.
Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition.
arXiv Detail & Related papers (2025-06-27T17:59:01Z)
- 3DEgo: 3D Editing on the Go! [6.072473323242202]
We introduce 3DEgo to address a novel problem of directly synthesizing 3D scenes from monocular videos guided by textual prompts.
Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow.
3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources.
arXiv Detail & Related papers (2024-07-14T07:03:50Z)
- DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing [72.54566271694654]
We consider the problem of editing 3D objects and scenes based on open-ended language instructions.
A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process.
This process is often inefficient due to the need for iterative updates of costly 3D representations.
arXiv Detail & Related papers (2024-04-29T17:59:30Z)
- View-Consistent 3D Editing with Gaussian Splatting [50.6460814430094]
View-consistent Editing (VcEdit) is a novel framework that seamlessly incorporates 3DGS into image editing processes.
By incorporating consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency.
arXiv Detail & Related papers (2024-03-18T15:22:09Z)
- GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing [38.948892064761914]
GaussCtrl is a text-driven method to edit a 3D scene reconstructed by 3D Gaussian Splatting (3DGS).
Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image.
arXiv Detail & Related papers (2024-03-13T17:35:28Z)
- Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models [83.97844535389073]
A major obstacle hindering the widespread adoption of 3D content editing is its time-intensive processing.
We propose that by incorporating correspondence regularization into diffusion models, the process of 3D editing can be significantly accelerated.
In most scenarios, our proposed technique brings a 10× speed-up compared to the baseline method and completes the editing of a 3D scene in 2 minutes with comparable quality.
arXiv Detail & Related papers (2023-12-13T23:27:17Z)