Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
- URL: http://arxiv.org/abs/2510.14981v1
- Date: Thu, 16 Oct 2025 17:59:59 GMT
- Title: Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
- Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
- Abstract summary: Pre-trained 2D image editing models can independently produce high-quality edits for each image in a set of multi-view images, but they do not maintain consistency across views. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution.
- Score: 27.884718441352387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
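The coupling idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or code: it assumes two 2D Gaussian targets standing in for the multi-view and edited-image distributions, uses plain Langevin dynamics instead of a pre-trained diffusion model, and assumes a quadratic attraction between the two samples as the coupling term. The names `gaussian_score`, `coupled_langevin`, `mean_gap`, and the parameter `coupling` are hypothetical.

```python
import numpy as np


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of an isotropic Gaussian N(mean, var * I)."""
    return (mean - x) / var


def coupled_langevin(mean_a, mean_b, var=1.0, coupling=0.0,
                     steps=500, step_size=0.01, n=2000, seed=0):
    """Run two Langevin chains side by side, one per toy target distribution.

    Assumption: the coupling is a quadratic attraction between paired samples,
    which adds -coupling * (x_a - x_b) to chain A's drift and the opposite
    term to chain B's drift. coupling=0 gives two independent samplers.
    """
    rng = np.random.default_rng(seed)
    x_a = rng.standard_normal((n, 2))
    x_b = rng.standard_normal((n, 2))
    noise_scale = np.sqrt(2.0 * step_size)
    for _ in range(steps):
        diff = x_a - x_b
        # Each chain follows its own score, plus a pull toward its partner.
        drift_a = gaussian_score(x_a, mean_a, var) - coupling * diff
        drift_b = gaussian_score(x_b, mean_b, var) + coupling * diff
        x_a = x_a + step_size * drift_a + noise_scale * rng.standard_normal(x_a.shape)
        x_b = x_b + step_size * drift_b + noise_scale * rng.standard_normal(x_b.shape)
    return x_a, x_b


def mean_gap(x_a, x_b):
    """Average Euclidean distance between paired samples of the two chains."""
    return float(np.linalg.norm(x_a - x_b, axis=1).mean())


if __name__ == "__main__":
    mean_a = np.array([-1.0, 0.0])  # toy stand-in for the multi-view distribution
    mean_b = np.array([1.0, 0.0])   # toy stand-in for the edited-image distribution
    for strength in (0.0, 5.0):
        x_a, x_b = coupled_langevin(mean_a, mean_b, coupling=strength)
        print(f"coupling={strength}: mean paired distance = {mean_gap(x_a, x_b):.3f}")
```

In this toy setting, a larger `coupling` pulls paired samples closer together while each chain is still driven by its own score, which loosely mirrors the trade-off the abstract describes between edit quality and cross-view consistency.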
Related papers
- MEt3R: Measuring Multi-View Consistency in Generated Images [47.152540564255204]
We introduce MEt3R, a metric for multi-view consistency in generated images. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner.
arXiv Detail & Related papers (2025-01-10T20:43:33Z)
- Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion [63.81544586407943]
Single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations.
We propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits.
Experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image.
arXiv Detail & Related papers (2024-11-15T17:19:18Z)
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
- Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z)
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data we use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z)
- Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z)
- MVDream: Multi-view Diffusion for 3D Generation [14.106283556521962]
We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt.
Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings.
arXiv Detail & Related papers (2023-08-31T07:49:06Z)
- Collaborative Score Distillation for Consistent Visual Synthesis [70.29294250371312]
Collaborative Score Distillation (CSD) is based on Stein Variational Gradient Descent (SVGD).
We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes.
Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
arXiv Detail & Related papers (2023-07-04T17:31:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.