Consolidating Attention Features for Multi-view Image Editing
- URL: http://arxiv.org/abs/2402.14792v1
- Date: Thu, 22 Feb 2024 18:50:18 GMT
- Title: Consolidating Attention Features for Multi-view Image Editing
- Authors: Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre
- Abstract summary: We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
- Score: 126.19731971010475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image models enable a wide range of image editing
techniques, using text prompts or even spatial controls. However, applying
these editing methods to multi-view images depicting a single scene leads to
3D-inconsistent results. In this work, we focus on spatial control-based
geometric manipulations and introduce a method to consolidate the editing
process across various views. We build on two insights: (1) maintaining
consistent features throughout the generative process helps attain consistency
in multi-view editing, and (2) the queries in self-attention layers
significantly influence the image structure. Hence, we propose to improve the
geometric consistency of the edited images by enforcing the consistency of the
queries. To do so, we introduce QNeRF, a neural radiance field trained on the
internal query features of the edited images. Once trained, QNeRF can render
3D-consistent queries, which are then softly injected back into the
self-attention layers during generation, greatly improving multi-view
consistency. We refine the process through a progressive, iterative method that
better consolidates queries across the diffusion timesteps. We compare our
method to a range of existing techniques and demonstrate that it can achieve
better multi-view consistency and higher fidelity to the input scene. These
advantages allow us to train NeRFs that exhibit fewer visual artifacts and are
better aligned with the target geometry.
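To make the query-injection idea concrete, the following is a minimal PyTorch sketch of a self-attention step that softly blends each view's own queries with 3D-consistent queries rendered by a QNeRF-like field. The blending weight `alpha`, the tensor shapes, and the projection interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def soft_query_injection(x, w_q, w_k, w_v, rendered_q, alpha=0.7):
    """Self-attention in which the queries computed from the current view
    are softly blended with 3D-consistent queries rendered by a radiance
    field trained on query features (QNeRF-style sketch).

    x:          (batch, tokens, dim) latent features of one view
    rendered_q: (batch, tokens, dim) queries rendered for that view
    alpha:      blending weight (an assumed value, not the paper's)
    """
    q = x @ w_q                       # per-view queries
    k = x @ w_k                       # keys stay untouched
    v = x @ w_v                       # values stay untouched

    # Soft injection: interpolate toward the consolidated queries instead
    # of replacing the per-view ones outright.
    q = alpha * rendered_q + (1.0 - alpha) * q

    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```

Since the abstract describes a progressive, iterative consolidation across diffusion timesteps, one would expect the rendered queries (and possibly the blend) to be refreshed as denoising proceeds rather than computed once.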
Related papers
- SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures.
arXiv Detail & Related papers (2024-06-25T09:17:35Z)
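One plausible reading of geometrically consistent noise prediction is to average the per-view noise estimates over pixels that observe the same surface point. The sketch below assumes a precomputed pixel-to-surface correspondence map; the names and interface are hypothetical, not SyncNoise's actual design.

```python
import torch

def consolidate_noise(eps_views, pix_to_surf, num_points):
    """Scatter-average noise predictions over shared surface points and
    write the averages back into every view (a hedged illustration).

    eps_views:   (V, C, H, W) noise predicted independently per view
    pix_to_surf: (V, H, W)    long tensor; surface point seen by each pixel
    """
    V, C, H, W = eps_views.shape
    flat_eps = eps_views.permute(0, 2, 3, 1).reshape(-1, C)  # (V*H*W, C)
    flat_idx = pix_to_surf.reshape(-1)                       # (V*H*W,)

    sums = torch.zeros(num_points, C).index_add_(0, flat_idx, flat_eps)
    counts = torch.zeros(num_points).index_add_(
        0, flat_idx, torch.ones(flat_idx.shape[0]))
    means = sums / counts.clamp(min=1.0).unsqueeze(-1)

    # Every pixel now reads the consolidated noise of its surface point.
    return means[flat_idx].reshape(V, H, W, C).permute(0, 3, 1, 2)
```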
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet specific user requirements.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of existing approaches while relying only on a basic 2D text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z)
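A common way to coax multi-view or multi-frame consistency out of a plain 2D T2I model is to let every view attend to a shared reference view's keys and values while keeping its own queries. Whether this matches the paper's disentangled injection exactly is an assumption; the sketch below only illustrates the generic attention-sharing mechanism.

```python
import torch

def shared_kv_attention(q, k, v, ref=0):
    """Each view keeps its own queries (preserving per-view structure)
    but attends to the reference view's keys and values (propagating a
    shared appearance). A generic sketch, not the paper's exact module.

    q, k, v: (views, tokens, dim)
    """
    k_ref = k[ref].expand_as(k)       # broadcast reference keys
    v_ref = v[ref].expand_as(v)       # broadcast reference values
    scores = q @ k_ref.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v_ref
```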
- View-Consistent 3D Editing with Gaussian Splatting [50.6460814430094]
View-consistent Editing (VcEdit) is a novel framework that seamlessly incorporates 3DGS into image editing processes.
By incorporating consistency modules into an iterative pattern, VcEdit effectively resolves the issue of multi-view inconsistency.
arXiv Detail & Related papers (2024-03-18T15:22:09Z)
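An iterative pattern of this kind can be sketched as alternating between per-view 2D edits and a 3DGS fit that fuses them, with renders of the fitted scene seeding the next round. The callables below are placeholder interfaces, not VcEdit's API.

```python
def iterative_consistent_edit(views, edit_view, fit_gaussians, render_view,
                              rounds=3):
    """Alternate 2D editing with 3D consolidation (hedged sketch).

    edit_view:     image -> edited image        (any 2D diffusion editor)
    fit_gaussians: list of images -> 3DGS scene (placeholder)
    render_view:   (scene, view index) -> image (placeholder)
    """
    for _ in range(rounds):
        edited = [edit_view(v) for v in views]         # per-view 2D edits
        scene = fit_gaussians(edited)                  # fuse edits in 3D
        views = [render_view(scene, i) for i in range(len(edited))]
    return views
```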
- GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing [38.948892064761914]
GaussCtrl is a text-driven method for editing a 3D scene reconstructed with 3D Gaussian Splatting (3DGS).
Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image.
arXiv Detail & Related papers (2024-03-13T17:35:28Z)
- Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks.
We provide an exhaustive overview of existing methods using diffusion models for image editing.
To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
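As a rough illustration of noise-space spatial control, one can give each bounding box its own correlated noise component in the initial latent, nudging the sampler toward one object per box. The blending rule and `strength` value are assumptions for illustration, not the paper's vision-guidance mechanism.

```python
import torch

def boxed_init_noise(shape, boxes, strength=0.3):
    """Compose an initial latent whose bounding-box regions carry extra,
    region-specific noise (a hedged sketch of spatial guidance).

    shape: (C, H, W) latent shape
    boxes: list of (x0, y0, x1, y1) in latent coordinates
    """
    latent = torch.randn(shape)
    norm = ((1.0 - strength) ** 2 + strength ** 2) ** 0.5
    for x0, y0, x1, y1 in boxes:
        region = torch.randn(shape[0], y1 - y0, x1 - x0)
        latent[:, y0:y1, x0:x1] = (
            (1.0 - strength) * latent[:, y0:y1, x0:x1]
            + strength * region) / norm  # keep unit variance for the sampler
    return latent
```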
This list is automatically generated from the titles and abstracts of the papers on this site.