Training-Free Text-Guided Image Editing with Visual Autoregressive Model
- URL: http://arxiv.org/abs/2503.23897v1
- Date: Mon, 31 Mar 2025 09:46:56 GMT
- Title: Training-Free Text-Guided Image Editing with Visual Autoregressive Model
- Authors: Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang
- Abstract summary: We propose a novel text-guided image editing framework based on Visual AutoRegressive modeling. Our method eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds.
- Score: 46.201510044410995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as fast as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.
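The cache, mask, and reassembly steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of total variation distance to compare per-token distributions, the fixed threshold, and the argmax choice for edited positions are all assumptions made for illustration only.

```python
from typing import List, Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def build_edit_mask(
    source_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    threshold: float,
) -> List[bool]:
    """Mark a token position as editable when the cached source-prompt
    distribution and the target-prompt distribution disagree strongly.
    (Assumed criterion; the paper's masking rule may differ.)"""
    return [
        total_variation(p, q) > threshold
        for p, q in zip(source_probs, target_probs)
    ]


def reassemble_tokens(
    cached_tokens: Sequence[int],
    target_probs: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> List[int]:
    """Keep cached token indices outside the mask; inside the mask, take the
    most likely token under the target prompt (assumed selection rule)."""
    edited = []
    for token, probs, editable in zip(cached_tokens, target_probs, mask):
        if editable:
            edited.append(max(range(len(probs)), key=lambda i: probs[i]))
        else:
            edited.append(token)
    return edited


# Toy cache for three token positions over a two-entry codebook.
cached_tokens = [0, 1, 0]
source_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
# Hypothetical distributions obtained with the target prompt.
target_probs = [[0.88, 0.12], [0.9, 0.1], [0.5, 0.5]]

mask = build_edit_mask(source_probs, target_probs, threshold=0.3)
edited_tokens = reassemble_tokens(cached_tokens, target_probs, mask)
print(mask, edited_tokens)
```

In this toy setup only the second position changes, because it is the only one whose distribution shifts noticeably under the target prompt; the other positions reuse the cached indices, which mirrors the abstract's goal of confining edits to relevant regions.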
Related papers
- Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation [7.218556478126324]
Diffusion models have demonstrated superior performance in generating diverse, high-quality images for text-guided image translation. We propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2025-03-26T12:15:25Z) - Lost in Edits? A $λ$-Compass for AIGC Provenance [119.95562081325552]
We propose a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones.
LambdaTracer is effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix or performed manually with editing software such as Adobe Photoshop.
arXiv Detail & Related papers (2025-02-05T06:24:25Z) - Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing [66.48853049746123]
We analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios.
arXiv Detail & Related papers (2024-11-29T12:11:28Z) - Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing [28.904419606450876]
We present a Vision-guided and Mask-enhanced Adaptive Editing (ViMAEdit) method with three key novel designs.
First, we propose to leverage image embeddings as explicit guidance to enhance the conventional textual prompt-based denoising process.
Second, we devise a self-attention-guided iterative editing area grounding strategy.
arXiv Detail & Related papers (2024-10-14T13:41:37Z) - Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing [42.73883397041092]
We propose a novel approach that is built upon a modified diffusion sampling process via the guidance mechanism.
In this work, we explore the self-guidance technique to preserve the overall structure of the input image.
We show through human evaluation and quantitative analysis that the proposed method can produce the desired edits.
arXiv Detail & Related papers (2024-09-02T15:21:46Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing [2.5602836891933074]
A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image.
Current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image.
We introduce a novel and adaptable diffusion inversion technique for real image editing, which is grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation for enhanced editability.
arXiv Detail & Related papers (2024-03-14T15:07:36Z) - BARET: Balanced Attention based Real image Editing driven by Target-text Inversion [36.59406959595952]
We propose a novel editing technique that requires only an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model.
Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without image caption and acceleration of convergence; (II) Progressive Transition Scheme applies progressive linear approaches between target text embedding and its fine-tuned version to generate transition embedding for maintaining non-rigid editing capability; (III) Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics.
arXiv Detail & Related papers (2023-12-09T07:18:23Z) - Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.