UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
- URL: http://arxiv.org/abs/2508.03142v1
- Date: Tue, 05 Aug 2025 06:42:09 GMT
- Title: UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
- Authors: Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang
- Abstract summary: We introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability. We implement our method on top of the latest BLIP3-o and achieve state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
- Score: 64.5307229755533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis about OpenAI's GPT-4o suggests a promising generation pipeline: Understanding VLM → Visual Feature → Projector → Diffusion Model → Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline preserves the strong capability of the understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLMs, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I that equips the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. (1) The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. (2) The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. (3) The verifying step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method on top of the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
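To make the control flow of the abstract concrete, here is a minimal Python sketch of the understanding, editing, and verifying loop. The function names, callable signatures, `max_iters`, and `score_threshold` are illustrative assumptions, not the authors' BLIP3-o implementation; the actual method couples this loop to a frozen understanding VLM and a diffusion decoder with a time-adaptive offset.

```python
from typing import Any, Callable, Tuple

def uniedit_i_loop(
    source_image: Any,
    instruction: str,
    understand: Callable[[Any, str], Tuple[str, str]],  # (image, instruction) -> (source_prompt, target_prompt)
    edit: Callable[[Any, str, Any], Any],               # (image, target_prompt, feedback) -> edited image
    verify: Callable[[Any, str], Tuple[float, Any]],    # (image, target_prompt) -> (consistency score, feedback)
    max_iters: int = 5,
    score_threshold: float = 0.9,
) -> Any:
    """Iterate understanding -> editing -> verifying until the consistency
    score passes the threshold or the iteration budget runs out."""
    # 1. Understanding: structured semantic analysis of the source image yields
    #    the source prompt; minimal word replacements produce the target prompt.
    _source_prompt, target_prompt = understand(source_image, instruction)

    edited, feedback = source_image, None
    for _ in range(max_iters):
        # 2. Editing: one denoising pass; the time-adaptive offset lets early
        #    steps make coarse semantic changes and later steps refine details.
        edited = edit(edited, target_prompt, feedback)

        # 3. Verifying: score alignment between the target prompt and the
        #    intermediate result, and collect corrective feedback for the next pass.
        score, feedback = verify(edited, target_prompt)
        if score >= score_threshold:  # early stop once the edit is consistent
            break
    return edited
```

In the paper's setting, `understand` and `verify` would be served by the frozen understanding VLM, and `edit` by the diffusion-based generation head with the time-adaptive offset; the placeholders above only capture the iteration structure.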
Related papers
- DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models [26.762431651154607]
We propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines.
arXiv Detail & Related papers (2025-06-16T16:04:16Z) - MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection [13.467269066605452]
We propose MIND-Edit, an end-to-end image-editing framework integrating a pretrained diffusion model with an MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.
arXiv Detail & Related papers (2025-05-25T13:54:31Z) - SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing [25.8179737362091]
Existing datasets are typically constructed using various automated methods, leading to noisy supervision signals. Recent efforts attempt to improve editing models by generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs), but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs.
arXiv Detail & Related papers (2025-05-05T05:19:40Z) - ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities within a single model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z) - FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts [17.376346967267327]
We propose LiveEdit, a LIfelong Vision language modEl Edit to bridge the gap between lifelong LLM editing and Vision LLM editing. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby eliminating visually irrelevant experts for input queries. To integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion.
arXiv Detail & Related papers (2024-11-23T03:19:40Z) - Real-time 3D-aware Portrait Editing from a Single Image [111.27169315556444]
3DPE can edit a face image following given prompts, like reference images or text descriptions.
A lightweight module is distilled from a 3D portrait generator and a text-to-image model.
arXiv Detail & Related papers (2024-02-21T18:36:26Z) - Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE)
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z) - EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high quality, high precision semantic image editing.
We show that EditGAN can manipulate images with an unprecedented level of detail and freedom.
We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)