Related papers: Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

URL: http://arxiv.org/abs/2510.16888v3
Date: Tue, 04 Nov 2025 13:15:36 GMT
Title: Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan,
Abstract summary: We introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization.<n>We employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback.<n>Our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models.
Score: 41.41713036839503
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. \texttt{UniWorld-V2}, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.

Related papers

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward [64.78078130943489]
We introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a reward model.<n>We show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
arXiv Detail & Related papers (2026-02-19T17:11:59Z)
FAIL: Flow Matching Adversarial Imitation Learning for Image Generation [52.643484089126844]
Post-training of flow matching models-aligning the output distribution with a high-quality target-is mathematically equivalent to Imitation learning.<n>We propose Flow Matching Adrial Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons.
arXiv Detail & Related papers (2026-02-12T16:36:33Z)
Learning an Image Editing Model without Image Editing Pairs [83.03646586929638]
Recent image editing models have achieved impressive results while following natural language editing instructions.<n>They rely on supervised fine-tuning with large datasets of input-target pairs.<n>Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models.<n>We present a new training paradigm that eliminates the need for paired data entirely.
arXiv Detail & Related papers (2025-10-16T17:59:57Z)
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling [71.8265422228785]
Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been hindered by the lack of a high-fidelity, efficient reward signal.<n>We present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model.
arXiv Detail & Related papers (2025-09-28T14:28:24Z)
Visual Autoregressive Modeling for Instruction-Guided Image Editing [97.04821896251681]
We present a visual autoregressive framework that reframes image editing as a next-scale prediction problem.<n>VarEdit generates multi-scale target features to achieve precise edits.<n>It completes a $512times512$ editing in 1.2 seconds, making it 2.2$times$ faster than the similarly sized UltraEdit.
arXiv Detail & Related papers (2025-08-21T17:59:32Z)
Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [33.127607245587576]
We introduce a framework that enables MLLMs to learn complex visual reasoning from only raw images.<n>We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning.<n>The RRVF-trained model not only outperforms existing MLLMs and supervised fine-tuning baselines but also exhibits superior generalization.
arXiv Detail & Related papers (2025-07-28T12:21:19Z)
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [48.15777554876988]
Traditional alignment methods often require retraining large pretrained models.<n>We propose a novel textitResidual Alignment Model (textitRAM) that formalizes the alignment process as a type of importance sampling.<n>We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z)
Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback to improve model alignment. We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
One-Step Image Translation with Text-to-Image Models [35.0987002313882]
We introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. We consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights. Our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks.
arXiv Detail & Related papers (2024-03-18T17:59:40Z)
Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks.<n>We provide an exhaustive overview of existing methods using diffusion models for image editing.<n>To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z)
A Unified Conditional Framework for Diffusion-based Image Restoration [39.418415473235235]
We present a unified conditional framework based on diffusion models for image restoration. We leverage a lightweight UNet to predict initial guidance and the diffusion model to learn the residual of the guidance. To handle high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy.
arXiv Detail & Related papers (2023-05-31T17:22:24Z)
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models [97.31200133440308]
We propose using online reinforcement learning to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem. Our approach, coined DPOK, integrates policy optimization with KL regularization.
arXiv Detail & Related papers (2023-05-25T17:35:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.