FuguReport

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Authors: Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu
Affiliations: Apple / The University of Texas at Austin / Lambda, Inc / University of California, Los Angeles
Categories: Method / Reinforcement Learning / Reward signal optimization with flow matching; Application / Image Editing / Multi-turn image editing framework; Evaluation / Model Evaluation / Performance improvement across various base models
License: CC BY 4.0

Abstract Overview

MT-EditFlow is a reinforcement learning framework for multi-turn image editing built on flow-matching models. The paper argues that open-source image editors trained mainly for single-turn edits degrade in sequential settings, because errors propagate across turns and a single failed step can ruin the whole sequence. To address this, the method combines a multi-turn formulation with two reward components (instruction following and content consistency) and studies how reward aggregation, evaluator prompting mode, and fusion strategy affect training. The framework is designed to work with both GRPO- and DiffusionNFT-style reinforcement learning methods, and it uses trajectory-level advantage broadcasting to align each local edit with overall multi-turn success.
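
Trajectory-level advantage broadcasting can be illustrated with a small sketch. This is not the paper's implementation: the function names, the mean aggregation over turns, and the GRPO-style group normalization shown here are assumptions chosen for illustration. The idea is that each sampled multi-turn trajectory receives one scalar advantage, which is then copied to every turn in that trajectory.

```python
from statistics import mean, pstdev


def aggregate_trajectory_reward(per_turn_rewards):
    """Collapse per-turn rewards into one trajectory score.

    Using the mean is an assumption; the paper studies aggregation choices.
    """
    return mean(per_turn_rewards)


def broadcast_advantages(group_rewards, eps=1e-6):
    """Group-normalize trajectory scores, then broadcast to every turn.

    group_rewards: list of trajectories sampled for the same editing task,
    each given as a list of per-turn rewards (hypothetical layout).
    Returns a list of lists with one advantage per turn; all turns of a
    trajectory share the same value.
    """
    scores = [aggregate_trajectory_reward(r) for r in group_rewards]
    mu, sigma = mean(scores), pstdev(scores)
    advantages = [(s - mu) / (sigma + eps) for s in scores]
    return [[a] * len(r) for a, r in zip(advantages, group_rewards)]


# Toy group of three 3-turn trajectories (made-up rewards).
group = [
    [0.9, 0.8, 0.7],  # stays good across turns
    [0.9, 0.2, 0.1],  # fails at turn 2 and never recovers
    [0.5, 0.5, 0.5],  # mediocre throughout
]
adv = broadcast_advantages(group)
```

In this toy group the second trajectory starts as well as the first, yet all three of its turns receive a negative advantage, because the score is assigned at the trajectory level rather than per turn.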

Novelty

The paper's main novelty is a unified reward-signal design for multi-turn image editing under flow-matching reinforcement learning, rather than the usual single-turn, single-reward setup. It also introduces and analyzes specific design choices for this setting, including multi-turn reward aggregation, advantage-level fusion of instruction-following and content-consistency signals, and trajectory-level advantage broadcasting.
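
The difference between fusing at the reward level and at the advantage level can be sketched as follows. The function names, the equal weighting, and the toy score scales are assumptions for illustration and are not taken from the paper. With advantage-level fusion, each signal is normalized within the sample group before the two are combined, so a signal with a larger numeric range does not dominate.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Standardize values within a sample group (GRPO-style)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def reward_level_fusion(instr, cons, w=0.5):
    """Mix raw rewards first, then normalize the mixture."""
    fused = [w * i + (1 - w) * c for i, c in zip(instr, cons)]
    return group_normalize(fused)


def advantage_level_fusion(instr, cons, w=0.5):
    """Normalize each reward separately, then mix the advantages."""
    a_instr, a_cons = group_normalize(instr), group_normalize(cons)
    return [w * i + (1 - w) * c for i, c in zip(a_instr, a_cons)]


# Made-up scores for four samples. Instruction following is on a 0-10
# scale and content consistency on a 0-1 scale (both hypothetical).
instr = [0.0, 10.0, 5.0, 5.0]
cons = [0.9, 0.1, 0.5, 0.5]

reward_fused = reward_level_fusion(instr, cons)
adv_fused = advantage_level_fusion(instr, cons)
```

In this toy case the second sample follows the instruction well but damages the image content. Reward-level fusion still gives it the highest advantage because the 0-10 signal swamps the 0-1 signal, whereas advantage-level fusion lets the two signals cancel and leaves its advantage near zero.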

Results

On EdiVal-Bench, MT-EditFlow raises the turn-3 overall score of FLUX.1-Kontext-dev by 6.85 points and that of FLUX.2-klein-base-9B by 2.90 points, with gains especially pronounced at later turns. The reported FLUX.1-Kontext-dev result also exceeds the open-source Qwen-Image-Edit baseline on turn-3 overall score. The method additionally yields modest single-turn gains on ImgEdit-Bench and shows a flatter decay in success rate across turns, indicating reduced exposure bias.
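
One simple way to read "flatter success decay" is to compare the drop in success rate between the first and last turn. The sketch below uses made-up boolean outcomes and hypothetical helper names; it does not reproduce EdiVal-Bench scoring or any number from the paper.

```python
def per_turn_success(outcomes):
    """Fraction of samples judged successful at each turn.

    outcomes: list of per-sample lists of booleans, one entry per turn
    (hypothetical layout).
    """
    n_turns = len(outcomes[0])
    return [
        sum(sample[t] for sample in outcomes) / len(outcomes)
        for t in range(n_turns)
    ]


def decay(rates):
    """Drop in success rate from the first turn to the last turn."""
    return rates[0] - rates[-1]


# Toy outcomes for four samples over three turns (not real results).
baseline = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
tuned = [
    [True, True, True],
    [True, True, False],
    [True, True, True],
    [True, False, False],
]

baseline_rates = per_turn_success(baseline)
tuned_rates = per_turn_success(tuned)
```

A smaller `decay` value for the tuned model than for the baseline corresponds to the flatter curve described above.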

Key Points

  1. MT-EditFlow extends flow-matching RL to sequential image editing by optimizing both instruction following and content consistency over multi-turn trajectories.
  2. The paper finds that fine-grained per-turn scoring, thinking-mode VLM evaluation, and advantage-level fusion provide more effective reward signals than sparser or less normalized alternatives.
  3. Experiments show stronger multi-turn robustness on open-source backbones, with larger improvements at later turns and evidence of reduced error propagation across the editing chain.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.