Fugu-MT Paper Translation (Abstract): Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Paper overview: Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

arxiv url: http://arxiv.org/abs/2606.05950v1
Date: Thu, 04 Jun 2026 09:49:47 GMT
Status: Translation complete
System update date: 2026-06-06 06:55:34.649868
Title: Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing
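Title (reference translation): Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing
Authors: Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan
Abstract summary: We introduce Edit-R2, a novel reinforcement learning framework for unified multimodal models. It reconstructs the operative session intent, consolidating scattered historical constraints into an explicit reasoning trace before each editing turn. It achieves competitive performance against strong baselines.
Reference score (independently computed attention score): 42.176441824728066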
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.
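The abstract says a trajectory filtering mechanism suppresses corrupted rollouts, but it does not specify the criterion. The following is a minimal sketch of one way such a filter could work, not the authors' method: it assumes each rollout carries a per-turn scalar reward, drops any rollout containing a turn below a hypothetical threshold, and then computes group-normalized advantages over the survivors. The names `filter_trajectories`, `group_advantages`, `turn_rewards`, and `min_turn_reward` are all illustrative assumptions.

```python
from statistics import mean, pstdev


def filter_trajectories(rollouts, min_turn_reward=0.3):
    """Keep rollouts whose every turn scores at least min_turn_reward.

    Assumption: a single low-reward turn marks the rollout as contaminated.
    The threshold value is hypothetical, not taken from the paper.
    """
    return [r for r in rollouts if min(r["turn_rewards"]) >= min_turn_reward]


def group_advantages(rollouts):
    """Normalize each rollout's mean reward against the group mean and std."""
    returns = [mean(r["turn_rewards"]) for r in rollouts]
    mu = mean(returns)
    sigma = pstdev(returns)
    if sigma == 0:
        # All surviving rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in returns]
    return [(g - mu) / sigma for g in returns]


# Toy three-turn editing sessions with made-up rewards.
rollouts = [
    {"id": "a", "turn_rewards": [0.9, 0.8, 0.7]},
    {"id": "b", "turn_rewards": [0.8, 0.1, 0.2]},  # a bad second turn degrades the third
    {"id": "c", "turn_rewards": [0.6, 0.5, 0.4]},
]

kept = filter_trajectories(rollouts)
advantages = group_advantages(kept)

for rollout, adv in zip(kept, advantages):
    print(rollout["id"], round(adv, 3))
```

In this toy data, rollout `b` is discarded because an early mistake drags down its later turns, so it no longer influences the advantage estimates of the remaining rollouts.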
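Abstract (reference translation): Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods are limited to single-turn settings and overlook the more realistic scenario of multi-turn in-context editing, in which users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving the accumulated session-level constraints, and it faces two coupled failure modes: long-context dilution, where sparse textual constraints become hard to recover from a growing interleaved image-text history, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, consolidating scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance against strong baselines.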

Related papers