Fugu-MT Paper Translation (Abstract): IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

Paper overview: IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection

arxiv url: http://arxiv.org/abs/2603.29602v1
Date: Thu, 12 Feb 2026 02:37:38 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:13.149194
Title: IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection
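Title (reference translation): IMAGAgent: Orchestration of Multi-Turn Image Editing Through Constraint-Aware Planning and Reflection
Authors: Fei Shen, Chengyu Xie, Lihong Wang, Zhanyi Zhang, Xin Jiang, Xiaoyu Du, Jinhui Tang
Abstract summary: IMAGAgent is a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism. It achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Experiments on the authors' MTEditBench benchmark and the MagicBrush dataset show that IMAGAgent performs significantly better than existing methods.
Reference score (attention score computed independently by the site): 40.21337735524356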
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Lacking context awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, which ultimately results in severe structural distortion of the generated images. To address this, we propose IMAGAgent, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline. Specifically, we first present a constraint-aware planning module that leverages a vision-language model (VLM) to decompose complex natural language instructions into a series of executable sub-tasks, governed by three constraints: target singularity, semantic atomicity, and visual perceptibility. Then, the tool-chain orchestration module dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and collaboration among heterogeneous operation models covering image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism in which a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, which both triggers fine-grained self-correction and is recorded to optimize future decisions. Extensive experiments on our constructed MTEditBench and the MagicBrush dataset demonstrate that IMAGAgent significantly outperforms existing methods in terms of instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.
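Abstract (reference translation): Existing multi-turn image editing paradigms are often limited to isolated single-step execution. Because they lack context awareness and a closed-loop feedback mechanism, errors accumulate and semantic drift occurs easily during multi-turn interaction, which ultimately leads to severe structural distortion of the generated images. We therefore propose IMAGAgent, a multi-turn image editing agent framework based on a "plan-execute-reflect" closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction. Specifically, we first propose a constraint-aware planning module that uses a vision-language model (VLM) to decompose complex natural language instructions into a series of executable sub-tasks governed by target singularity, semantic atomicity, and visual perceptibility. The tool-chain orchestration module then dynamically constructs execution paths based on the current image, the current sub-task, and the historical context, enabling adaptive scheduling and cooperative operation among heterogeneous operation models that cover image retrieval, segmentation, detection, and editing. Finally, we devise a multi-expert collaborative reflection mechanism in which a central large language model (LLM) receives the image to be edited and synthesizes VLM critiques into holistic feedback, triggering fine-grained self-correction and recording the feedback outcomes to optimize future decisions. Extensive experiments on the constructed MTEditBench and the MagicBrush dataset show that IMAGAgent performs significantly better than existing methods in instruction consistency, editing precision, and overall quality. The code is available at https://github.com/hackermmzz/IMAGAgent.git.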
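
The "plan-execute-reflect" closed loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and it does not use the API of the linked repository: the function `run_closed_loop`, its parameters, and the three toy stand-ins are all hypothetical names chosen for illustration. The "image" is modeled as a plain list of applied edits so the example runs without any model, and the retry policy (accept a result only when reflection approves it) is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Image = List[str]  # toy stand-in for image data: the list of edits applied so far
Record = Tuple[str, int, bool, Optional[str]]  # (sub_task, attempt, ok, feedback)


def run_closed_loop(
    image: Image,
    instruction: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[Image, str, List[Record], Optional[str]], Image],
    reflect: Callable[[Image, Image, str], Tuple[bool, Optional[str]]],
    max_retries: int = 2,
) -> Tuple[Image, List[Record]]:
    """Hypothetical plan-execute-reflect loop (illustrative only)."""
    history: List[Record] = []
    for sub_task in plan(instruction):          # plan: decompose into sub-tasks
        feedback: Optional[str] = None
        for attempt in range(max_retries + 1):
            # execute: uses current image, current sub-task, history, feedback
            candidate = execute(image, sub_task, history, feedback)
            # reflect: critique the candidate against the sub-task
            ok, feedback = reflect(image, candidate, sub_task)
            history.append((sub_task, attempt, ok, feedback))  # record outcome
            if ok:
                image = candidate               # assumption: accept only on approval
                break
    return image, history


def toy_plan(instruction: str) -> List[str]:
    # Toy planner: treats ';' as the boundary between atomic sub-tasks.
    return [part.strip() for part in instruction.split(";") if part.strip()]


def toy_execute(image, sub_task, history, feedback) -> Image:
    # Toy executor: the first attempt is deliberately imprecise (marked with '?');
    # once feedback is available, the edit is applied correctly.
    applied = sub_task if feedback else sub_task + "?"
    return image + [applied]


def toy_reflect(before, after, sub_task) -> Tuple[bool, Optional[str]]:
    # Toy critic: approves only if the last applied edit matches the sub-task.
    ok = after[-1] == sub_task
    return (ok, None if ok else f"redo '{sub_task}' more precisely")
```

Calling `run_closed_loop([], "remove the cat; make the sky blue", toy_plan, toy_execute, toy_reflect)` makes each sub-task fail once and succeed on the retry, so the returned history holds four records. In a real system the three callables would wrap a VLM planner, a tool-chain scheduler, and an LLM-based critic, respectively.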

Related papers