Fugu-MT paper translation (abstract): An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Paper summary: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

arxiv url: http://arxiv.org/abs/2508.17435v1
Date: Sun, 24 Aug 2025 16:28:18 GMT
Status: translation complete
Last updated in system: 2025-08-26 18:43:45.521839
Title: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
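Title (reference translation): LLM-LVLM-Driven Agent for Iterative and Fine-Grained Image Editing
Authors: Zihan Liang, Jiahao Sun, Haoran Ma
Abstract summary: RefineEdit-Agent is a novel, training-free intelligent agent framework for complex, iterative, and context-aware image editing. The framework consists of an LVLM-driven instruction parser and scene understanding module, a multi-level editing planner, an iterative image editing module, and an LVLM-driven feedback and evaluation loop.
Reference score (independently computed attention level): 5.192553173010677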
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
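Abstract (reference translation): Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often require fine-grained, iterative image editing that existing methods struggle to provide. The main challenges are granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. RefineEdit-Agent is a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent combines the planning capabilities of Large Language Models (LLMs) with the visual understanding and evaluation capabilities of Vision-Language Large Models (LVLMs) in a closed-loop system. The framework consists of an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and an LVLM-driven feedback and evaluation loop. To evaluate RefineEdit-Agent rigorously, the authors propose LongBench-T2I-Edit, a new benchmark of 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the effectiveness of the agentic design in delivering superior edit fidelity and context preservation.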
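
The closed-loop structure described in the abstract (parse, plan, edit, evaluate, then re-plan from feedback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`refine_edit_loop`, `EditStep`, `Feedback`, the `toy_*` stubs), the stopping threshold of 4.0, the round limit, and the 0 to 5 scoring in the toy evaluator are assumptions made for illustration. The LLM, LVLM, and editing tools are replaced by plain functions, and an "image" is just the tuple of edits applied so far.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = Tuple[str, ...]  # toy stand-in: the tuple of edits applied so far


@dataclass
class EditStep:
    tool: str    # hypothetical tool name chosen by the planner
    prompt: str  # instruction handed to that tool


@dataclass
class Feedback:
    score: float   # evaluator's rating of the current image
    critique: str  # text fed back to the planner; empty when nothing is missing


def refine_edit_loop(
    image: Image,
    instruction: str,
    parse: Callable[[Image, str], List[str]],
    plan: Callable[[List[str], str], List[EditStep]],
    apply_step: Callable[[Image, EditStep], Image],
    evaluate: Callable[[Image, str], Feedback],
    threshold: float = 4.0,  # assumed stopping score, not from the paper
    max_rounds: int = 3,     # assumed iteration budget, not from the paper
) -> Tuple[Image, List[Feedback]]:
    goals = parse(image, instruction)          # instruction and scene understanding
    history: List[Feedback] = []
    critique = ""
    for _ in range(max_rounds):
        for step in plan(goals, critique):     # planner decomposes goals into steps
            image = apply_step(image, step)    # editing module applies each step
        feedback = evaluate(image, instruction)  # evaluator scores the result
        history.append(feedback)
        if feedback.score >= threshold:
            break
        critique = feedback.critique           # close the loop: re-plan from critique
    return image, history


# --- toy stubs standing in for the LLM, LVLM, and editing tools ---
def split_goals(instruction: str) -> List[str]:
    return [part.strip() for part in instruction.split(",") if part.strip()]


def toy_parse(image: Image, instruction: str) -> List[str]:
    return split_goals(instruction)


def toy_plan(goals: List[str], critique: str) -> List[EditStep]:
    # First round: deliberately under-plan (first goal only).
    # Later rounds: plan exactly the goals the critique names.
    pending = [g for g in goals if g in critique] if critique else goals[:1]
    return [EditStep(tool="inpaint", prompt=g) for g in pending]


def toy_apply(image: Image, step: EditStep) -> Image:
    return image + (step.prompt,)


def toy_evaluate(image: Image, instruction: str) -> Feedback:
    goals = split_goals(instruction)
    missing = [g for g in goals if g not in image]
    score = 5.0 * (len(goals) - len(missing)) / len(goals)
    critique = "missing: " + "; ".join(missing) if missing else ""
    return Feedback(score=score, critique=critique)


final_image, history = refine_edit_loop(
    (), "make sky red, add a bird", toy_parse, toy_plan, toy_apply, toy_evaluate
)
print(final_image, [fb.score for fb in history])
```

In this toy run the first round applies only one of the two goals, the evaluator reports the missing one, and the second round completes it, so the loop stops after two evaluations. Passing the components in as callables keeps the loop independent of any particular model or editing tool.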


Related papers