Fugu-MT paper translation (abstract): InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Paper overview: InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

arxiv url: http://arxiv.org/abs/2603.01586v1
Date: Mon, 02 Mar 2026 08:13:16 GMT
Status: translation complete
Last updated in system: 2026-03-03 19:50:56.757264
Title: InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Title (reference translation): InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Authors: Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo
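Abstract summary: We propose a text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.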
Reference score (independently computed attention level): 60.799998743918955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
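Abstract (reference translation): Emerging unified editing models have shown strong capabilities in general object editing tasks. However, fine-grained editing in complex multi-entity scenes remains a significant challenge, particularly in scenes where the targets are not visually salient and spatial reasoning is required. To address this, we propose InterCoG, a text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details, explicitly deducing the location and identity of the editing target. It then performs visual grounding by highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcome. To further facilitate this paradigm, we propose two auxiliary training modules, multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment, which enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset of 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments support the superiority of our approach for highly precise edits in spatially intricate, multi-entity scenes.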
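Illustrative sketch (not from the paper): the abstract describes three interleaved steps, namely text-only position reasoning, visual grounding with a bounding box and mask, and a rewritten editing description. The Python below only shows how one such trace might be represented as data. The names `GroundingChain`, `box_to_mask`, and `build_chain` are hypothetical, and the mask is simply rasterized from the box as a stand-in; in InterCoG the boxes and masks are generated by the model itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class GroundingChain:
    """Hypothetical container for one interleaved reasoning trace."""
    position_reasoning: str      # step 1: text-only spatial reasoning
    box: Box                     # step 2: bounding box of the edit target
    mask: List[List[int]]        # step 2: binary mask in pixel space
    rewritten_instruction: str   # step 3: rewritten editing description


def box_to_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Rasterize a box into a binary mask, clipping it to the image bounds."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def build_chain(reasoning: str, box: Box, instruction: str,
                width: int, height: int) -> GroundingChain:
    """Assemble the three steps into a single trace (box-derived mask is a stand-in)."""
    mask = box_to_mask(box, width, height)
    return GroundingChain(reasoning, box, mask, instruction)


if __name__ == "__main__":
    chain = build_chain(
        reasoning="The target is the second cup from the left on the shelf.",
        box=(1, 1, 3, 2),
        instruction="Recolor the second cup from the left to red.",
        width=4,
        height=3,
    )
    for row in chain.mask:
        print(row)
```

Keeping the reasoning text, the grounding outputs, and the rewritten instruction in one record mirrors the order of the steps in the abstract, but the actual training format used by the authors is not specified on this page.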


Related papers