Fugu-MT Paper Translation (Summary): Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

Paper summary: Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

arxiv url: http://arxiv.org/abs/2509.01986v1
Date: Tue, 02 Sep 2025 06:06:52 GMT
Status: Translation complete
System update date: 2025-09-04 15:17:03.921162
Title: Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
Title (reference translation): Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
Authors: Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou,
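Abstract summary: The paper introduces Draw-In-Mind (DIM), a dataset comprising two complementary subsets: DIM-T2I, and DIM-Edit, which consists of 233K chain-of-thought imaginations generated by GPT-4o that serve as explicit design blueprints for image edits. DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
Reference score (independently computed attention score): 53.197392152109636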
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be available at https://github.com/showlab/DIM.
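Abstract (reference translation): In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. This approach delivers strong results in text-to-image (T2I) generation but still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module functions mainly as a translator that encodes user instructions into semantic conditions, while the generation module must act as designer and painter at once, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive, because the understanding module is typically trained on several times more data from complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o that serve as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B to a trainable SANA1.5-1.6B via a lightweight two-layer MLP and train it on the proposed DIM dataset, yielding DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These results indicate that explicitly assigning the design responsibility to the understanding module benefits image editing. Our dataset and models will be available at https://github.com/showlab/DIM.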
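
The abstract describes the connector between the understanding and generation modules only as a "lightweight two-layer MLP". The following is a minimal pure-Python sketch of what such a connector could look like, not the authors' implementation. The class names `Linear` and `TwoLayerConnector`, the GELU activation, the initialization, and the toy dimensions are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class Linear:
    """Dense layer y = W x + b, stored as plain Python lists."""

    def __init__(self, in_dim, out_dim, rng):
        scale = 1.0 / math.sqrt(in_dim)
        self.weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, vec):
        return [sum(w * v for w, v in zip(row, vec)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


class TwoLayerConnector:
    """Hypothetical two-layer MLP that maps per-token hidden states of an
    understanding model into the conditioning space of a generator."""

    def __init__(self, und_dim, hidden_dim, gen_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = Linear(und_dim, hidden_dim, rng)
        self.fc2 = Linear(hidden_dim, gen_dim, rng)

    def __call__(self, token_states):
        # token_states: one vector of length und_dim per token
        return [self.fc2([gelu(h) for h in self.fc1(tok)]) for tok in token_states]

    def num_parameters(self):
        return self.fc1.num_parameters() + self.fc2.num_parameters()


# Toy dimensions for illustration only; the real model widths are not given in the abstract.
connector = TwoLayerConnector(und_dim=8, hidden_dim=16, gen_dim=4)
tokens = [[0.1] * 8 for _ in range(3)]   # stand-in for 3 token states
conditions = connector(tokens)           # 3 vectors of length 4
```

In the setup the abstract describes, the understanding model stays frozen, so a connector of this kind would presumably be trained together with the generator while the understanding model's weights are left untouched.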

Related papers