Fugu-MT Paper Translation (Abstract): Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Paper overview: Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

arxiv url: http://arxiv.org/abs/2508.09131v1
Date: Tue, 12 Aug 2025 17:57:04 GMT
Status: Translation complete
Last updated in system: 2025-08-13 21:07:34.538011
Title: Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Title (reference translation): Training-Free Text-Guided Color Editing Using a Multi-Modal Diffusion Transformer
Authors: Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Heung-Yeung Shum
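Abstract summary: The authors present ColorCtrl, a training-free color editing method. By disentangling structure and color through targeted manipulation of attention maps and value tokens, it enables accurate and consistent color editing. The method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency.
Reference score (independently computed attention score): 39.69251226828484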
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
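Abstract (reference translation): Text-guided color editing in images and videos is a fundamental yet unsolved problem. It requires fine-grained manipulation of color attributes such as albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods apply broadly across editing tasks, but they struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. This work proposes ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, it enables accurate and consistent color editing, along with word-level control of attribute intensity. The method modifies only the intended regions specified by the prompt and leaves unrelated areas untouched. Extensive experiments on SD3 and FLUX.1-dev show that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. It also surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models such as CogVideoX, the approach shows even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, the method generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.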

Related papers