Fugu-MT Paper Translation (Abstract): Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

Paper overview: Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

arXiv URL: http://arxiv.org/abs/2602.18022v1
Date: Fri, 20 Feb 2026 06:24:20 GMT
Status: Translation complete
System update date: 2026-02-23 18:01:41.250908
Title: Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
Authors: Guandong Li, Mengxia Ye
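Abstract summary: Existing attention manipulation methods focus exclusively on the Key space, which modulates attention routing. This paper proposes Dual-Channel Attention Guidance (DCAG), which manipulates the Key and Value channels simultaneously. DCAG consistently outperforms Key-only guidance across all fidelity metrics.
Reference score (independently computed attention measure): 10.474377498273205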
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
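
The coarse-versus-fine distinction between the two channels described in the abstract can be illustrated with a small sketch. The abstract does not give the exact DCAG update rule, so everything below is an assumption made for illustration: the bias vector is estimated as the mean over tokens, each channel is rescaled as `bias + delta * (x - bias)`, and the helper names (`rescale_deltas`, `dual_channel_attend`) are hypothetical. It is a toy single-query attention in pure Python, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(rows):
    # Assumption: the "bias" is approximated by the mean over tokens.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def rescale_deltas(rows, bias, scale):
    # Hypothetical guidance rule: bias + scale * (row - bias) for every row.
    return [[b + scale * (x - b) for x, b in zip(row, bias)] for row in rows]

def attend(query, keys, values):
    # Single-query scaled dot-product attention; returns (weights, output).
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(logits)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

def dual_channel_attend(query, keys, values, delta_k=1.0, delta_v=1.0):
    # delta_k rescales Key deltas (changes where attention goes, via softmax);
    # delta_v rescales Value deltas (changes what is aggregated, linearly).
    k_bias = mean_vector(keys)
    v_bias = mean_vector(values)
    scaled_keys = rescale_deltas(keys, k_bias, delta_k)
    scaled_values = rescale_deltas(values, v_bias, delta_v)
    return attend(query, scaled_keys, scaled_values)

# Toy data: three tokens clustered around a common vector.
query = [1.0, 0.5]
keys = [[2.0, 1.0], [2.2, 0.9], [1.8, 1.3]]
values = [[0.5, 1.0], [0.7, 0.8], [0.4, 1.1]]

base_w, base_out = dual_channel_attend(query, keys, values)
key_w, key_out = dual_channel_attend(query, keys, values, delta_k=2.0)
val_w, val_out = dual_channel_attend(query, keys, values, delta_v=2.0)

print("baseline weights:", base_w)
print("delta_k=2 weights:", key_w)   # weights change nonlinearly
print("delta_v=2 weights:", val_w)   # weights unchanged
print("delta_v=2 output:", val_out)  # output delta scaled linearly
```

Under these assumptions, the shared Key bias adds the same constant to every logit and cancels inside the softmax, so `delta_k` behaves like an inverse temperature on the attention weights. Because the weights sum to one, `delta_v` scales the output's deviation from the Value bias exactly linearly and leaves the weights untouched.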
