Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
- URL: http://arxiv.org/abs/2602.18022v2
- Date: Wed, 25 Feb 2026 15:33:35 GMT
- Title: Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
- Authors: Guandong Li
- Abstract summary: Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing. We propose Dual-Channel Attention Guidance (DCAG) to simultaneously manipulate both the Key channel and the Value channel. DCAG consistently outperforms Key-only guidance across all fidelity metrics.
- Score: 11.772150619675527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(\delta_k, \delta_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
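The abstract does not spell out the update rule, but the bias-delta structure it describes suggests a simple sketch: estimate each layer's bias as the mean token embedding, then rescale the per-token deviations of the Key and Value projections independently. The following hypothetical PyTorch snippet illustrates this reading; the function name and the mean-based bias estimate are assumptions, not the paper's implementation.

```python
import torch

def dcag_guidance(k: torch.Tensor, v: torch.Tensor,
                  delta_k: float = 1.0, delta_v: float = 1.0):
    """Rescale per-token deviations of K and V around a layer bias.

    k, v: (batch, tokens, dim) Key/Value projections from a DiT
          multi-modal attention layer.
    delta_k, delta_v: guidance scales; 1.0 leaves the layer unchanged.
    """
    # Estimate the layer-specific bias as the mean token embedding
    # (the abstract reports tokens cluster tightly around this vector).
    b_k = k.mean(dim=1, keepdim=True)
    b_v = v.mean(dim=1, keepdim=True)
    # Scale only the delta (deviation) component of each token.
    k_out = b_k + delta_k * (k - b_k)
    v_out = b_v + delta_v * (v - b_v)
    return k_out, v_out
```

Under the paper's analysis, delta_k acts coarsely because it passes through the softmax, while delta_v acts linearly on the aggregated features, so sweeping the pair trades editing strength against fidelity.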
Related papers
- ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision [62.41380823195191]
We propose Attention-Conditional Diffusion (ACD), a framework for direct conditional control in video diffusion models via attention supervision. ACD achieves better controllability by aligning the model's attention maps with external control signals. Experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs.
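As a rough illustration of attention supervision (a generic sketch inferred from this summary, not ACD's actual objective), one could penalize the distance between head-averaged attention maps and an external control signal:

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_maps: torch.Tensor,
                               control_mask: torch.Tensor) -> torch.Tensor:
    """Align averaged cross-attention maps with an external control signal.

    attn_maps:    (batch, heads, queries, keys) attention probabilities.
    control_mask: (batch, queries, keys) target alignment in [0, 1].
    """
    # Average over heads, then penalize deviation from the control signal.
    mean_attn = attn_maps.mean(dim=1)
    return F.mse_loss(mean_attn, control_mask)
```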
arXiv Detail & Related papers (2025-12-24T16:24:18Z) - Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity [35.95129874095729]
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity.
arXiv Detail & Related papers (2025-10-02T17:59:58Z) - Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation [8.912201177914858]
We propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), a novel Trunk-Collateral structure for motion-appearance modeling in unsupervised video object segmentation (UVOS). SMTC-Net achieves state-of-the-art performance on three UVOS datasets.
arXiv Detail & Related papers (2025-04-08T11:02:14Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes the noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
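A minimal sketch of this frequency-space scaling, assuming an FFT-based split of the noise difference (e.g., the guidance delta between conditional and unconditional predictions) with a box low-pass mask; the cutoff and scale parameters are illustrative, not FreSca's actual design:

```python
import torch

def frequency_scale(noise_diff: torch.Tensor,
                    low_scale: float = 1.0,
                    high_scale: float = 1.0,
                    cutoff: int = 8) -> torch.Tensor:
    """Scale low/high-frequency parts of a noise difference independently.

    noise_diff: (batch, channels, h, w), e.g. cond minus uncond prediction.
    cutoff: radius in frequency bins separating low from high frequencies
            (should not exceed half the spatial size).
    """
    freq = torch.fft.fftshift(torch.fft.fft2(noise_diff), dim=(-2, -1))
    _, _, h, w = freq.shape
    # Centered box mask selecting the low-frequency band.
    mask = torch.zeros(h, w, device=noise_diff.device)
    mask[h // 2 - cutoff:h // 2 + cutoff, w // 2 - cutoff:w // 2 + cutoff] = 1.0
    scaled = freq * (low_scale * mask + high_scale * (1.0 - mask))
    out = torch.fft.ifft2(torch.fft.ifftshift(scaled, dim=(-2, -1)))
    return out.real
```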
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event framework that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers [11.003945673813488]
The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation. We propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl. Our approach achieves superior performance with only 15% of the parameters and computational complexity of PixArt-delta.
arXiv Detail & Related papers (2025-02-20T09:10:05Z) - Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy [16.62770246342126]
Deep learning methods have demonstrated high accuracy and robustness in guidewire segmentation, but they require substantial datasets for generalizability. We propose the Frame-consistency Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos.
arXiv Detail & Related papers (2024-12-20T16:52:11Z) - Joint Channel Estimation and Feedback with Masked Token Transformers in Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder-based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z) - ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z) - Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model outperforms prior methods in accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2%.
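The gate-level substitution described above is concrete enough to sketch. Below is a minimal PyTorch rendering, assuming a single fused convolution produces all four gates; class names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: depthwise then pointwise."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SepConvLSTMCell(nn.Module):
    """ConvLSTM cell whose gate convolutions are depthwise separable."""
    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        # One separable conv produces all four gates at once.
        self.gates = SeparableConv2d(in_ch + hidden_ch, 4 * hidden_ch)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```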
arXiv Detail & Related papers (2021-02-21T12:01:48Z) - Operation-Aware Soft Channel Pruning using Differentiable Masks [51.04085547997066]
We propose a data-driven algorithm that compresses deep neural networks in a differentiable way by exploiting the characteristics of operations.
We perform extensive experiments and achieve outstanding accuracy with the compressed output networks.
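As a hedged illustration of the differentiable-mask idea (a generic sketch, not the paper's operation-aware formulation), a per-channel sigmoid gate can make channel selection differentiable:

```python
import torch
import torch.nn as nn

class SoftChannelMask(nn.Module):
    """Learnable soft gate over output channels for differentiable pruning."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid gives a soft (0, 1) mask over the channel dimension.
        mask = torch.sigmoid(self.logits).view(1, -1, 1, 1)
        return x * mask
```

During training, an L1 penalty on the sigmoid of the logits would push gates toward zero, after which near-zero channels can be pruned outright.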
arXiv Detail & Related papers (2020-07-08T07:44:00Z)