ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
- URL: http://arxiv.org/abs/2512.21268v1
- Date: Wed, 24 Dec 2025 16:24:18 GMT
- Title: ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
- Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
- Abstract summary: We propose Attention-Conditional Diffusion (ACD), a framework for direct conditional control in video diffusion models via attention supervision. ACD achieves better controllability by aligning the model's attention maps with external control signals. Experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs.
- Score: 62.41380823195191
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
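The abstract's central mechanism is supervising the denoiser's attention maps so that they agree with an external control signal, rather than steering only through guidance at sampling time. As a rough sketch of what such an auxiliary objective could look like, the PyTorch snippet below penalizes attention mass that falls outside a per-object layout mask; the tensor shapes, the mask-based target, and the function name attention_supervision_loss are illustrative assumptions, not the paper's released implementation or its Layout ControlNet.

```python
import torch

def attention_supervision_loss(attn_maps, layout_masks, eps=1e-6):
    """Illustrative attention-supervision objective (a sketch, not the authors' code).

    attn_maps:    (B, heads, T, H, W, N) cross-attention from video latents to N
                  conditioned object tokens, assumed non-negative.
    layout_masks: (B, T, H, W, N) binary masks marking where each object should appear,
                  e.g. rasterized from a sparse 3D-aware layout.
    Returns a scalar loss that is small when attention concentrates inside the masks.
    """
    # Average over heads so the supervision target is head-agnostic.
    attn = attn_maps.mean(dim=1)                              # (B, T, H, W, N)
    # Renormalize attention over the spatial grid, per frame and per object token.
    attn = attn / (attn.sum(dim=(2, 3), keepdim=True) + eps)
    # Fraction of attention mass that lands inside the desired layout region.
    inside = (attn * layout_masks).sum(dim=(2, 3))            # (B, T, N)
    # Penalize whatever mass leaks outside the region.
    return (1.0 - inside).mean()

# Toy usage: batch 2, 8 heads, 4 frames, a 16x16 latent grid, 3 object tokens.
attn = torch.rand(2, 8, 4, 16, 16, 3)
masks = (torch.rand(2, 4, 16, 16, 3) > 0.7).float()
print(attention_supervision_loss(attn, masks))
```

In training, a term like this would typically be added to the usual denoising loss with a weighting coefficient, so the model is penalized directly whenever its attention drifts away from the regions specified by the conditioning layout, instead of relying on a classifier score that can be gamed.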
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z)
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
- $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion [65.77755100137728]
We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
arXiv Detail & Related papers (2025-11-26T16:14:20Z)
- ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
- DivControl: Knowledge Diversion for Controllable Image Generation [38.166949036830886]
DivControl is a decomposable pretraining framework for unified controllable generation. It achieves state-of-the-art controllability with 36.4$\times$ less training cost. It also delivers strong zero-shot and few-shot performance on unseen conditions.
arXiv Detail & Related papers (2025-07-31T15:00:15Z)
- SCALAR: Scale-wise Controllable Visual Autoregressive Learning [15.775596699630633]
We present SCALAR, a controllable generation method based on Visual Autoregressive (VAR) modeling. We leverage a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model.
arXiv Detail & Related papers (2025-07-26T13:23:08Z)
- Enabling Versatile Controls for Video Diffusion Models [18.131652071161266]
VCtrl is a novel framework designed to enable fine control over pre-trained video diffusion models. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality.
arXiv Detail & Related papers (2025-03-21T09:48:00Z)
- Constraint Guided AutoEncoders for Joint Optimization of Condition Indicator Estimation and Anomaly Detection in Machine Condition Monitoring [0.0]
This work proposes an extension to Constraint Guided AutoEncoders (CGAE) that enables a single model to be used for both anomaly detection (AD) and condition indicator (CI) estimation.
To improve CI estimation, the extension incorporates a constraint that enforces monotonically increasing CI predictions over time.
Experimental results indicate that the proposed algorithm performs similarly to, or slightly better than, CGAE with regard to AD, while improving the monotonic behavior of the CI.
arXiv Detail & Related papers (2024-09-18T08:48:54Z)
- ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce a Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without fine-tuning, which reduces the per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.