In-Context Audio Control of Video Diffusion Transformers
- URL: http://arxiv.org/abs/2512.18772v1
- Date: Sun, 21 Dec 2025 15:22:28 GMT
- Title: In-Context Audio Control of Video Diffusion Transformers
- Authors: Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue
- Abstract summary: This paper introduces In-Context Audio Control of video diffusion transformers (ICAC). We investigate the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
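The abstract describes the Masked 3D Attention mechanism only at a high level. As a rough illustration, the sketch below builds one plausible temporal-alignment mask for a joint [video | audio] token sequence in unified 3D self-attention: video tokens attend to all video tokens, while audio-video and audio-audio attention are restricted to temporally aligned frames. All names, shapes, and the windowing rule are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of a Masked 3D Attention mask for unified audio-video
# self-attention, inspired by the ICAC abstract. The token layout, the window
# parameter, and the exact masking rule are assumptions, not the paper's code.
import torch

def build_masked_3d_attention_mask(
    num_frames: int,
    spatial_tokens: int,  # video tokens per frame (e.g. latent patches)
    audio_tokens: int,    # audio tokens per frame (e.g. speech features)
    window: int = 0,      # extra frames of audio context on each side
) -> torch.Tensor:
    """Boolean mask (True = attention allowed) over the concatenated
    [video | audio] token sequence of a unified 3D self-attention layer."""
    n_vid = num_frames * spatial_tokens
    n_aud = num_frames * audio_tokens
    mask = torch.zeros(n_vid + n_aud, n_vid + n_aud, dtype=torch.bool)

    # Video tokens attend freely to all video tokens (full 3D attention).
    mask[:n_vid, :n_vid] = True

    # Frame index of every video and audio token in the joint sequence.
    vid_frame = torch.arange(n_vid) // spatial_tokens
    aud_frame = torch.arange(n_aud) // audio_tokens

    # Audio <-> video attention is allowed only between temporally aligned
    # frames (within +/- window), enforcing the temporal alignment constraint.
    aligned = (vid_frame[:, None] - aud_frame[None, :]).abs() <= window
    mask[:n_vid, n_vid:] = aligned       # video queries -> audio keys
    mask[n_vid:, :n_vid] = aligned.T     # audio queries -> video keys

    # Audio tokens likewise attend only to audio of nearby frames.
    mask[n_vid:, n_vid:] = (aud_frame[:, None] - aud_frame[None, :]).abs() <= window
    return mask

# Usage: pass the mask to scaled_dot_product_attention as `attn_mask`.
# mask = build_masked_3d_attention_mask(num_frames=16, spatial_tokens=256, audio_tokens=4)
# out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

With `window=0`, audio tokens interact only with tokens of their own frame, which matches the strict alignment the abstract credits for stable training; a larger window would relax the constraint.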
Related papers
- DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation
We propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM). DCDM decomposes video consistency modeling into three dedicated components while sharing a unified video generation backbone. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
arXiv Detail & Related papers (2026-02-14T07:02:36Z)
- DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
We propose a unified framework for controllable human-centric audio-video generation. DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency. We will release our code to bridge the gap between academic research and commercial-grade applications.
arXiv Detail & Related papers (2026-02-12T16:41:52Z)
- Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Harmony is a novel framework that mechanistically enforces audio-visual synchronization. It establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
arXiv Detail & Related papers (2025-11-26T16:53:05Z)
- StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
Speech-driven 3D facial animation aims to generate realistic facial motions synchronized with speech input. Recent methods have employed audio-conditioned diffusion models for this task. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, ensuring both modalities are aligned with the text. Two critical challenges remain unaddressed: (1) a single shared caption, in which the text for the video equals the text for the audio, often creates modal interference, and (2) the optimal mechanism for cross-modal feature interaction remains unclear.
arXiv Detail & Related papers (2025-10-03T15:43:56Z)
- UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching-of-experts (SoE) technique.
arXiv Detail & Related papers (2025-09-07T17:55:03Z)
- SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers
We present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs.
arXiv Detail & Related papers (2025-06-01T04:27:13Z)
- Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation
Inherently, the temporal relationship between adjacent audio clips is highly correlated with that of the corresponding adjacent video frames. We learn audio-visual correlations and integrate them to enhance feature representation and regularize the final generation.
arXiv Detail & Related papers (2025-04-08T07:23:28Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis
We propose a novel framework that integrates Gaussian Splatting with a structured Audio Factorization Plane (Audio-Plane) to enable high-quality, audio-synchronized, real-time talking head generation. Our method achieves state-of-the-art visual quality, precise audio-lip synchronization, and real-time performance, outperforming prior approaches across both 2D- and 3D-based paradigms.
arXiv Detail & Related papers (2025-03-28T16:50:27Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach for generating talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three temporal dynamics in video data.
We achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks in noise-dominant settings.
arXiv Detail & Related papers (2024-07-04T01:25:20Z)