Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
- URL: http://arxiv.org/abs/2511.21579v2
- Date: Fri, 28 Nov 2025 12:25:17 GMT
- Title: Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
- Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
- Abstract summary: Harmony is a novel framework that mechanistically enforces audio-visual synchronization. It establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
- Score: 39.04292189640444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, and open-source models in particular struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
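The abstract does not spell out SyncCFG's formula, but its description (isolating and amplifying the alignment signal at inference) suggests a guidance rule roughly like the sketch below. The three-term decomposition, argument names, and weights are assumptions for illustration, not the authors' equations.

```python
import torch

def sync_cfg(eps_uncond, eps_text, eps_text_sync, w_text=7.5, w_sync=2.0):
    """Hypothetical Synchronization-Enhanced CFG combination (a sketch,
    not the paper's exact rule). The (eps_text_sync - eps_text) residual
    isolates the cross-modal alignment signal so it can be amplified
    separately from ordinary text conditionality.

    eps_uncond:    denoiser output with all conditions dropped
    eps_text:      output conditioned on text only
    eps_text_sync: output conditioned on text plus the paired modality
                   (e.g. the concurrently generated audio/video latent)
    """
    text_term = eps_text - eps_uncond       # standard CFG direction
    sync_term = eps_text_sync - eps_text    # isolated alignment direction
    return eps_uncond + w_text * text_term + w_sync * sync_term

# toy usage with random denoiser outputs of identical shape
e_u, e_t, e_ts = (torch.randn(1, 4, 8, 8) for _ in range(3))
guided = sync_cfg(e_u, e_t, e_ts)
```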
Related papers
- In-Context Audio Control of Video Diffusion Transformers [28.911323185865186]
This paper introduces In-Context Audio Control of video diffusion transformers (ICAC). We investigate the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance.
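The summary only names the mechanism, so the following is a hypothetical sketch of what a temporal attention mask of this kind can look like: video tokens are allowed to attend to audio tokens only within a small time window. The function name, token rates, and window size are illustrative assumptions, not ICAC's implementation.

```python
import torch

def temporal_window_mask(n_video, n_audio, fps=24.0,
                         audio_rate=50.0, window=0.5):
    """Hypothetical temporal attention mask in the spirit of Masked 3D
    Attention: video token i may attend to audio token j only if their
    timestamps are within `window` seconds of each other.
    """
    t_video = torch.arange(n_video) / fps         # video token times (s)
    t_audio = torch.arange(n_audio) / audio_rate  # audio token times (s)
    # True where attention is permitted; usable as a boolean attention mask
    return (t_video[:, None] - t_audio[None, :]).abs() <= window

mask = temporal_window_mask(48, 100)  # ~2 s of video vs ~2 s of audio tokens
```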
arXiv Detail & Related papers (2025-12-21T15:22:28Z)
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions [34.27531187147479]
UniAVGen is a unified framework for joint audio and video generation. UniAVGen delivers overall advantages in audio-video synchronization, timbre, and emotion consistency.
arXiv Detail & Related papers (2025-11-05T10:06:51Z)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction [28.20791917022439]
This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation. It aims to generate a video with synchronized audio from text conditions, ensuring both modalities are aligned with the text. Two critical challenges remain unaddressed: (1) a single, shared text caption, in which the same text conditions both the video and the audio, often creates modal interference, and (2) the optimal mechanism for cross-modal feature interaction remains unclear.
arXiv Detail & Related papers (2025-10-03T15:43:56Z)
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
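As a rough illustration of why latent compression shrinks the denoising workload, the toy encoder below downsamples a clip 4x in time and 8x per spatial axis before flattening it into tokens. The architecture, channel counts, and compression rates are assumptions, not READ's actual VAE.

```python
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Toy sketch of a compressed video latent space: a VAE-style encoder
    downsamples a clip in time and space, shrinking the token count the
    diffusion transformer must denoise."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # stride (4, 8, 8): 4x temporal and 8x spatial downsampling
        self.down = nn.Conv3d(in_ch, latent_ch,
                              kernel_size=(4, 8, 8), stride=(4, 8, 8))

    def forward(self, video):                # video: (B, 3, T, H, W)
        z = self.down(video)                 # (B, 8, T/4, H/8, W/8)
        return z.flatten(2).transpose(1, 2)  # (B, num_tokens, 8)

enc = TinyVideoEncoder()
clip = torch.randn(1, 3, 64, 256, 256)       # 64 frames at 256x256
tokens = enc(clip)
print(tokens.shape[1])  # 16*32*32 = 16384 tokens for the whole clip
```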
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model. MirrorMe compresses video spatially and temporally for efficient latent-space denoising. Experiments on the EMTD benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z)
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [71.90109867684025]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
- Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling [81.37449968164692]
We propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video. Our approach combines two complementary sampling strategies, which ensure seamless local transitions and enforce global coherence. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence.
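SynCoS's exact coupling of its two sampling strategies is not described here; the sketch below shows only the generic idea of synchronizing denoising across overlapping temporal windows by averaging their predictions where they overlap (a MultiDiffusion-style fusion). Helper names, shapes, and window sizes are assumptions.

```python
import torch

def fuse_windowed_denoising(denoise_fn, latents, window=16, stride=8):
    """Generic overlapping-window fusion: every frame's update is the
    average of the predictions from all windows covering it. A sketch of
    the general idea only; SynCoS's coupled sampling is richer.

    latents: (T, C) per-frame latents; denoise_fn maps a (w, C) chunk
    to a prediction of the same shape.
    """
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    counts = torch.zeros(T, 1)
    for start in range(0, T, stride):
        chunk = latents[start:start + window]
        out[start:start + chunk.shape[0]] += denoise_fn(chunk)
        counts[start:start + chunk.shape[0]] += 1
    return out / counts  # every frame is covered by at least one window

# toy usage: an identity "denoiser" over a 40-frame latent sequence
fused = fuse_windowed_denoising(lambda x: x, torch.randn(40, 8))
```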
arXiv Detail & Related papers (2025-03-11T16:43:45Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
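As a minimal sketch of what a cross-modal affinity can be, the function below computes frame-wise cosine similarities between audio and visual features and normalizes them into attention weights. The shapes and the softmax normalization are assumptions for illustration, not CaffNet's architecture.

```python
import torch
import torch.nn.functional as F

def cross_modal_affinity(audio_feats, video_feats):
    """Illustrative cross-modal affinity matrix: cosine similarity
    between every audio frame and every video frame.

    audio_feats: (T_a, D) per-frame audio embeddings
    video_feats: (T_v, D) per-frame visual embeddings
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    affinity = a @ v.T  # (T_a, T_v) cosine affinities
    # row-softmax turns affinities into attention over video frames,
    # letting each audio frame aggregate its visually matching cues
    return affinity.softmax(dim=-1)

weights = cross_modal_affinity(torch.randn(50, 128), torch.randn(25, 128))
```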
arXiv Detail & Related papers (2021-03-25T15:39:12Z)