UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
- URL: http://arxiv.org/abs/2511.03334v1
- Date: Wed, 05 Nov 2025 10:06:51 GMT
- Title: UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
- Authors: Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang
- Abstract summary: UniAVGen is a unified framework for joint audio and video generation. UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
- Score: 34.27531187147479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
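The core Asymmetric Cross-Modal Interaction is described above as bidirectional, temporally aligned cross-attention between the audio and video DiT branches. The snippet below is a minimal sketch of that idea, not the authors' implementation: the module names, tensor shapes, and the fixed time-window alignment rule are assumptions made for illustration.

```python
# Minimal sketch (not UniAVGen's code) of bidirectional, temporally aligned
# cross-attention between an audio token stream and a video token stream.
import torch
import torch.nn as nn


class TemporallyAlignedCrossAttention(nn.Module):
    """One direction of the interaction: `query` tokens attend to `context` tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, query, context, attn_mask=None):
        ctx = self.norm_kv(context)
        out, _ = self.attn(self.norm_q(query), ctx, ctx, attn_mask=attn_mask)
        return query + out  # residual connection


def temporal_alignment_mask(q_times, kv_times, window: float = 0.5):
    """Boolean mask letting each query token attend only to context tokens whose
    timestamps lie within `window` seconds (the window rule is an assumption)."""
    dist = (q_times[:, None] - kv_times[None, :]).abs()
    return dist > window  # True = blocked, per nn.MultiheadAttention convention


# Toy usage: 16 video tokens and 100 audio tokens spanning the same 4 seconds.
dim = 512
video, audio = torch.randn(1, 16, dim), torch.randn(1, 100, dim)
v_times, a_times = torch.linspace(0, 4, 16), torch.linspace(0, 4, 100)

video_from_audio = TemporallyAlignedCrossAttention(dim)
audio_from_video = TemporallyAlignedCrossAttention(dim)

video = video_from_audio(video, audio, temporal_alignment_mask(v_times, a_times))
audio = audio_from_video(audio, video, temporal_alignment_mask(a_times, v_times))
```

The interaction in the paper is asymmetric, i.e., the two directions need not share a configuration (and are further shaped by the Face-Aware Modulation module); the symmetric toy setup here is only for brevity. Similarly, Modality-Aware Classifier-Free Guidance is described as explicitly amplifying cross-modal correlation signals at inference. One common way to realize such multi-condition guidance, again an assumption rather than the paper's exact formulation, is to add a separate guidance term for the cross-modal condition on top of the usual text guidance:

```python
def modality_aware_cfg(eps_uncond, eps_text, eps_text_cross, w_text=5.0, w_cross=2.0):
    """Hypothetical multi-condition classifier-free guidance.

    eps_* are denoiser outputs under no condition, text only, and text plus the
    other modality's latents; the guidance weights are placeholder values.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_cross * (eps_text_cross - eps_text))
```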
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions.
This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG.
Our model achieves state-of-the-art performance with only around 1M public training entries.
arXiv Detail & Related papers (2026-02-22T12:44:28Z)
- Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A) and joint video-text-to-audio (VT2A) generation offers significant application flexibility.
SoundAtlas is a large-scale dataset (470k pairs) whose quality significantly surpasses that of existing benchmarks and even human-expert annotations.
We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z)
- Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy [39.04292189640444]
Harmony is a novel framework that mechanistically enforces audio-visual synchronization.
It establishes a new state of the art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
arXiv Detail & Related papers (2025-11-26T16:53:05Z)
- ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation [15.636132687296788]
ProAV-DiT is a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation.
At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space.
Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
arXiv Detail & Related papers (2025-11-15T07:24:17Z)
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.
Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
- AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation [24.799628787198397]
AudioGen-Omni generates high-fidelity audio, speech, and song coherently synchronized with the input video.
A joint training paradigm integrates large-scale video-text-audio corpora.
Dense frame-level representations are fused using an AdaLN-based joint attention mechanism (an illustrative sketch of this pattern is given after this list).
With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
arXiv Detail & Related papers (2025-08-01T16:03:57Z)
- MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model.
MirrorMe compresses video spatially and temporally for efficient latent-space denoising.
Experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z)
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [71.90109867684025]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.
We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.
AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
- DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap [38.5017989456818]
DiffGAP is a novel approach incorporating a lightweight generative module within the contrastive space.
Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks.
arXiv Detail & Related papers (2025-03-15T13:24:09Z)
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
We propose a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation.
The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models.
arXiv Detail & Related papers (2024-12-19T18:57:21Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
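Among the related works listed above, AudioGen-Omni describes fusing dense frame-level representations with an AdaLN-based joint attention mechanism. The sketch below illustrates the generic AdaLN pattern (condition-dependent scale, shift, and gate around self-attention over the concatenated token streams); it is an assumption-laden illustration, not code from that paper, and the names, shapes, and fusion layout are invented for the example.

```python
# Illustrative AdaLN-style joint attention over concatenated audio/video/text
# tokens, conditioned on a global embedding (e.g. pooled timestep/text features).
import torch
import torch.nn as nn


class AdaLNJointAttention(nn.Module):
    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine params come from the condition
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)  # predicts shift, scale, gate

    def forward(self, tokens, cond):
        shift, scale, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(tokens) * (1 + scale) + shift   # adaptive layer norm
        h, _ = self.attn(h, h, h)                     # joint self-attention over all modalities
        return tokens + gate * h                      # gated residual


# Toy usage: concatenate per-frame audio, video, and text tokens along the sequence axis.
dim, cond_dim = 512, 256
audio, video, text = torch.randn(1, 100, dim), torch.randn(1, 16, dim), torch.randn(1, 20, dim)
cond = torch.randn(1, cond_dim)
block = AdaLNJointAttention(dim, cond_dim)
fused = block(torch.cat([audio, video, text], dim=1), cond)  # (1, 136, 512)
```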