DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
- URL: http://arxiv.org/abs/2602.12160v1
- Date: Thu, 12 Feb 2026 16:41:52 GMT
- Title: DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
- Authors: Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, Xiangwang Hou
- Abstract summary: We propose a unified framework for controllable human-centric audio-video generation. DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency. We will release our code to bridge the gap between academic research and commercial-grade applications.
- Score: 23.171175300622675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
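The abstract names Synchronized RoPE as the signal-level mechanism that binds each character's identity to its voice timbre, but gives no implementation details here. The sketch below illustrates one plausible reading of the idea: audio and video tokens belonging to the same character share rotary position indices, so their queries and keys rotate in the same frame and attend to each other preferentially. All function names, the speaker-indexed position layout, and the tensor shapes are assumptions for illustration, not DreamID-Omni's actual design.

```python
# Illustrative sketch only: rotary embeddings whose position indices are shared
# ("synchronized") between the audio and video tokens of the same character,
# so attention binds each identity to its timbre. Names, layout, and shapes
# are assumptions, not the paper's code.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for integer positions; returns (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (tokens, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

dim = 64
# Hypothetical two-character scene: which character each token belongs to.
video_speaker = torch.tensor([0, 0, 1, 1])  # 4 video tokens
audio_speaker = torch.tensor([0, 1])        # 2 audio tokens

# "Synchronized" positions: same character -> same rotary index in both streams,
# so matched query/key pairs share a rotary frame (zero relative rotation).
q_video = apply_rope(torch.randn(4, dim), rope_angles(video_speaker, dim))
k_audio = apply_rope(torch.randn(2, dim), rope_angles(audio_speaker, dim))
attn_logits = q_video @ k_audio.T
print(attn_logits.shape)  # torch.Size([4, 2])
```

Per the abstract, Structured Captions complement this signal-level binding at the semantic level by spelling out which visual and vocal attributes belong to which subject in the text prompt.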
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance with only around 1M public training entries.
arXiv Detail & Related papers (2026-02-22T12:44:28Z) - GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining [64.72014392166625]
GMS-CAVP is a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, it goes beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio.
arXiv Detail & Related papers (2026-01-27T13:43:32Z) - Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility. SoundAtlas is a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z) - In-Context Audio Control of Video Diffusion Transformers [28.911323185865186]
This paper introduces In-Context Audio Control of video diffusion transformers (ICAC). We investigate the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance.
arXiv Detail & Related papers (2025-12-21T15:22:28Z) - JoVA: Unified Multimodal Learning for Joint Video-Audio Generation [23.0536211998086]
We present JoVA, a unified framework for joint video-audio generation. JoVA employs joint self-attention across video and audio tokens within each transformer layer. To enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection (a toy sketch of one such loss appears after this list).
arXiv Detail & Related papers (2025-12-15T18:58:18Z) - UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions [34.27531187147479]
UniAVGen is a unified framework for joint audio and video generation. UniAVGen delivers overall advantages in audio-video synchronization, timbre, and emotion consistency.
arXiv Detail & Related papers (2025-11-05T10:06:51Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning [33.868900473146496]
We present HuMo, a framework for collaborative multimodal control. HuMo surpasses specialized state-of-the-art methods on individual sub-tasks.
arXiv Detail & Related papers (2025-09-10T11:54:29Z) - InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference frames to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z) - AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
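The JoVA entry above mentions a mouth-area loss based on facial keypoint detection but gives no further detail. The toy sketch below shows one common way such a term is realized, as a reconstruction loss that up-weights a mouth box derived from keypoints; the box-based weighting, the weight value, and all names are illustrative assumptions rather than JoVA's actual implementation.

```python
# Toy keypoint-weighted "mouth-area" reconstruction loss. The mouth box would
# come from a facial keypoint detector; here it is passed in directly.
import torch
import torch.nn.functional as F

def mouth_area_loss(pred: torch.Tensor, target: torch.Tensor,
                    mouth_boxes: torch.Tensor, mouth_weight: float = 5.0) -> torch.Tensor:
    """pred/target: (B, C, H, W) frames; mouth_boxes: (B, 4) pixel boxes
    (x0, y0, x1, y1) derived from facial keypoints. Pixels inside the box are
    up-weighted so lip motion dominates the objective."""
    B, _, H, W = pred.shape
    weight = torch.ones(B, 1, H, W, device=pred.device)
    for i, (x0, y0, x1, y1) in enumerate(mouth_boxes.tolist()):
        weight[i, :, y0:y1, x0:x1] = mouth_weight
    per_pixel = F.l1_loss(pred, target, reduction="none")
    return (per_pixel * weight).mean()

# Usage with random frames and fixed hypothetical mouth boxes.
pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
boxes = torch.tensor([[20, 40, 44, 56], [18, 38, 46, 58]])
print(mouth_area_loss(pred, target, boxes).item())
```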