DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
- URL: http://arxiv.org/abs/2602.12160v1
- Date: Thu, 12 Feb 2026 16:41:52 GMT
- Title: DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
- Authors: Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, Xiangwang Hou
- Abstract summary: We propose a unified framework for controllable human-centric audio-video generation. DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency. We will release our code to bridge the gap between academic research and commercial-grade applications.
- Score: 23.171175300622675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
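The abstract names Synchronized RoPE as the signal-level mechanism that binds each character's identity to its voice timbre, but gives no implementation details here. The sketch below illustrates one plausible reading of the idea: audio and video tokens belonging to the same character share rotary position indices, so their queries and keys rotate in the same frame and attend to each other preferentially. All function names, the speaker-indexed position layout, and the tensor shapes are assumptions for illustration, not DreamID-Omni's actual design.

```python
# Illustrative sketch only: rotary embeddings whose position indices are shared
# ("synchronized") between the audio and video tokens of the same character,
# so attention binds each identity to its timbre. Names, layout, and shapes
# are assumptions, not the paper's code.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for integer positions; returns (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (tokens, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

dim = 64
# Hypothetical two-character scene: which character each token belongs to.
video_speaker = torch.tensor([0, 0, 1, 1])  # 4 video tokens
audio_speaker = torch.tensor([0, 1])        # 2 audio tokens

# "Synchronized" positions: same character -> same rotary index in both streams,
# so matched query/key pairs share a rotary frame (zero relative rotation).
q_video = apply_rope(torch.randn(4, dim), rope_angles(video_speaker, dim))
k_audio = apply_rope(torch.randn(2, dim), rope_angles(audio_speaker, dim))
attn_logits = q_video @ k_audio.T
print(attn_logits.shape)  # torch.Size([4, 2])
```

Per the abstract, Structured Captions complement this signal-level binding at the semantic level by spelling out which visual and vocal attributes belong to which subject in the text prompt.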
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance with only around 1M public training entries.
arXiv Detail & Related papers (2026-02-22T12:44:28Z) - GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining [64.72014392166625]
GMS-CAVP is a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, it goes beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio.
arXiv Detail & Related papers (2026-01-27T13:43:32Z) - Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility. SoundAtlas is a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z) - In-Context Audio Control of Video Diffusion Transformers [28.911323185865186]
This paper introduces In-Context Audio Control of video diffusion transformers (ICAC). We investigate the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance.
arXiv Detail & Related papers (2025-12-21T15:22:28Z) - JoVA: Unified Multimodal Learning for Joint Video-Audio Generation [23.0536211998086]
We present JoVA, a unified framework for joint video-audio generation. JoVA employs joint self-attention across video and audio tokens within each transformer layer. To enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection (a toy sketch of one such loss appears after this list).
arXiv Detail & Related papers (2025-12-15T18:58:18Z) - UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions [34.27531187147479]
UniAVGen is a unified framework for joint audio and video generation. UniAVGen delivers overall advantages in audio-video synchronization, timbre, and emotion consistency.
arXiv Detail & Related papers (2025-11-05T10:06:51Z) - StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning [33.868900473146496]
We present HuMo, a framework for collaborative multimodal control. HuMo surpasses specialized state-of-the-art methods on individual sub-tasks.
arXiv Detail & Related papers (2025-09-10T11:54:29Z) - InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference frames to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z) - AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
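The JoVA entry above mentions a mouth-area loss based on facial keypoint detection but gives no further detail. The toy sketch below shows one common way such a term is realized, as a reconstruction loss that up-weights a mouth box derived from keypoints; the box-based weighting, the weight value, and all names are illustrative assumptions rather than JoVA's actual implementation.

```python
# Toy keypoint-weighted "mouth-area" reconstruction loss. The mouth box would
# come from a facial keypoint detector; here it is passed in directly.
import torch
import torch.nn.functional as F

def mouth_area_loss(pred: torch.Tensor, target: torch.Tensor,
                    mouth_boxes: torch.Tensor, mouth_weight: float = 5.0) -> torch.Tensor:
    """pred/target: (B, C, H, W) frames; mouth_boxes: (B, 4) pixel boxes
    (x0, y0, x1, y1) derived from facial keypoints. Pixels inside the box are
    up-weighted so lip motion dominates the objective."""
    B, _, H, W = pred.shape
    weight = torch.ones(B, 1, H, W, device=pred.device)
    for i, (x0, y0, x1, y1) in enumerate(mouth_boxes.tolist()):
        weight[i, :, y0:y1, x0:x1] = mouth_weight
    per_pixel = F.l1_loss(pred, target, reduction="none")
    return (per_pixel * weight).mean()

# Usage with random frames and fixed hypothetical mouth boxes.
pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
boxes = torch.tensor([[20, 40, 44, 56], [18, 38, 46, 58]])
print(mouth_area_loss(pred, target, boxes).item())
```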