OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
- URL: http://arxiv.org/abs/2602.12304v1
- Date: Thu, 12 Feb 2026 03:25:41 GMT
- Title: OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
- Authors: Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu
- Abstract summary: We propose sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. This novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework.
- Score: 37.45649985771698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
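The abstract's two core mechanisms, low-rank (LoRA) adapters injected into self-attention projections and a contrastive term over flows predicted with and without reference conditions, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the margin form of the contrastive term, the `scale` and `lam` weights, and all tensor shapes are assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen projection W plus a trainable low-rank update (A @ B),
    the form LoRA adapters add inside self-attention projections.
    x: (n, d_in), W: (d_in, d_out), A: (d_in, r), B: (r, d_out)."""
    return x @ W + scale * (x @ A @ B)

def mse(a, b):
    """Mean squared error between two flow fields."""
    return float(np.mean((a - b) ** 2))

def contrastive_flow_loss(v_pos, v_neg, v_target, margin=0.1):
    """Hypothetical margin form of the contrastive objective: the flow
    predicted WITH the reference image/audio (v_pos) should match the
    target flow better than the flow predicted WITHOUT references (v_neg)."""
    return max(0.0, mse(v_pos, v_target) - mse(v_neg, v_target) + margin)

def total_loss(v_pos, v_neg, v_target, lam=0.5):
    """Standard flow matching on the conditioned prediction plus the
    contrastive term (lam is an assumed balancing weight)."""
    return mse(v_pos, v_target) + lam * contrastive_flow_loss(v_pos, v_neg, v_target)
```

Initializing `B` to zero (the usual LoRA convention) leaves the base model's behavior unchanged at the start of training, while the contrastive term only penalizes the model when the reference-conditioned flow is no closer to the target than the unconditioned one.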
Related papers
- ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch. ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z) - VABench: A Comprehensive Benchmark for Audio-Video Generation [22.00633729850902]
VABench is a benchmark framework designed to evaluate the capabilities of synchronous audio-video generation. It covers three task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds.
arXiv Detail & Related papers (2025-12-10T03:57:29Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance.
arXiv Detail & Related papers (2025-06-24T16:39:39Z) - HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation [10.037480577373161]
HunyuanCustom is a customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation.
arXiv Detail & Related papers (2025-05-07T15:33:18Z) - Identity-Preserving Video Dubbing Using Motion Warping [26.10803670509977]
Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. We propose IPTalker, a framework for video dubbing that achieves seamless alignment between driving audio and reference identity. IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention.
arXiv Detail & Related papers (2025-01-08T16:06:21Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - Synthesizing Audio from Silent Video using Sequence to Sequence Modeling [0.0]
We propose a novel method to generate audio from video using a sequence-to-sequence model.
Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures.
Our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
arXiv Detail & Related papers (2024-04-25T22:19:42Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - VarietySound: Timbre-Controllable Video to Sound Generation via
Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.