Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
- URL: http://arxiv.org/abs/2510.01284v1
- Date: Tue, 30 Sep 2025 21:03:50 GMT
- Title: Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
- Authors: Chetwin Low, Weimin Wang, Calder Katyal
- Abstract summary: Ovi is a unified paradigm for audio-video generation that models the two modalities as a single generative process. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips.
- Score: 5.304004483404346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi
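The abstract describes fusion as a blockwise exchange between two architecturally identical DiT towers: timing is shared through scaled-RoPE embeddings and semantics through bidirectional cross-attention. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the module names (`TwinFusionBlock`, `scaled_rope`), head counts, and the exact placement of RoPE and normalization are assumptions for illustration, not taken from the released Ovi code.

```python
# Illustrative sketch only: one fusion block of a twin-DiT layout, with
# per-modality self-attention, RoPE whose positions are rescaled so the two
# token streams share a timeline, and bidirectional cross-attention.
import torch
import torch.nn as nn


def scaled_rope(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Apply rotary position embeddings with a per-modality time scale."""
    b, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, device=x.device).float() / half))
    pos = torch.arange(t, device=x.device).float() * scale   # rescaled timeline
    angles = torch.outer(pos, freqs)                          # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TwinFusionBlock(nn.Module):
    """Per-tower self-attention followed by bidirectional cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries audio
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries video
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, v, a, v_scale=1.0, a_scale=1.0):
        # Rescaled RoPE aligns the video and audio positions on a shared timeline
        # (applied here only for the self-attention step, as a simplification).
        v_pos, a_pos = scaled_rope(v, v_scale), scaled_rope(a, a_scale)
        v = v + self.video_self(v_pos, v_pos, v_pos)[0]
        a = a + self.audio_self(a_pos, a_pos, a_pos)[0]
        # Bidirectional cross-attention: each tower attends to the other's tokens.
        v = v + self.a2v(self.norm_v(v), a, a)[0]
        a = a + self.v2a(self.norm_a(a), v, v)[0]
        return v, a


if __name__ == "__main__":
    block = TwinFusionBlock(dim=64)
    video_tokens = torch.randn(2, 24, 64)   # e.g. 24 video-latent frames
    audio_tokens = torch.randn(2, 96, 64)   # denser audio-latent timeline
    v, a = block(video_tokens, audio_tokens, v_scale=4.0, a_scale=1.0)
    print(v.shape, a.shape)
```

The point of the per-modality scale factors is that audio and video latents typically run at different temporal rates; rescaling positions before the rotary embedding puts both streams on a comparable clock, while the cross-attention in every block lets each tower condition on the other's tokens rather than aligning the modalities after the fact.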
Related papers
- MOVA: Towards Scalable and Synchronized Video-Audio Generation [91.56945636522345]
We introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators.
arXiv Detail & Related papers (2026-02-09T15:31:54Z)
- ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch. ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z)
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion [47.70095297438178]
We introduce a single-model approach that adapts an audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. We generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
arXiv Detail & Related papers (2026-01-29T18:57:13Z)
- Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation [20.446421146630474]
We introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony.
arXiv Detail & Related papers (2025-12-02T06:31:38Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance.
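For context on the flow matching objective mentioned in this summary, here is a generic (not Kling-Foley-specific) sketch of one conditional flow-matching training step; `model`, the latent shapes, and the `cond` argument are placeholders chosen for illustration.

```python
# Generic conditional flow-matching training step: interpolate between noise
# and data along a straight path and regress the predicted velocity onto the
# constant target velocity of that path.
import torch


def flow_matching_loss(model, x1, cond):
    """x1: clean target latents (e.g. audio latents); cond: conditioning features."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight-line path
    target_velocity = x1 - x0                      # constant velocity of that path
    pred_velocity = model(xt, t, cond)             # network predicts the velocity field
    return ((pred_velocity - target_velocity) ** 2).mean()
```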
arXiv Detail & Related papers (2025-06-24T16:39:39Z)
- OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking [22.337906095079198]
We present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text. Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
- ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation [72.22243595269389]
This paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
arXiv Detail & Related papers (2025-03-10T11:57:55Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for the joint denoising process.
Experiments show superior results in unconditional audio-video generation and in zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)