JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
- URL: http://arxiv.org/abs/2601.22143v1
- Date: Thu, 29 Jan 2026 18:57:13 GMT
- Title: JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
- Authors: Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or
- Abstract summary: We introduce a single-model approach that adapts an audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. We generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
- Score: 47.70095297438178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
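As a rough illustration of the mechanism described in the abstract, the following is a minimal, self-contained PyTorch sketch of how a frozen joint audio-video denoiser might be adapted with a lightweight LoRA to condition on a source clip's latents while denoising the dubbed target. The module names, shapes, the channel-concatenation conditioning, and the flow-matching-style loss are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): adapting a frozen joint audio-video
# denoiser with a lightweight LoRA so it can condition on a source clip while
# denoising the dubbed target. Module names, shapes, the conditioning scheme,
# and the loss are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (W + B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

class ToyJointDenoiser(nn.Module):
    """Stand-in for a pretrained joint audio-video diffusion backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # The input projection sees the noisy target latent concatenated with the
        # source-clip latent, so the LoRA can learn the video-to-video conditioning.
        self.proj_in = LoRALinear(nn.Linear(2 * dim, dim))
        self.block = LoRALinear(nn.Linear(dim, dim))
        self.proj_out = LoRALinear(nn.Linear(dim, dim))

    def forward(self, noisy_latent, source_latent, t):
        h = torch.cat([noisy_latent, source_latent], dim=-1)
        h = torch.relu(self.proj_in(h)) + t[:, None, None]   # crude timestep injection
        h = torch.relu(self.block(h))
        return self.proj_out(h)                               # predicted velocity

model = ToyJointDenoiser()
trainable = [p for p in model.parameters() if p.requires_grad]   # LoRA parameters only
opt = torch.optim.AdamW(trainable, lr=1e-4)

# One illustrative training step on a synthetic paired clip (same speaker,
# two languages), standing in for the paper's self-generated training pairs.
source = torch.randn(2, 16, 64)      # latent of the clip to be dubbed
target = torch.randn(2, 16, 64)      # latent of the same speaker in the other language
noise = torch.randn_like(target)
t = torch.rand(2)                    # diffusion time in [0, 1]
x_t = (1 - t)[:, None, None] * target + t[:, None, None] * noise
v_target = noise - target            # straight-line (flow-matching-style) velocity

loss = ((model(x_t, source, t) - v_target) ** 2).mean()
loss.backward()
opt.step()
```

In the actual method, the source latents would come from the audio-video foundation model and the LoRA would be trained on the self-dubbed pairs produced by the language-switch inpainting procedure described above; this toy only shows where a low-rank adapter and the conditioning signal could plug in.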
Related papers
- ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch. ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z)
- FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes [56.534404169212785]
FunCineForge comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. We construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following.
arXiv Detail & Related papers (2026-01-21T08:57:00Z)
- LTX-2: Efficient Joint Audio-Visual Foundation Model [3.1804093402153506]
LTX-2 is an open-source model capable of generating temporally synchronized audiovisual content. We employ a multilingual text encoder for broader prompt understanding. LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene.
arXiv Detail & Related papers (2026-01-06T18:24:41Z)
- SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model [34.874153953305346]
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization. We propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model.
arXiv Detail & Related papers (2025-11-23T16:51:05Z)
- Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation [5.304004483404346]
Ovi is a unified paradigm for audio-video generation that models the two modalities as a single generative process. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips.
arXiv Detail & Related papers (2025-09-30T21:03:50Z)
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance.
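For context, the flow-matching objective mentioned here is, in its common generic form, a regression of a learned velocity field onto the straight-line path between data and noise; Kling-Foley's exact parameterization and conditioning may differ:

```latex
% Generic conditional flow-matching objective; the paper's exact formulation may differ.
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_{\mathrm{data}},\; \epsilon \sim \mathcal{N}(0, I)}
    \left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|_2^2,
  \qquad x_t = (1 - t)\, x_0 + t\, \epsilon
```

Here $x_0$ is a clean audio latent, $c$ is the video conditioning, and $v_\theta$ is the learned velocity field.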
arXiv Detail & Related papers (2025-06-24T16:39:39Z)
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models [43.1613638989795]
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
arXiv Detail & Related papers (2025-04-03T08:24:47Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and for professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation [78.57171527944774]
VideoPoet is a language model capable of synthesizing high-quality video with matching audio.
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs.
arXiv Detail & Related papers (2023-12-21T18:46:41Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process (a generic sketch of such cross-modal coupling follows this entry).
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
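To make the idea of joint denoising concrete, here is a minimal two-branch sketch in which an audio stream and a video stream are denoised together by exchanging features through cross-attention. This is a generic illustration of cross-modal coupling, not MM-Diffusion's actual architecture, and all names and shapes are hypothetical.

```python
# Generic two-branch joint denoiser: each modality attends to the other so the
# predicted noise for audio and video stays mutually consistent. Not MM-Diffusion's
# actual design; all names and shapes are hypothetical.
import torch
import torch.nn as nn

class CoupledDenoiser(nn.Module):
    def __init__(self, a_dim: int = 32, v_dim: int = 48):
        super().__init__()
        self.audio_enc = nn.Linear(a_dim, 64)
        self.video_enc = nn.Linear(v_dim, 64)
        self.a2v = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
        self.v2a = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
        self.audio_out = nn.Linear(64, a_dim)
        self.video_out = nn.Linear(64, v_dim)

    def forward(self, noisy_audio, noisy_video):
        a = self.audio_enc(noisy_audio)      # (B, Ta, 64)
        v = self.video_enc(noisy_video)      # (B, Tv, 64)
        a_ctx, _ = self.v2a(a, v, v)         # audio queries attend to video features
        v_ctx, _ = self.a2v(v, a, a)         # video queries attend to audio features
        return self.audio_out(a + a_ctx), self.video_out(v + v_ctx)

model = CoupledDenoiser()
eps_a, eps_v = model(torch.randn(2, 100, 32), torch.randn(2, 25, 48))
print(eps_a.shape, eps_v.shape)              # torch.Size([2, 100, 32]) torch.Size([2, 25, 48])
```

Coupling the branches inside the denoiser, rather than denoising each modality in isolation, is what allows the generated audio and video to remain temporally consistent.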
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.