VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
- URL: http://arxiv.org/abs/2504.02386v1
- Date: Thu, 03 Apr 2025 08:24:47 GMT
- Title: VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
- Authors: Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath
- Abstract summary: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
- Score: 43.1613638989795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
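The abstract only names the components; as a rough, non-authoritative sketch of the adapter-plus-fusion idea (facial features projected into the NCLM token-embedding space, then merged with speech-token states), the PyTorch snippet below uses cross-attention fusion. All module names, dimensions, and the choice of cross-attention are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Hypothetical adapter: maps per-frame facial features into the
    embedding space used by the neural codec language model (NCLM)."""
    def __init__(self, face_dim=512, nclm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, nclm_dim),
            nn.GELU(),
            nn.Linear(nclm_dim, nclm_dim),
        )

    def forward(self, face_feats):          # (B, T_video, face_dim)
        return self.proj(face_feats)        # (B, T_video, nclm_dim)

class AudioVisualFusion(nn.Module):
    """Hypothetical fusion layer: speech-token states attend to the
    adapted visual sequence so generated codec tokens can follow
    lip timing and facial expression."""
    def __init__(self, nclm_dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(nclm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(nclm_dim)

    def forward(self, token_states, visual_states):
        fused, _ = self.cross_attn(token_states, visual_states, visual_states)
        return self.norm(token_states + fused)   # residual connection

# Toy usage: 2 clips, 75 video frames, 300 codec-token steps.
face_feats = torch.randn(2, 75, 512)
token_states = torch.randn(2, 300, 1024)
visual = VisualAdapter()(face_feats)
out = AudioVisualFusion()(token_states, visual)
print(out.shape)  # torch.Size([2, 300, 1024])
```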
Related papers
- OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios. Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
- Long-Video Audio Synthesis with Multi-Agent Collaboration [20.332328741375363]
LVAS-Agent is a novel framework that emulates professional dubbing through collaborative roles. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment.
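A minimal, hypothetical orchestration skeleton for the four-step decomposition described above might look as follows; the function bodies are placeholders, and the retry loop is only a crude stand-in for the discussion-correction mechanism.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Scene:
    start: float
    end: float
    script: str = ""

def segment_scenes(video_path: str) -> List[Scene]:
    """Placeholder: split the long video into scenes (step 1)."""
    return [Scene(0.0, 12.5), Scene(12.5, 30.0)]

def generate_script(scene: Scene) -> str:
    """Placeholder: describe the scene's intended sound events (step 2)."""
    return "ambient street noise, distant traffic"

def design_sound(script: str) -> dict:
    """Placeholder: turn the script into concrete sound-design cues (step 3)."""
    return {"layers": script.split(", ")}

def synthesize_audio(design: dict) -> bytes:
    """Placeholder: render audio for the scene (step 4)."""
    return b""

def dub_long_video(video_path: str, max_rounds: int = 2) -> List[bytes]:
    clips = []
    for scene in segment_scenes(video_path):
        # Crude stand-in for the discussion-correction mechanism:
        # regenerate the script until it passes a (here trivial) check.
        for _ in range(max_rounds):
            scene.script = generate_script(scene)
            if scene.script:          # real system: critic-agent review
                break
        clips.append(synthesize_audio(design_sound(scene.script)))
    return clips

print(len(dub_long_video("movie.mp4")))  # 2 scene-level audio clips
```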
arXiv Detail & Related papers (2025-03-13T07:58:23Z)
- Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers [58.86974149731874]
Cosh-DiT is a co-speech gesture video synthesis system with hybrid Diffusion Transformers. We introduce an audio Diffusion Transformer to synthesize expressive gesture dynamics synchronized with speech rhythms. For realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer.
arXiv Detail & Related papers (2025-03-13T01:36:05Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images [92.13079696503803]
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
arXiv Detail & Related papers (2023-06-12T17:31:23Z)
- Fine-grained Audible Video Description [61.81122862375985]
We construct the first fine-grained audible video description benchmark (FAVDBench).
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that employing fine-grained video descriptions can create more intricate videos than using captions.
arXiv Detail & Related papers (2023-03-27T22:03:48Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
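As a hedged sketch of the three-level idea (lip, face, and scene features each contributing to prosody), the snippet below concatenates the three streams into a per-frame prosody embedding; the module layout and dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalProsodyPredictor(nn.Module):
    """Hypothetical three-level prosody predictor: lip features supply
    fine-grained timing cues, face features supply expression cues, and
    scene features set a global style; the levels are concatenated and
    mapped to a per-frame prosody embedding."""
    def __init__(self, lip_dim=256, face_dim=512, scene_dim=512, prosody_dim=128):
        super().__init__()
        self.lip_enc = nn.GRU(lip_dim, 128, batch_first=True)
        self.face_enc = nn.GRU(face_dim, 128, batch_first=True)
        self.scene_enc = nn.Linear(scene_dim, 128)   # one vector per clip
        self.out = nn.Linear(128 * 3, prosody_dim)

    def forward(self, lip, face, scene):
        lip_h, _ = self.lip_enc(lip)                 # (B, T, 128)
        face_h, _ = self.face_enc(face)              # (B, T, 128)
        scene_h = self.scene_enc(scene)              # (B, 128)
        scene_h = scene_h.unsqueeze(1).expand(-1, lip_h.size(1), -1)
        return self.out(torch.cat([lip_h, face_h, scene_h], dim=-1))

prosody = HierarchicalProsodyPredictor()(
    torch.randn(2, 75, 256), torch.randn(2, 75, 512), torch.randn(2, 512))
print(prosody.shape)  # torch.Size([2, 75, 128])
```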
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- Neural Dubber: Dubbing for Silent Videos According to Scripts [22.814626504851752]
We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
arXiv Detail & Related papers (2021-10-15T17:56:07Z)
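On the synchronization theme shared by Neural Dubber and VoiceCraft-Dub, one common-sense way to give every synthesized speech frame a visual conditioning vector is simple temporal interpolation from the video frame rate to the acoustic frame rate; the helper below is hypothetical and not taken from either paper.

```python
import torch
import torch.nn.functional as F

def align_video_to_speech(lip_feats: torch.Tensor, num_speech_frames: int) -> torch.Tensor:
    """Upsample lip features (B, T_video, D) along time so each synthesized
    speech frame has a corresponding visual conditioning vector."""
    x = lip_feats.transpose(1, 2)                       # (B, D, T_video)
    x = F.interpolate(x, size=num_speech_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)                            # (B, T_speech, D)

# 3 s clip: 75 video frames at 25 fps -> 258 acoustic frames at ~86 Hz.
lip = torch.randn(2, 75, 256)
print(align_video_to_speech(lip, 258).shape)  # torch.Size([2, 258, 256])
```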