Neural Dubber: Dubbing for Silent Videos According to Scripts
- URL: http://arxiv.org/abs/2110.08243v1
- Date: Fri, 15 Oct 2021 17:56:07 GMT
- Title: Neural Dubber: Dubbing for Silent Videos According to Scripts
- Authors: Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
- Abstract summary: We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
- Score: 22.814626504851752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which
is extensively used in filmmaking and video production. It is usually performed
manually by professional voice actors who read lines with proper prosody, and
in synchronization with the pre-recorded videos. In this work, we propose
Neural Dubber, the first neural network model to solve a novel automatic video
dubbing (AVD) task: synthesizing human speech synchronized with the given
silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS)
model that utilizes the lip movement in the video to control the prosody of the
generated speech. Furthermore, an image-based speaker embedding (ISE) module is
developed for the multi-speaker setting, which enables Neural Dubber to
generate speech with a reasonable timbre according to the speaker's face.
Experiments on the chemistry lecture single-speaker dataset and LRS2
multi-speaker dataset show that Neural Dubber can generate speech audio on par
with state-of-the-art TTS models in terms of speech quality. Most importantly,
both qualitative and quantitative evaluations show that Neural Dubber can
control the prosody of synthesized speech by the video, and generate
high-fidelity speech temporally synchronized with the video.
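
To make the described architecture concrete, below is a minimal PyTorch sketch of the two ideas the abstract highlights: lip-motion features attend to the encoded text so that the video drives the timing and prosody of the output, and an image-based speaker embedding (the ISE idea) conditions generation on the speaker's face. All module names, dimensions, and the 4:1 mel-to-video frame ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImageSpeakerEncoder(nn.Module):
    """Toy stand-in for the image-based speaker embedding (ISE): face crop -> speaker vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, face):              # face: (B, 3, H, W)
        return self.net(face)             # (B, dim)

class NeuralDubberSketch(nn.Module):
    """Text + lip frames + face image -> mel spectrogram (illustrative only)."""
    def __init__(self, n_phones=100, dim=256, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, dim)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        # Lip-motion encoder: per-frame features from mouth crops (B, T_v, 3, 64, 64).
        self.lip_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        # Video frames query the text: one fused vector per video frame, so the
        # generated mel length is tied to the video length (sync by construction).
        self.av_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ise = ImageSpeakerEncoder(dim)
        self.mel_dec = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_mels))

    def forward(self, phones, lip_frames, face):
        # phones: (B, T_p) int, lip_frames: (B, T_v, 3, 64, 64), face: (B, 3, 112, 112)
        B, T_v = lip_frames.shape[:2]
        text = self.text_enc(self.phone_emb(phones))                     # (B, T_p, dim)
        lips = self.lip_enc(lip_frames.flatten(0, 1)).view(B, T_v, -1)   # (B, T_v, dim)
        fused, _ = self.av_attn(query=lips, key=text, value=text)        # (B, T_v, dim)
        spk = self.ise(face).unsqueeze(1)                                # (B, 1, dim)
        # Upsample to mel resolution, e.g. 4 mel frames per video frame (assumed ratio).
        h = (fused + spk).repeat_interleave(4, dim=1)                    # (B, 4*T_v, dim)
        return self.mel_dec(h)                                           # (B, 4*T_v, n_mels)

model = NeuralDubberSketch()
mel = model(torch.randint(0, 100, (2, 30)),
            torch.randn(2, 25, 3, 64, 64),
            torch.randn(2, 3, 112, 112))
print(mel.shape)  # torch.Size([2, 100, 80])
```

Because each output chunk is produced per video frame, the mel length is tied to the video length, which is one simple way to obtain the temporal synchronization the abstract emphasizes.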
Related papers
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
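
As a rough illustration of the "versatile" formulation (one model, many tasks selected by the prompt), the toy sketch below feeds a task token, text tokens, and audio-codec tokens to a single causal transformer. The vocabulary layout, sizes, and task ids are invented for the example and do not reflect SpeechX's actual design.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary layout: task, text, and audio-codec tokens share one
# embedding table; real systems partition these differently.
N_TASK, N_TEXT, N_CODEC, DIM = 8, 1000, 1024, 256
VOCAB = N_TASK + N_TEXT + N_CODEC

class CodecLMSketch(nn.Module):
    """Decoder-only LM over [task | text | codec] token sequences (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                       # tokens: (B, T) int
        T = tokens.size(1)
        # Additive causal mask: -inf above the diagonal.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.emb(tokens), mask=causal)
        return self.head(h)                          # next-token logits

# Sequences for different tasks differ only in the leading task token and in
# what is placed in the prompt region before the codec tokens to be generated.
task = torch.tensor([[0]])                            # hypothetical task id
text = torch.randint(N_TASK, N_TASK + N_TEXT, (1, 12))
codec = torch.randint(N_TASK + N_TEXT, VOCAB, (1, 50))
logits = CodecLMSketch()(torch.cat([task, text, codec], dim=1))
print(logits.shape)                                   # torch.Size([1, 63, 2032])
```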
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which incorporates a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
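
A small sketch of residual vector quantization, the mechanism that yields the quantized latent vectors mentioned above (codebook count, size, and dimension are made up, and the latent diffusion model that generates these latents from text is not shown):

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Quantize a latent with a stack of codebooks; each stage encodes the residual of the last."""
    def __init__(self, n_codebooks=4, codebook_size=256, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, z):                                    # z: (B, T, dim) continuous latents
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            # Nearest codebook entry for the current residual (Euclidean distance).
            dist = torch.cdist(residual, cb.weight)          # (B, T, codebook_size)
            idx = dist.argmin(dim=-1)                        # (B, T)
            chosen = cb(idx)                                 # (B, T, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)         # latents, (B, T, n_codebooks)

rvq = ResidualVQ()
latents, codes = rvq(torch.randn(2, 100, 128))
print(latents.shape, codes.shape)   # torch.Size([2, 100, 128]) torch.Size([2, 100, 4])
```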
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
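
The pipeline reads as a cascade: ASR with emphasis detection, machine translation, TTS, and voice conversion back to the original speaker (the paper also renders a lip-synchronous video, omitted here). A schematic orchestration with placeholder stubs; every function below is a hypothetical stand-in rather than the paper's code:

```python
from dataclasses import dataclass

@dataclass
class DubResult:
    source_text: str
    translated_text: str
    dubbed_audio: bytes

# Placeholder stages; each would wrap a real model (ASR, MT, TTS, voice conversion).
def transcribe_with_emphasis(audio: bytes) -> str:
    return "hello <emph>world</emph>"          # dummy transcript with emphasis markup

def translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"           # dummy translation

def synthesize(text: str) -> bytes:
    return b"\x00" * 16000                     # dummy 1 s of silence

def convert_voice(audio: bytes, reference: bytes) -> bytes:
    return audio                               # dummy: would map timbre to the reference speaker

def dub(original_audio: bytes, target_lang: str) -> DubResult:
    text = transcribe_with_emphasis(original_audio)        # 1. ASR + emphasis detection
    translated = translate(text, target_lang)              # 2. machine translation
    tts_audio = synthesize(translated)                      # 3. TTS in the target language
    dubbed = convert_voice(tts_audio, original_audio)       # 4. back to the original voice
    return DubResult(text, translated, dubbed)

print(dub(b"\x00" * 16000, "de").translated_text)
```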
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech [9.035846000646481]
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text.
We demonstrate how this allows VDTTS to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video.
arXiv Detail & Related papers (2021-11-19T10:23:38Z)
- AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
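
In the spirit of that description, the toy setup below pairs an encoder-decoder generator that maps raw video frames to a waveform with a discriminator that scores real versus generated audio; shapes, losses, and layer choices are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToSpeechGenerator(nn.Module):
    """Encoder-decoder: video frames (B, T, 3, 48, 48) -> waveform (B, T * 640)."""
    def __init__(self, dim=128, samples_per_frame=640):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.decode = nn.Linear(dim, samples_per_frame)

    def forward(self, video):
        B, T = video.shape[:2]
        feats = self.frame_enc(video.flatten(0, 1)).view(B, T, -1)  # per-frame features
        h, _ = self.temporal(feats)                                 # temporal context
        return torch.tanh(self.decode(h)).flatten(1)                # (B, T * samples_per_frame)

class WaveDiscriminator(nn.Module):
    """Scores a waveform as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
    def forward(self, wav):                                         # wav: (B, samples)
        return self.net(wav.unsqueeze(1))                           # (B, 1) real/fake logit

G, D = VideoToSpeechGenerator(), WaveDiscriminator()
video, real = torch.randn(2, 25, 3, 48, 48), torch.randn(2, 25 * 640)
fake = G(video)
d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(2, 1)) + \
         F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(2, 1))
g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(2, 1))
print(fake.shape, d_loss.item(), g_loss.item())
```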
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
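
A minimal sketch of that formulation: video features are projected into the embedding space of a GPT-2-style language model and prefixed to the dialogue tokens, so the whole task reduces to next-token prediction. A tiny randomly initialized config stands in for the pretrained checkpoint, and the projection layer and feature sizes are assumptions:

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random-init GPT-2 stands in for the pretrained checkpoint used in the paper.
config = GPT2Config(vocab_size=1000, n_embd=128, n_layer=2, n_head=4, n_positions=256)
lm = GPT2LMHeadModel(config)
video_proj = nn.Linear(512, config.n_embd)     # project assumed 512-d video features

video_feats = torch.randn(1, 20, 512)          # 20 video segments
dialogue = torch.randint(0, 1000, (1, 30))     # dialogue history + response tokens

# Build one input sequence: [projected video features ; token embeddings].
tok_emb = lm.get_input_embeddings()(dialogue)                   # (1, 30, 128)
inputs = torch.cat([video_proj(video_feats), tok_emb], dim=1)   # (1, 50, 128)

# Only the text positions carry a language-modeling loss; video positions are ignored (-100).
labels = torch.cat([torch.full((1, 20), -100), dialogue], dim=1)
out = lm(inputs_embeds=inputs, labels=labels)
print(out.loss)
```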
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.