Neural Dubber: Dubbing for Silent Videos According to Scripts
- URL: http://arxiv.org/abs/2110.08243v1
- Date: Fri, 15 Oct 2021 17:56:07 GMT
- Title: Neural Dubber: Dubbing for Silent Videos According to Scripts
- Authors: Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
- Abstract summary: We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
- Score: 22.814626504851752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which
is extensively used in filmmaking and video production. It is usually performed
manually by professional voice actors who read lines with proper prosody, and
in synchronization with the pre-recorded videos. In this work, we propose
Neural Dubber, the first neural network model to solve a novel automatic video
dubbing (AVD) task: synthesizing human speech synchronized with the given
silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS)
model that utilizes the lip movement in the video to control the prosody of the
generated speech. Furthermore, an image-based speaker embedding (ISE) module is
developed for the multi-speaker setting, which enables Neural Dubber to
generate speech with a reasonable timbre according to the speaker's face.
Experiments on the chemistry lecture single-speaker dataset and LRS2
multi-speaker dataset show that Neural Dubber can generate speech audio on par
with state-of-the-art TTS models in terms of speech quality. Most importantly,
both qualitative and quantitative evaluations show that Neural Dubber can
control the prosody of synthesized speech by the video, and generate
high-fidelity speech temporally synchronized with the video.
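
To make the described architecture concrete, below is a minimal PyTorch sketch of the two ideas the abstract highlights: lip-motion features attend to the encoded text so that the video drives the timing and prosody of the output, and an image-based speaker embedding (the ISE idea) conditions generation on the speaker's face. All module names, dimensions, and the 4:1 mel-to-video frame ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImageSpeakerEncoder(nn.Module):
    """Toy stand-in for the image-based speaker embedding (ISE): face crop -> speaker vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, face):              # face: (B, 3, H, W)
        return self.net(face)             # (B, dim)

class NeuralDubberSketch(nn.Module):
    """Text + lip frames + face image -> mel spectrogram (illustrative only)."""
    def __init__(self, n_phones=100, dim=256, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, dim)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        # Lip-motion encoder: per-frame features from mouth crops (B, T_v, 3, 64, 64).
        self.lip_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        # Video frames query the text: one fused vector per video frame, so the
        # generated mel length is tied to the video length (sync by construction).
        self.av_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ise = ImageSpeakerEncoder(dim)
        self.mel_dec = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_mels))

    def forward(self, phones, lip_frames, face):
        # phones: (B, T_p) int, lip_frames: (B, T_v, 3, 64, 64), face: (B, 3, 112, 112)
        B, T_v = lip_frames.shape[:2]
        text = self.text_enc(self.phone_emb(phones))                     # (B, T_p, dim)
        lips = self.lip_enc(lip_frames.flatten(0, 1)).view(B, T_v, -1)   # (B, T_v, dim)
        fused, _ = self.av_attn(query=lips, key=text, value=text)        # (B, T_v, dim)
        spk = self.ise(face).unsqueeze(1)                                # (B, 1, dim)
        # Upsample to mel resolution, e.g. 4 mel frames per video frame (assumed ratio).
        h = (fused + spk).repeat_interleave(4, dim=1)                    # (B, 4*T_v, dim)
        return self.mel_dec(h)                                           # (B, 4*T_v, n_mels)

model = NeuralDubberSketch()
mel = model(torch.randint(0, 100, (2, 30)),
            torch.randn(2, 25, 3, 64, 64),
            torch.randn(2, 3, 112, 112))
print(mel.shape)  # torch.Size([2, 100, 80])
```

Because each output chunk is produced per video frame, the mel length is tied to the video length, which is one simple way to obtain the temporal synchronization the abstract emphasizes.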
Related papers
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
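
As a rough illustration of the "versatile" formulation (one model, many tasks selected by the prompt), the toy sketch below feeds a task token, text tokens, and audio-codec tokens to a single causal transformer. The vocabulary layout, sizes, and task ids are invented for the example and do not reflect SpeechX's actual design.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary layout: task, text, and audio-codec tokens share one
# embedding table; real systems partition these differently.
N_TASK, N_TEXT, N_CODEC, DIM = 8, 1000, 1024, 256
VOCAB = N_TASK + N_TEXT + N_CODEC

class CodecLMSketch(nn.Module):
    """Decoder-only LM over [task | text | codec] token sequences (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                       # tokens: (B, T) int
        T = tokens.size(1)
        # Additive causal mask: -inf above the diagonal.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.emb(tokens), mask=causal)
        return self.head(h)                          # next-token logits

# Sequences for different tasks differ only in the leading task token and in
# what is placed in the prompt region before the codec tokens to be generated.
task = torch.tensor([[0]])                            # hypothetical task id
text = torch.randint(N_TASK, N_TASK + N_TEXT, (1, 12))
codec = torch.randint(N_TASK + N_TEXT, VOCAB, (1, 50))
logits = CodecLMSketch()(torch.cat([task, text, codec], dim=1))
print(logits.shape)                                   # torch.Size([1, 63, 2032])
```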
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which incorporates a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
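
A small sketch of residual vector quantization, the mechanism that yields the quantized latent vectors mentioned above (codebook count, size, and dimension are made up, and the latent diffusion model that generates these latents from text is not shown):

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Quantize a latent with a stack of codebooks; each stage encodes the residual of the last."""
    def __init__(self, n_codebooks=4, codebook_size=256, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, z):                                    # z: (B, T, dim) continuous latents
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            # Nearest codebook entry for the current residual (Euclidean distance).
            dist = torch.cdist(residual, cb.weight)          # (B, T, codebook_size)
            idx = dist.argmin(dim=-1)                        # (B, T)
            chosen = cb(idx)                                 # (B, T, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)         # latents, (B, T, n_codebooks)

rvq = ResidualVQ()
latents, codes = rvq(torch.randn(2, 100, 128))
print(latents.shape, codes.shape)   # torch.Size([2, 100, 128]) torch.Size([2, 100, 4])
```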
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
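
The pipeline reads as a cascade: ASR with emphasis detection, machine translation, TTS, and voice conversion back to the original speaker (the paper also renders a lip-synchronous video, omitted here). A schematic orchestration with placeholder stubs; every function below is a hypothetical stand-in rather than the paper's code:

```python
from dataclasses import dataclass

@dataclass
class DubResult:
    source_text: str
    translated_text: str
    dubbed_audio: bytes

# Placeholder stages; each would wrap a real model (ASR, MT, TTS, voice conversion).
def transcribe_with_emphasis(audio: bytes) -> str:
    return "hello <emph>world</emph>"          # dummy transcript with emphasis markup

def translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"           # dummy translation

def synthesize(text: str) -> bytes:
    return b"\x00" * 16000                     # dummy 1 s of silence

def convert_voice(audio: bytes, reference: bytes) -> bytes:
    return audio                               # dummy: would map timbre to the reference speaker

def dub(original_audio: bytes, target_lang: str) -> DubResult:
    text = transcribe_with_emphasis(original_audio)        # 1. ASR + emphasis detection
    translated = translate(text, target_lang)              # 2. machine translation
    tts_audio = synthesize(translated)                      # 3. TTS in the target language
    dubbed = convert_voice(tts_audio, original_audio)       # 4. back to the original voice
    return DubResult(text, translated, dubbed)

print(dub(b"\x00" * 16000, "de").translated_text)
```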
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech [9.035846000646481]
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text.
We demonstrate how this allows VDTTS to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video.
arXiv Detail & Related papers (2021-11-19T10:23:38Z)
- AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
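
In the spirit of that description, the toy setup below pairs an encoder-decoder generator that maps raw video frames to a waveform with a discriminator that scores real versus generated audio; shapes, losses, and layer choices are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToSpeechGenerator(nn.Module):
    """Encoder-decoder: video frames (B, T, 3, 48, 48) -> waveform (B, T * 640)."""
    def __init__(self, dim=128, samples_per_frame=640):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.decode = nn.Linear(dim, samples_per_frame)

    def forward(self, video):
        B, T = video.shape[:2]
        feats = self.frame_enc(video.flatten(0, 1)).view(B, T, -1)  # per-frame features
        h, _ = self.temporal(feats)                                 # temporal context
        return torch.tanh(self.decode(h)).flatten(1)                # (B, T * samples_per_frame)

class WaveDiscriminator(nn.Module):
    """Scores a waveform as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))
    def forward(self, wav):                                         # wav: (B, samples)
        return self.net(wav.unsqueeze(1))                           # (B, 1) real/fake logit

G, D = VideoToSpeechGenerator(), WaveDiscriminator()
video, real = torch.randn(2, 25, 3, 48, 48), torch.randn(2, 25 * 640)
fake = G(video)
d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(2, 1)) + \
         F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(2, 1))
g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(2, 1))
print(fake.shape, d_loss.item(), g_loss.item())
```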
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
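
A minimal sketch of that formulation: video features are projected into the embedding space of a GPT-2-style language model and prefixed to the dialogue tokens, so the whole task reduces to next-token prediction. A tiny randomly initialized config stands in for the pretrained checkpoint, and the projection layer and feature sizes are assumptions:

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random-init GPT-2 stands in for the pretrained checkpoint used in the paper.
config = GPT2Config(vocab_size=1000, n_embd=128, n_layer=2, n_head=4, n_positions=256)
lm = GPT2LMHeadModel(config)
video_proj = nn.Linear(512, config.n_embd)     # project assumed 512-d video features

video_feats = torch.randn(1, 20, 512)          # 20 video segments
dialogue = torch.randint(0, 1000, (1, 30))     # dialogue history + response tokens

# Build one input sequence: [projected video features ; token embeddings].
tok_emb = lm.get_input_embeddings()(dialogue)                   # (1, 30, 128)
inputs = torch.cat([video_proj(video_feats), tok_emb], dim=1)   # (1, 50, 128)

# Only the text positions carry a language-modeling loss; video positions are ignored (-100).
labels = torch.cat([torch.full((1, 20), -100), dialogue], dim=1)
out = lm(inputs_embeds=inputs, labels=labels)
print(out.loss)
```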
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.