Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos
- URL: http://arxiv.org/abs/2206.04523v1
- Date: Thu, 9 Jun 2022 14:15:37 GMT
- Title: Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos
- Authors: Alexander Waibel and Moritz Behr and Fevziye Irem Eyiokur and Dogucan
Yaman and Tuan-Nam Nguyen and Carlos Mullov and Mehmet Arif Demirtas and
Alperen Kantarcı and Stefan Constantin and Hazım Kemal Ekenel
- Abstract summary: The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
- Score: 54.08224321456871
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we propose a neural end-to-end system for voice preserving,
lip-synchronous translation of videos. The system is designed to combine
multiple component models and produces a video of the original speaker speaking
in the target language that is lip-synchronous with the target speech, yet
maintains the emphasis in speech, the voice characteristics, and the face video of the
original speaker. The pipeline starts with automatic speech recognition including
emphasis detection, followed by a translation model. The translated text is
then synthesized by a Text-to-Speech model that recreates the original emphases
mapped from the original sentence. The resulting synthetic voice is then mapped
back to the original speaker's voice using a voice conversion model. Finally,
to synchronize the lips of the speaker with the translated audio, a conditional
generative adversarial network-based model generates frames of adapted lip
movements with respect to the input face image as well as the output of the
voice conversion model. In the end, the system combines the generated video
with the converted audio to produce the final output. The result is a video of
a speaker speaking another language that they do not actually know. To evaluate
our design, we present a user study of the complete system as well as separate
evaluations of the single components. Since no existing dataset is available for
evaluating the whole system, we collect our own test set and evaluate the system on it.
The results indicate that our system is able to generate
convincing videos of the original speaker speaking the target language while
preserving the original speaker's characteristics. The collected dataset will
be shared.
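Read end to end, the abstract describes a cascade of five component models followed by a muxing step. The Python sketch below illustrates that data flow only; every interface shown (asr, translate, tts, voice_convert, lipsync_gan, mux) and the alignment-based emphasis mapping are hypothetical placeholders standing in for the paper's models, not the authors' actual implementation.

```python
from typing import Callable, Sequence


def dub_video(
    frames: Sequence,            # original face video frames
    source_audio,                # original speech waveform
    tgt_lang: str,
    asr: Callable,               # audio -> (transcript words, emphasized word indices)
    translate: Callable,         # (words, lang) -> (target words, source->target alignment)
    tts: Callable,               # (target words, emphasized target indices) -> synthetic speech
    voice_convert: Callable,     # (synthetic speech, reference speech) -> converted speech
    lipsync_gan: Callable,       # (face frames, audio) -> lip-synced frames
    mux: Callable,               # (frames, audio) -> final video
):
    """Run the cascade end to end; all components are supplied by the caller."""
    # 1. Speech recognition with emphasis detection on the source audio.
    words, emphasized = asr(source_audio)
    # 2. Machine translation; the word alignment lets emphasis be mapped over.
    tgt_words, alignment = translate(words, tgt_lang)
    tgt_emphasized = [alignment[i] for i in emphasized if i in alignment]
    # 3. Emphasis-aware text-to-speech in the target language.
    synth_speech = tts(tgt_words, tgt_emphasized)
    # 4. Voice conversion back to the original speaker's voice.
    converted_speech = voice_convert(synth_speech, source_audio)
    # 5. Conditional-GAN lip synchronization against the converted audio.
    synced_frames = lipsync_gan(frames, converted_speech)
    # 6. Combine generated frames and converted audio into the output video.
    return mux(synced_frames, converted_speech)
```

The sketch keeps the components loosely coupled, matching the paper's modular design: any stage (for instance the voice conversion model) can be swapped without touching the rest of the cascade.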
Related papers
- Automatic Voice Identification after Speech Resynthesis using PPG [13.041006302302808]
Speech resynthesis is a generic task in which audio is synthesized from another audio signal given as input.
This paper presents a PPG-based speech resynthesis system.
A perceptual evaluation shows that it produces acceptable audio quality.
arXiv Detail & Related papers (2024-08-05T13:59:40Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Large-scale multilingual audio visual dubbing [31.43873011591989]
We describe a system for large-scale audiovisual translation and dubbing.
The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech.
The visual content is translated by synthesizing lip movements for the speaker to match the translated audio.
arXiv Detail & Related papers (2020-11-06T18:58:15Z)