End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks
- URL: http://arxiv.org/abs/2104.13332v1
- Date: Tue, 27 Apr 2021 17:12:30 GMT
- Title: End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks
- Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis,
Bj\"orn W. Schuller, Maja Pantic
- Abstract summary: We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
- Score: 54.43697805589634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-speech is the process of reconstructing the audio speech from a
video of a spoken utterance. Previous approaches to this task have relied on a
two-step process where an intermediate representation is inferred from the
video, and is then decoded into waveform audio using a vocoder or a waveform
reconstruction algorithm. In this work, we propose a new end-to-end
video-to-speech model based on Generative Adversarial Networks (GANs) which
translates spoken video to waveform end-to-end without using any intermediate
representation or separate waveform synthesis algorithm. Our model consists of
an encoder-decoder architecture that receives raw video as input and generates
speech, which is then fed to a waveform critic and a power critic. The use of
an adversarial loss based on these two critics enables the direct synthesis of
raw audio waveform and ensures its realism. In addition, the use of our three
comparative losses helps establish direct correspondence between the generated
audio and the input video. We show that this model is able to reconstruct
speech with remarkable realism for constrained datasets such as GRID, and that
it is the first end-to-end model to produce intelligible speech for LRW (Lip
Reading in the Wild), featuring hundreds of speakers recorded entirely 'in the
wild'. We evaluate the generated samples in two different scenarios -- seen and
unseen speakers -- using four objective metrics which measure the quality and
intelligibility of artificial speech. We demonstrate that the proposed approach
outperforms all previous works in most metrics on GRID and LRW.
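
To make the architecture described in the abstract concrete, here is a minimal PyTorch sketch of the overall setup: an encoder-decoder generator that maps raw video directly to a raw waveform, plus a waveform critic and a power critic that supply the adversarial signal. All module names, layer sizes, sample rates, and loss weights below are illustrative assumptions for the sketch, not the authors' exact configuration, and the single L1 term only stands in for the paper's three comparative losses.

# Minimal sketch; module names, sizes, sample rates and loss weights are
# illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Encodes raw video (B, 3, T, H, W) into a per-frame feature sequence."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the temporal axis
        self.rnn = nn.GRU(64, feat_dim, batch_first=True)

    def forward(self, video):
        x = F.relu(self.conv(video))              # (B, 64, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (B, 64, T)
        x, _ = self.rnn(x.transpose(1, 2))        # (B, T, feat_dim)
        return x

class WaveformDecoder(nn.Module):
    """Upsamples frame features to a raw waveform: 640 samples per video frame
    (e.g. 16 kHz audio at 25 fps video)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(feat_dim, 128, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=20, stride=10, padding=5), nn.Tanh(),
        )

    def forward(self, feats):                     # (B, T, feat_dim)
        return self.net(feats.transpose(1, 2))    # (B, 1, T * 640)

class WaveformCritic(nn.Module):
    """Scores realism of raw waveforms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, wav):
        return self.net(wav).mean(dim=(1, 2))     # one score per sample

class PowerCritic(nn.Module):
    """Scores realism in the time-frequency (power spectrogram) domain."""
    def __init__(self, n_fft=512, hop=160):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, wav):
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav.squeeze(1), self.n_fft, hop_length=self.hop,
                          window=window, return_complex=True).abs() ** 2
        return self.net(torch.log1p(spec).unsqueeze(1)).mean(dim=(1, 2, 3))

def generator_loss(fake_wav, real_wav, wav_critic, pow_critic):
    """Adversarial terms from both critics plus an L1 reconstruction term that
    stands in for the paper's three comparative losses (weight is illustrative)."""
    adversarial = -wav_critic(fake_wav).mean() - pow_critic(fake_wav).mean()
    comparative = F.l1_loss(fake_wav, real_wav)
    return adversarial + 10.0 * comparative

# Example: one second of 25 fps video crops paired with 16 kHz audio.
video = torch.randn(2, 3, 25, 48, 48)
real_wav = torch.randn(2, 1, 25 * 640)
fake_wav = WaveformDecoder()(VideoEncoder()(video))
loss = generator_loss(fake_wav, real_wav, WaveformCritic(), PowerCritic())

In practice the two critics would be trained with their own objective on real versus generated audio (e.g. a hinge or Wasserstein loss); only the generator-side terms are shown in this sketch.
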
Related papers
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model and text-to-speech synthesis of the translated text.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder (see the sketch after this list).
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
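
The SVTS entry above describes the kind of two-step design (spectrogram prediction followed by a separate, pre-trained vocoder) that the main paper replaces with direct waveform generation. Below is a minimal, hypothetical Python sketch of that two-stage inference; the module, the stand-in vocoder, and all shapes are illustrative assumptions rather than the SVTS implementation.

# Hypothetical sketch of a two-stage pipeline in the spirit of SVTS: a placeholder
# video-to-spectrogram predictor followed by a frozen, pre-trained vocoder.
# Names, shapes and the stand-in vocoder are assumptions, not the SVTS code.
import torch
import torch.nn as nn

class VideoToSpectrogram(nn.Module):
    """Maps per-frame video features to mel-spectrogram frames (placeholder model)."""
    def __init__(self, video_dim=512, n_mels=80, mel_frames_per_video_frame=4):
        super().__init__()
        self.proj = nn.Linear(video_dim, n_mels * mel_frames_per_video_frame)
        self.n_mels = n_mels

    def forward(self, video_feats):                       # (B, T, video_dim)
        mel = self.proj(video_feats)                      # (B, T, n_mels * k)
        return mel.reshape(mel.size(0), -1, self.n_mels)  # (B, T * k, n_mels)

@torch.no_grad()
def synthesize(video_feats, predictor, vocoder):
    """Two-stage inference: predict a spectrogram, then let a pre-trained
    mel-to-waveform vocoder produce the audio."""
    mel = predictor(video_feats)
    return vocoder(mel)                                   # (B, samples)

predictor = VideoToSpectrogram()
vocoder = lambda mel: torch.zeros(mel.size(0), mel.size(1) * 256)  # stand-in vocoder
wav = synthesize(torch.randn(2, 25, 512), predictor, vocoder)
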
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.