SVTS: Scalable Video-to-Speech Synthesis
- URL: http://arxiv.org/abs/2205.02058v1
- Date: Wed, 4 May 2022 13:34:07 GMT
- Title: SVTS: Scalable Video-to-Speech Synthesis
- Authors: Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller and Maja Pantic
- Abstract summary: We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
- Score: 105.29009019733803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-speech synthesis (also known as lip-to-speech) refers to the
translation of silent lip movements into the corresponding audio. This task has
received an increasing amount of attention due to its self-supervised nature
(i.e., can be trained without manual labelling) combined with the ever-growing
collection of audio-visual data available online. Despite these strong
motivations, contemporary video-to-speech works focus mainly on small- to
medium-sized corpora with substantial constraints in both vocabulary and
setting. In this work, we introduce a scalable video-to-speech framework
consisting of two components: a video-to-spectrogram predictor and a
pre-trained neural vocoder, which converts the mel-frequency spectrograms into
waveform audio. We achieve state-of-the-art results for GRID and considerably
outperform previous approaches on LRW. More importantly, by focusing on
spectrogram prediction using a simple feedforward model, we can efficiently and
effectively scale our method to very large and unconstrained datasets: To the
best of our knowledge, we are the first to show intelligible results on the
challenging LRS3 dataset.
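The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal, illustrative PyTorch mock-up, not the authors' architecture: the layer choices, the 4:1 video-to-spectrogram frame ratio, and the toy upsampling "vocoder" are all assumptions standing in for the paper's spectrogram predictor and pre-trained neural vocoder.

```python
import torch
import torch.nn as nn


class VideoToSpectrogram(nn.Module):
    """Predict mel-spectrogram frames from silent lip-region video."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Spatio-temporal front-end that keeps the frame axis and pools away space.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # -> (B, 64, T, 1, 1)
        )
        # Simple feed-forward temporal model over per-frame features.
        self.encoder = nn.Sequential(
            nn.Linear(64, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
        )
        # Assumed 4 spectrogram frames per video frame (e.g. 25 fps video,
        # 100 mel frames per second); not taken from the paper.
        self.frames_per_step = 4
        self.n_mels = n_mels
        self.head = nn.Linear(d_model, n_mels * self.frames_per_step)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 1, T, H, W) grayscale mouth crops
        b, _, t, _, _ = video.shape
        feats = self.frontend(video).flatten(2).transpose(1, 2)   # (B, T, 64)
        feats = self.encoder(feats)                               # (B, T, d_model)
        mel = self.head(feats).reshape(b, t * self.frames_per_step, self.n_mels)
        return mel                                                # (B, T_mel, n_mels)


class ToyVocoder(nn.Module):
    """Stand-in for a pre-trained neural vocoder: just a transposed convolution
    that upsamples mel frames to waveform samples so the sketch runs end to end."""

    def __init__(self, n_mels: int = 80, hop_length: int = 160):
        super().__init__()
        self.net = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length, stride=hop_length)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, T_mel, n_mels) -> waveform: (B, T_mel * hop_length)
        return self.net(mel.transpose(1, 2)).squeeze(1)


if __name__ == "__main__":
    predictor, vocoder = VideoToSpectrogram(), ToyVocoder()
    video = torch.randn(2, 1, 25, 88, 88)   # one second of 25 fps mouth crops
    mel = predictor(video)                  # (2, 100, 80)
    wav = vocoder(mel)                      # (2, 16000) at an assumed 16 kHz
    print(mel.shape, wav.shape)
```

In a real system of this kind, the toy upsampler would be replaced by an actual pre-trained vocoder checkpoint and the predictor would be trained against ground-truth mel-spectrograms; only the overall predictor-plus-vocoder split is taken from the abstract.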
Related papers
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
video-SALMONN can understand not only visual frame sequences, audio events, and music, but also speech.
It demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented among other audio-visual LLMs (av-LLMs).
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID (see the sketch after this list).
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
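As a companion to the last entry, here is a hedged sketch of the end-to-end GAN idea it describes: an encoder-decoder generator maps raw video directly to a waveform and a discriminator scores the result. All layer shapes, the least-squares GAN losses, and the random "real"-audio placeholder are illustrative assumptions, not the cited model.

```python
import torch
import torch.nn as nn


class VideoToWaveformGenerator(nn.Module):
    """Encoder-decoder that maps raw video frames directly to a waveform."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # one 64-d feature per video frame
        )
        # Upsample 25 features/s to 16,000 samples/s (assumed rates) with
        # transposed convolutions; Tanh keeps samples in [-1, 1].
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=160, stride=80, padding=40),
            nn.Tanh(),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(video).flatten(2)    # (B, 64, T_frames)
        return self.decoder(feats).squeeze(1)     # (B, T_samples)


class WaveformDiscriminator(nn.Module):
    """Scores how realistic a waveform sounds (higher = judged more real)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=41, stride=4, padding=20), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=41, stride=4, padding=20),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        return self.net(wav.unsqueeze(1)).mean(dim=(1, 2))  # one score per clip


if __name__ == "__main__":
    g, d = VideoToWaveformGenerator(), WaveformDiscriminator()
    video = torch.randn(2, 1, 25, 48, 48)
    fake = g(video)
    real = torch.randn_like(fake)  # placeholder: real speech waveforms would go here
    # Least-squares GAN losses as a simple stand-in for the adversarial objective.
    d_loss = (d(fake.detach()) ** 2).mean() + ((d(real) - 1) ** 2).mean()
    g_loss = ((d(fake) - 1) ** 2).mean()
    print(fake.shape, d_loss.item(), g_loss.item())
```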