LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
- URL: http://arxiv.org/abs/2112.04748v1
- Date: Thu, 9 Dec 2021 08:11:35 GMT
- Title: LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
- Authors: Leyuan Qu, Cornelius Weber and Stefan Wermter
- Abstract summary: The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
- Score: 24.744371143092614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on $\sim$2400 hours of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify transferability across languages. Lastly, we build a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
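To make the described pipeline more concrete, below is a minimal PyTorch sketch of a LipSound2-style video-to-spectrogram model: a per-frame visual front end, a bidirectional LSTM encoder, and an autoregressive decoder with location-aware (location-sensitive) attention that emits mel-spectrogram frames. This is an illustrative sketch, not the authors' code: all layer sizes, the 2D-CNN front end, and the video-to-mel frame ratio in the example are assumptions.

```python
# Illustrative sketch of a LipSound2-style video-to-mel model (NOT the authors' code).
# Layer sizes, the 2D-CNN front end, and the decoding loop are assumptions for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Location-sensitive attention: energies depend on the decoder query,
    the encoder keys, and conv features of the previous attention weights."""
    def __init__(self, query_dim, key_dim, attn_dim=128, loc_filters=32, loc_kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_proj = nn.Linear(key_dim, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, loc_filters, loc_kernel, padding=loc_kernel // 2, bias=False)
        self.loc_proj = nn.Linear(loc_filters, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_weights):
        # query: (B, query_dim), keys: (B, T, key_dim), prev_weights: (B, T)
        loc = self.loc_conv(prev_weights.unsqueeze(1)).transpose(1, 2)       # (B, T, loc_filters)
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.key_proj(keys) + self.loc_proj(loc)
        )).squeeze(-1)                                                       # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)           # (B, key_dim)
        return context, weights


class VideoToMel(nn.Module):
    """Encoder-decoder mapping a sequence of face crops to a mel spectrogram."""
    def __init__(self, enc_dim=256, dec_dim=512, n_mels=80):
        super().__init__()
        # Per-frame visual front end (a 2D CNN here; a 3D CNN front end is equally plausible).
        self.frontend = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(64 * 16, enc_dim),
        )
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.attention = LocationAwareAttention(dec_dim, enc_dim)
        self.decoder_cell = nn.LSTMCell(enc_dim + n_mels, dec_dim)
        self.mel_out = nn.Linear(dec_dim + enc_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, frames, n_dec_steps):
        # frames: (B, T, 3, H, W) face crops; returns (B, n_dec_steps, n_mels).
        B, T = frames.shape[:2]
        feats = self.frontend(frames.flatten(0, 1)).view(B, T, -1)
        keys, _ = self.encoder(feats)                                        # (B, T, enc_dim)

        h = frames.new_zeros(B, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev_mel = frames.new_zeros(B, self.n_mels)
        attn_weights = frames.new_zeros(B, T)
        outputs = []
        for _ in range(n_dec_steps):
            context, attn_weights = self.attention(h, keys, attn_weights)
            h, c = self.decoder_cell(torch.cat([context, prev_mel], dim=-1), (h, c))
            prev_mel = self.mel_out(torch.cat([h, context], dim=-1))
            outputs.append(prev_mel)
        return torch.stack(outputs, dim=1)


# Example: 25 video frames -> 100 mel frames (the 1:4 frame ratio is an assumption).
model = VideoToMel()
mels = model(torch.randn(2, 25, 3, 96, 96), n_dec_steps=100)
print(mels.shape)  # torch.Size([2, 100, 80])
```

In the paper's cascaded lip-reading setting, the generated mel-spectrograms would then be vocoded to waveforms and fed to a pre-trained speech recognition system that is fine-tuned on them; that stage is omitted from this sketch.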
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy that processes visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations [13.995231731152462]
We propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
A non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content.
A vocoder then converts the speech features into raw waveforms; a toy sketch of this modular pipeline appears after this list.
arXiv Detail & Related papers (2023-07-03T09:13:57Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead of predicting continuous spectrogram features, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
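As referenced in the RobustL2S entry above, the sketch below illustrates that kind of modular lip-to-speech pipeline: frame-wise self-supervised visual features are mapped non-autoregressively to a content representation, and a vocoder turns that representation into a waveform. This is a hedged illustration, not the paper's implementation: the 768-dimensional visual features, the use of discrete speech units as the content target, and the toy upsampling "vocoder" are all assumptions.

```python
# Toy sketch of a modular lip-to-speech pipeline (visual SSL features -> content -> waveform).
# NOT the RobustL2S implementation; dimensions and the unit-based vocoder are assumptions.
import torch
import torch.nn as nn


class VisualToContent(nn.Module):
    """Non-autoregressive seq2seq: frame-wise self-supervised visual features
    -> logits over discrete speech-content units (one prediction per frame)."""
    def __init__(self, vis_dim=768, model_dim=256, n_units=200):
        super().__init__()
        self.proj = nn.Linear(vis_dim, model_dim)
        enc_layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.head = nn.Linear(model_dim, n_units)

    def forward(self, vis_feats):                          # (B, T, vis_dim)
        return self.head(self.encoder(self.proj(vis_feats)))  # (B, T, n_units)


class ToyUnitVocoder(nn.Module):
    """Stand-in for a neural vocoder: embeds unit IDs and upsamples to a waveform."""
    def __init__(self, n_units=200, hop=320):
        super().__init__()
        self.embed = nn.Embedding(n_units, 64)
        self.upsample = nn.ConvTranspose1d(64, 1, kernel_size=hop, stride=hop)

    def forward(self, units):                              # (B, T) int64 unit IDs
        x = self.embed(units).transpose(1, 2)               # (B, 64, T)
        return torch.tanh(self.upsample(x)).squeeze(1)      # (B, T * hop) waveform


vis_feats = torch.randn(1, 50, 768)        # e.g., features from a pretrained visual SSL model
content = VisualToContent()(vis_feats)     # (1, 50, 200) unit logits
wave = ToyUnitVocoder()(content.argmax(-1))  # (1, 16000) toy waveform
print(wave.shape)
```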
This list is automatically generated from the titles and abstracts of the papers on this site.