ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
Universal and Generalized Speech Enhancement
- URL: http://arxiv.org/abs/2212.11377v1
- Date: Wed, 21 Dec 2022 21:36:52 GMT
- Title: ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
Universal and Generalized Speech Enhancement
- Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
- Abstract summary: ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
- Score: 40.29155338515071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works on improving speech quality with visual input typically study
each type of auditory distortion separately (e.g., separation, inpainting,
video-to-speech) and present tailored algorithms. This paper proposes to unify
these subjects and study Generalized Speech Enhancement, where the goal is not
to reconstruct the exact reference clean signal, but to focus on improving
certain aspects of speech. In particular, this paper concerns intelligibility,
quality, and video synchronization. We cast the problem as audio-visual speech
resynthesis, which is composed of two steps: pseudo audio-visual speech
recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and
P-TTS are connected by discrete units derived from a self-supervised speech
model. Moreover, we utilize a self-supervised audio-visual speech model to
initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first
high-quality model for in-the-wild video-to-speech synthesis and achieves
superior performance on all LRS3 audio-visual enhancement tasks with a single
model. To demonstrate its applicability in the real world, ReVISE is also
evaluated on EasyCom, an audio-visual benchmark collected under challenging
acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE
greatly suppresses noise and improves quality. Project page:
https://wnhsu.github.io/ReVISE.
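The abstract describes a two-stage resynthesis pipeline: P-AVSR predicts discrete self-supervised speech units from the (degraded) audio-visual input, and P-TTS synthesizes a clean waveform from those units. The sketch below illustrates that structure only; all module names, shapes, and hyperparameters are illustrative assumptions and not the authors' implementation (which builds on AV-HuBERT-style units and a unit-based vocoder).

```python
# Minimal sketch of a ReVISE-style two-stage pipeline:
#   P-AVSR: (video, degraded audio) -> discrete unit IDs
#   P-TTS:  discrete unit IDs       -> clean waveform
# Hypothetical stand-in modules; not the released ReVISE code.
import torch
import torch.nn as nn


class PseudoAVSR(nn.Module):
    """Predicts a per-frame discrete speech unit from audio-visual features."""

    def __init__(self, video_dim=512, audio_dim=80, hidden=768, num_units=1000):
        super().__init__()
        self.proj = nn.Linear(video_dim + audio_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.unit_head = nn.Linear(hidden, num_units)  # per-frame unit logits

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T, video_dim); audio_feats: (B, T, audio_dim)
        x = self.proj(torch.cat([video_feats, audio_feats], dim=-1))
        x = self.encoder(x)
        return self.unit_head(x)  # (B, T, num_units)


class PseudoTTS(nn.Module):
    """Turns discrete unit IDs back into a waveform (stand-in for a unit vocoder)."""

    def __init__(self, num_units=1000, hidden=256, hop=320):
        super().__init__()
        self.embed = nn.Embedding(num_units, hidden)
        self.upsample = nn.ConvTranspose1d(hidden, 1, kernel_size=hop, stride=hop)

    def forward(self, unit_ids):
        # unit_ids: (B, T) integer unit indices
        x = self.embed(unit_ids).transpose(1, 2)  # (B, hidden, T)
        return self.upsample(x).squeeze(1)        # (B, T * hop) waveform


if __name__ == "__main__":
    B, T = 2, 50
    p_avsr, p_tts = PseudoAVSR(), PseudoTTS()
    video = torch.randn(B, T, 512)        # e.g. lip-region visual features
    noisy_audio = torch.randn(B, T, 80)   # e.g. log-mel frames of degraded speech
    logits = p_avsr(video, noisy_audio)   # (B, T, num_units)
    units = logits.argmax(dim=-1)         # discrete units bridge the two stages
    wav = p_tts(units)                    # (B, T * hop) resynthesized waveform
    print(wav.shape)
```

Because the two stages communicate only through discrete units, the synthesis stage never needs the exact reference waveform, which is what allows the paper to target intelligibility, quality, and synchronization rather than exact signal reconstruction.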
Related papers
- AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26 WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in the mask-based MVDR speech separation and the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)