ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
Universal and Generalized Speech Enhancement
- URL: http://arxiv.org/abs/2212.11377v1
- Date: Wed, 21 Dec 2022 21:36:52 GMT
- Title: ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
Universal and Generalized Speech Enhancement
- Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
- Abstract summary: ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
- Score: 40.29155338515071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works on improving speech quality with visual input typically study
each type of auditory distortion separately (e.g., separation, inpainting,
video-to-speech) and present tailored algorithms. This paper proposes to unify
these subjects and study Generalized Speech Enhancement, where the goal is not
to reconstruct the exact reference clean signal, but to focus on improving
certain aspects of speech. In particular, this paper concerns intelligibility,
quality, and video synchronization. We cast the problem as audio-visual speech
resynthesis, which is composed of two steps: pseudo audio-visual speech
recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and
P-TTS are connected by discrete units derived from a self-supervised speech
model. Moreover, we utilize a self-supervised audio-visual speech model to
initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first
high-quality model for in-the-wild video-to-speech synthesis and achieves
superior performance on all LRS3 audio-visual enhancement tasks with a single
model. To demonstrate its applicability in the real world, ReVISE is also
evaluated on EasyCom, an audio-visual benchmark collected under challenging
acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE
greatly suppresses noise and improves quality. Project page:
https://wnhsu.github.io/ReVISE.
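As a rough illustration of the two-stage pipeline described in the abstract, the following PyTorch-style sketch wires a P-AVSR-like module, which predicts discrete self-supervised units from corrupted audio-visual input, into a P-TTS-like unit-to-waveform synthesizer. All module sizes, the unit vocabulary, and the fusion scheme are assumptions for illustration, not the released ReVISE implementation.

```python
import torch
import torch.nn as nn

class PseudoAVSR(nn.Module):
    """P-AVSR: maps corrupted (audio, lip-video) features to discrete speech units."""
    def __init__(self, audio_dim=80, video_dim=512, hidden=768, num_units=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.unit_head = nn.Linear(hidden, num_units)  # logits over self-supervised unit IDs

    def forward(self, audio_feats, video_feats):
        # Simple additive fusion after projection (an assumption; the paper
        # initializes this encoder from a self-supervised audio-visual model).
        x = self.audio_proj(audio_feats) + self.video_proj(video_feats)
        return self.unit_head(self.encoder(x))         # (B, T, num_units)

class PseudoTTS(nn.Module):
    """P-TTS: a stand-in unit-to-waveform synthesizer (unit-based vocoder)."""
    def __init__(self, num_units=1000, hidden=256):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, hidden)
        self.upsample = nn.ConvTranspose1d(hidden, 1, kernel_size=320, stride=160)

    def forward(self, unit_ids):
        x = self.unit_emb(unit_ids).transpose(1, 2)    # (B, hidden, T)
        return self.upsample(x).squeeze(1)             # (B, samples)

# Enhancement as resynthesis: predict units from degraded input, then vocode them.
p_avsr, p_tts = PseudoAVSR(), PseudoTTS()
noisy_audio = torch.randn(2, 100, 80)    # e.g. log-mel frames of degraded speech
lip_video = torch.randn(2, 100, 512)     # e.g. lip-region visual features
units = p_avsr(noisy_audio, lip_video).argmax(dim=-1)
clean_wave = p_tts(units)
```

The key design point is that the two stages communicate only through discrete units, so the output is resynthesized speech rather than a reconstruction of the exact reference waveform.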
Related papers
- NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing [16.47490478732181]
We propose an end-to-end framework integrating acoustic inductive biases with differentiable speech generation components.
Specifically, we introduce a fundamental frequency (F0) predictor to capture prosodic variations in synthesized speech.
Our approach achieves satisfactory performance on speaker similarity without explicitly modelling speaker characteristics.
arXiv Detail & Related papers (2025-02-17T16:40:23Z)
- Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization [59.1277150358203]
We propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos.
First, we create preference data by simulating common errors that occur in AV-ASR from two focal perspectives.
Second, we propose BPO-AVASR, a Bifocal Preference Optimization method that improves AV-ASR models by leveraging both input-side and output-side preferences.
arXiv Detail & Related papers (2024-12-26T00:26:45Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) network that devotes its main training parameters to multiple cross-modal attention layers.
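For intuition, here is a minimal sketch of an audio-guided cross-modal attention layer of the kind this summary describes; the layer sizes and the exact query/key assignment are assumptions, not the paper's CMFE specification.

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """One cross-modal attention block: audio features query visual (lip) features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video):
        # Audio provides the queries; visual features provide keys and values.
        fused, _ = self.cross_attn(query=audio, key=video, value=video)
        x = self.norm1(audio + fused)
        return self.norm2(x + self.ffn(x))

# Stacking a few such layers yields a fusion encoder in the spirit of CMFE.
layers = nn.ModuleList(AudioGuidedFusionLayer() for _ in range(3))
audio, video = torch.randn(2, 120, 512), torch.randn(2, 120, 512)
x = audio
for layer in layers:
    x = layer(x, video)
```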
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a word error rate (WER) of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech data constructed via simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that the added components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
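A hedged sketch of the general recipe this summary points at, i.e. projecting visual features into the token space of a frozen speech encoder with a small number of new trainable parameters; the projection size, token count, and backbone below are placeholders, not AVFormer's actual architecture.

```python
import torch
import torch.nn as nn

class VisualInjection(nn.Module):
    """Project visual features into the audio token space and prepend them
    to the sequence consumed by a frozen speech encoder."""
    def __init__(self, frozen_encoder: nn.Module, video_dim=768, audio_dim=512, num_visual_tokens=4):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():          # the speech model stays frozen
            p.requires_grad = False
        self.visual_proj = nn.Linear(video_dim, audio_dim)  # the only new weights trained here
        self.num_visual_tokens = num_visual_tokens

    def forward(self, audio_tokens, video_feats):
        # Keep a handful of visual tokens, map them to the audio embedding space, prepend.
        visual_tokens = self.visual_proj(video_feats[:, : self.num_visual_tokens, :])
        return self.encoder(torch.cat([visual_tokens, audio_tokens], dim=1))

# Any sequence encoder can stand in for the frozen ASR backbone in this sketch.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
model = VisualInjection(backbone)
out = model(torch.randn(2, 100, 512), torch.randn(2, 10, 768))
```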
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
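For the GAN-based video-to-speech entry above, a compact sketch of the adversarial setup: an encoder-decoder generator maps lip-video features to a raw waveform and a discriminator scores real versus synthesized audio. The shapes and the loss below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToSpeechGenerator(nn.Module):
    """Encoder-decoder generator: lip-video features in, raw waveform out."""
    def __init__(self, video_dim=512, hidden=256, samples_per_frame=640):
        super().__init__()
        self.encoder = nn.GRU(video_dim, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.Linear(hidden, samples_per_frame)   # one waveform chunk per video frame

    def forward(self, video_feats):
        h, _ = self.encoder(video_feats)
        return self.decoder(h).flatten(1)                     # (B, frames * samples_per_frame)

class WaveformDiscriminator(nn.Module):
    """Scores whether a waveform is real or synthesized."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=41, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=41, stride=4), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, wave):
        return self.net(wave.unsqueeze(1))                    # (B, 1) real/fake logit

gen, disc = VideoToSpeechGenerator(), WaveformDiscriminator()
fake_wave = gen(torch.randn(2, 75, 512))                      # 75 frames -> ~3 s at 16 kHz
gen_loss = F.binary_cross_entropy_with_logits(                # generator tries to fool the critic
    disc(fake_wave), torch.ones(2, 1))
```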