Visual Speech Enhancement Without A Real Visual Stream
- URL: http://arxiv.org/abs/2012.10852v1
- Date: Sun, 20 Dec 2020 06:02:12 GMT
- Title: Visual Speech Enhancement Without A Real Visual Stream
- Authors: Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri,
C.V. Jawahar
- Abstract summary: Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises.
Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods.
We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis.
- Score: 37.88869937166955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we re-think the task of speech enhancement in unconstrained
real-world environments. Current state-of-the-art methods use only the audio
stream and are limited in their performance in a wide range of real-world
noises. Recent works using lip movements as additional cues improve the quality
of generated speech over "audio-only" methods. However, these methods cannot be
used for several applications where the visual stream is unreliable or
completely absent. We propose a new paradigm for speech enhancement by
exploiting recent breakthroughs in speech-driven lip synthesis. Using one such
model as a teacher network, we train a robust student network to produce
accurate lip movements that mask away the noise, thus acting as a "visual noise
filter". The intelligibility of the speech enhanced by our pseudo-lip approach
is comparable (< 3% difference) to the case of using real lips. This implies
that we can exploit the advantages of using lip movements even in the absence
of a real video stream. We rigorously evaluate our model using quantitative
metrics as well as human evaluations. Additional ablation studies and a demo
video on our website, containing qualitative comparisons and results, clearly
illustrate the effectiveness of our proposed approach:
\url{http://cvit.iiit.ac.in/research/projects/cvit-projects/visual-speech-enhancement-without-a-real-visual-stream}.
The code and models are also released for future research:
\url{https://github.com/Sindhu-Hegde/pseudo-visual-speech-denoising}.
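To make the pipeline described in the abstract concrete: a student lip-generator, distilled from a speech-driven lip-synthesis teacher, maps noisy speech (plus a reference face crop) to pseudo lip frames, and an audio-visual enhancer then consumes the noisy spectrogram together with those pseudo lips to predict clean speech. The sketch below only illustrates that data flow; the module names (StudentLipGen, AVEnhancer), shapes, and architectures are placeholder assumptions, not the released implementation.

import torch
import torch.nn as nn

class StudentLipGen(nn.Module):
    """Hypothetical student: noisy-speech mel window + reference face crop -> pseudo lip frames."""
    def __init__(self, n_mels=80, mel_steps=16, hw=96, n_frames=5):
        super().__init__()
        self.n_frames, self.hw = n_frames, hw
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(n_mels * mel_steps, 512), nn.ReLU())
        self.face_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * hw * hw, 512), nn.ReLU())
        self.decoder = nn.Linear(1024, n_frames * 3 * hw * hw)

    def forward(self, noisy_mel, ref_face):
        z = torch.cat([self.audio_enc(noisy_mel), self.face_enc(ref_face)], dim=-1)
        return self.decoder(z).view(-1, self.n_frames, 3, self.hw, self.hw)

class AVEnhancer(nn.Module):
    """Hypothetical enhancer: noisy mel + pseudo lip frames -> enhanced (clean) mel."""
    def __init__(self, n_mels=80, mel_steps=16, hw=96, n_frames=5):
        super().__init__()
        self.lip_enc = nn.Sequential(nn.Flatten(), nn.Linear(n_frames * 3 * hw * hw, 512), nn.ReLU())
        self.fuse = nn.Linear(512 + n_mels * mel_steps, n_mels * mel_steps)

    def forward(self, noisy_mel, lips):
        fused = torch.cat([self.lip_enc(lips), noisy_mel.flatten(1)], dim=-1)
        return self.fuse(fused).view_as(noisy_mel)

# Distillation, conceptually: a frozen speech-driven lip-synthesis teacher is fed clean
# speech, the student is fed the corresponding noisy speech, and the student is trained
# to match the teacher's lip frames, e.g. loss = l1(student(noisy, face), teacher(clean, face)).

# Inference-time data flow (random tensors stand in for real features):
noisy_mel = torch.randn(2, 80, 16)       # (batch, mel bins, mel steps)
ref_face = torch.randn(2, 3, 96, 96)     # a single reference face crop per sample
student, enhancer = StudentLipGen(), AVEnhancer()
pseudo_lips = student(noisy_mel, ref_face)       # the "visual noise filter" stream
enhanced_mel = enhancer(noisy_mel, pseudo_lips)  # enhanced spectrogram estimate
print(pseudo_lips.shape, enhanced_mel.shape)     # (2, 5, 3, 96, 96) and (2, 80, 16)

The released repository linked above contains the actual architectures and training details.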
Related papers
- SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model [35.60147467774199]
To the best of our knowledge, SAV-SE is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves speech enhancement performance.
arXiv Detail & Related papers (2024-11-12T12:23:41Z)
- Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert [13.60808166889775]
We introduce a method for speech-driven 3D facial animation to generate accurate lip movements.
A loss derived from the audio-visual guidance of a lip reading expert trains the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts.
We validate the effectiveness of our approach through broad experiments, showing noticeable improvements in lip synchronization and lip readability performance.
arXiv Detail & Related papers (2024-07-01T07:39:28Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. A minimal sketch of this two-stage design appears after this list.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
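As referenced in the LA-VocE entry above, here is a minimal sketch of a generic two-stage audio-visual enhancement flow of the kind that entry summarizes: a transformer predicts clean mel-spectrograms from noisy audio plus visual features, and a stand-in vocoder then maps the mels to a waveform. All module choices, feature shapes, and the toy vocoder are assumptions for illustration only; they are not the LA-VocE implementation.

import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stage 1 (illustrative): fuse noisy audio and lip features, predict clean mels."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.video_proj = nn.Linear(512, d_model)   # assumes precomputed per-frame lip embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mel, lip_emb):
        x = self.audio_proj(noisy_mel) + self.video_proj(lip_emb)
        return self.head(self.encoder(x))

class ToyVocoder(nn.Module):
    """Stage 2 (stand-in for a neural vocoder): upsample each mel frame to waveform samples."""
    def __init__(self, n_mels=80, hop=160):
        super().__init__()
        self.net = nn.Linear(n_mels, hop)

    def forward(self, mel):
        return self.net(mel).flatten(1)  # (batch, frames * hop) waveform

noisy_mel = torch.randn(1, 100, 80)   # (batch, frames, mel bins)
lip_emb = torch.randn(1, 100, 512)    # per-frame visual features
clean_mel = MelPredictor()(noisy_mel, lip_emb)
wav = ToyVocoder()(clean_mel)
print(clean_mel.shape, wav.shape)     # torch.Size([1, 100, 80]) torch.Size([1, 16000])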
This list is automatically generated from the titles and abstracts of the papers in this site.