Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis
- URL: http://arxiv.org/abs/2203.17263v1
- Date: Thu, 31 Mar 2022 17:57:10 GMT
- Title: Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis
- Authors: Karren Yang, Dejan Markovic, Steven Krenn, Vasu Agrawal, Alexander
Richard
- Abstract summary: We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
- Score: 67.73554826428762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since facial actions such as lip movements contain significant information
about speech content, it is not surprising that audio-visual speech enhancement
methods are more accurate than their audio-only counterparts. Yet,
state-of-the-art approaches still struggle to generate clean, realistic speech
without noise artifacts and unnatural distortions in challenging acoustic
environments. In this paper, we propose a novel audio-visual speech enhancement
framework for high-fidelity telecommunications in AR/VR. Our approach leverages
audio-visual speech cues to generate the codes of a neural speech codec,
enabling efficient synthesis of clean, realistic speech from noisy signals.
Given the importance of speaker-specific cues in speech, we focus on developing
personalized models that work well for individual speakers. We demonstrate the
efficacy of our approach on a new audio-visual speech dataset collected in an
unconstrained, large vocabulary setting, as well as existing audio-visual
datasets, outperforming speech enhancement baselines on both quantitative
metrics and human evaluation studies. Please see the supplemental video for
qualitative results at
https://github.com/facebookresearch/facestar/releases/download/paper_materials/video.mp4.
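To make the re-synthesis idea concrete, below is a minimal sketch of the kind of pipeline the abstract describes: an audio-visual encoder predicts the discrete codes of a pretrained neural speech codec from noisy audio and lip video, and the codec decoder re-synthesizes clean speech from those codes. All module names, feature dimensions, and the code vocabulary size are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of code-prediction-based audio-visual speech enhancement.
# An audio-visual encoder maps noisy audio and lip features to the discrete
# codes of a neural speech codec; the codec decoder (not shown) would then
# re-synthesize clean speech from the predicted codes.
import torch
import torch.nn as nn

class AudioVisualCodePredictor(nn.Module):
    def __init__(self, n_codes=1024, d_model=256):
        super().__init__()
        # Noisy-audio branch: mel-spectrogram frames -> d_model features.
        self.audio_proj = nn.Linear(80, d_model)
        # Visual branch: per-frame lip embeddings -> d_model features.
        self.video_proj = nn.Linear(512, d_model)
        # Fuse the two streams over time with a small transformer.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Classify each frame into one of the codec's discrete codes.
        self.to_codes = nn.Linear(d_model, n_codes)

    def forward(self, noisy_mels, lip_feats):
        # noisy_mels: (B, T, 80), lip_feats: (B, T, 512), time-aligned.
        x = self.audio_proj(noisy_mels) + self.video_proj(lip_feats)
        x = self.fusion(x)
        return self.to_codes(x)  # (B, T, n_codes) logits over codec codes

# Training targets would be codes obtained by encoding the clean reference
# speech with the pretrained codec; at inference the predicted codes are fed
# to the codec decoder to synthesize clean, realistic speech.
model = AudioVisualCodePredictor()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
print(logits.shape)  # torch.Size([2, 100, 1024])
```

The appeal of this formulation is that the enhancement model only has to classify discrete codec codes, while the codec decoder takes care of generating the clean, realistic waveform.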
Related papers
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the missing speech in a corrupted segment of an audio signal.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show that visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Vocoder-Based Speech Synthesis from Silent Videos [28.94460283719776]
We present a way to synthesise speech from the silent video of a talker using deep learning.
The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
arXiv Detail & Related papers (2020-04-06T10:22:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.