Learning Audio-Visual Dereverberation
- URL: http://arxiv.org/abs/2106.07732v1
- Date: Mon, 14 Jun 2021 20:01:24 GMT
- Title: Learning Audio-Visual Dereverberation
- Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman
- Abstract summary: Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
- Score: 87.52880019747435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reverberation from audio reflecting off surfaces and objects in the
environment not only degrades the quality of speech for human perception, but
also severely impacts the accuracy of automatic speech recognition. Prior work
attempts to remove reverberation based on the audio modality only. Our idea is
to learn to dereverberate speech from audio-visual observations. The visual
environment surrounding a human speaker reveals important cues about the room
geometry, materials, and speaker location, all of which influence the precise
reverberation effects in the audio stream. We introduce Visually-Informed
Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove
reverberation based on both the observed sounds and visual scene. In support of
this new task, we develop a large-scale dataset that uses realistic acoustic
renderings of speech in real-world 3D scans of homes offering a variety of room
acoustics. Demonstrating our approach on both simulated and real imagery for
speech enhancement, speech recognition, and speaker identification, we show it
achieves state-of-the-art performance and substantially improves over
traditional audio-only methods. Project page:
http://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation.
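To make the setup concrete, the following PyTorch sketch shows one way an audio-visual dereverberation model could fuse a reverberant spectrogram with an image of the scene to predict a time-frequency mask. It is a minimal illustration under assumed module choices and dimensions, not the VIDA architecture described in the paper.

```python
# Hypothetical audio-visual dereverberation sketch (NOT the VIDA architecture):
# fuse audio and visual features, then predict a per-bin dereverberation mask.
import torch
import torch.nn as nn

class AudioVisualDereverb(nn.Module):
    def __init__(self, visual_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode the reverberant magnitude spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Visual branch: placeholder CNN over the scene image
        # (a pretrained ResNet would be a more realistic choice).
        self.visual_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, visual_dim),
        )
        # Fuse the two streams and predict a mask over time-frequency bins.
        self.fuse = nn.Conv2d(64 + visual_dim, hidden, kernel_size=1)
        self.mask_head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, spec, image):
        # spec:  (B, 1, F, T) reverberant magnitude spectrogram
        # image: (B, 3, H, W) RGB view of the environment
        a = self.audio_enc(spec)                              # (B, 64, F, T)
        v = self.visual_enc(image)                            # (B, visual_dim)
        v = v[:, :, None, None].expand(-1, -1, a.size(2), a.size(3))
        fused = torch.relu(self.fuse(torch.cat([a, v], dim=1)))
        mask = torch.sigmoid(self.mask_head(fused))           # per-bin mask
        return mask * spec                                    # estimated clean magnitude
```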
Related papers
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
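The two-stage structure described above (a transformer that predicts clean mel-spectrograms, followed by neural vocoding) could be sketched roughly as follows; the module names, shapes, and lip-feature inputs are illustrative assumptions, not the LA-VocE implementation.

```python
# Illustrative two-stage pipeline: (1) a transformer maps noisy audio-visual
# features to enhanced mel-spectrogram frames, (2) a separate neural vocoder
# turns the predicted mels into a waveform. Not the LA-VocE implementation.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    def __init__(self, n_mels=80, visual_dim=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels + visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mels, lip_feats):
        # noisy_mels: (B, T, 80) noisy mel frames
        # lip_feats:  (B, T, 512) frame-aligned lip-region features
        x = self.proj(torch.cat([noisy_mels, lip_feats], dim=-1))
        return self.out(self.encoder(x))      # (B, T, 80) enhanced mel frames

# Stage 2 would pass the predicted mels to a pretrained neural vocoder
# (e.g. a HiFi-GAN-style generator) to synthesize the clean waveform.
```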
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
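The re-synthesis idea above (predicting the discrete codes of a neural speech codec from audio-visual input, then decoding them back to a clean waveform) might be framed like the sketch below; the code-predictor design and codebook size are assumptions made only for illustration.

```python
# Hypothetical code-prediction head: classify each frame into one of K
# codebook entries of a neural speech codec; a pretrained codec decoder would
# then re-synthesize clean speech from the predicted codes. Illustration only.
import torch
import torch.nn as nn

class CodePredictor(nn.Module):
    def __init__(self, feat_dim=768, num_codes=1024):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, num_codes)

    def forward(self, av_feats):
        # av_feats: (B, T, feat_dim) fused audio-visual features
        h, _ = self.rnn(av_feats)
        return self.head(h)   # (B, T, num_codes) logits over codec codes

# At inference, logits.argmax(-1) gives one code index per frame, which a
# pretrained codec decoder would turn back into a waveform.
```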
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
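One simple way to think about this task is predicting a room impulse response from the target-environment image and convolving it with the source audio. The sketch below illustrates only that framing, with assumed shapes and a placeholder image backbone; it is not the paper's model.

```python
# Simplified framing of visual acoustic matching: predict a room impulse
# response (RIR) from the target-environment image, then convolve the dry
# source audio with it. Purely illustrative; not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageToRIR(nn.Module):
    def __init__(self, rir_len=16000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_rir = nn.Linear(32, rir_len)

    def forward(self, image, source_wav):
        # image: (B, 3, H, W) target environment, source_wav: (B, L) dry speech
        rir = self.to_rir(self.backbone(image))              # (B, rir_len)
        B, L = source_wav.shape
        # Convolve each source with its predicted RIR via a grouped 1-D conv
        # (kernel flipped so cross-correlation becomes true convolution).
        out = F.conv1d(source_wav.view(1, B, L),
                       rir.flip(-1).view(B, 1, -1),
                       padding=rir.size(1) - 1, groups=B)
        return out.view(B, -1)[:, :L]                         # re-synthesized audio
```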
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
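The joint objective described above (an audio-only pretext loss and a visual generation loss driven by the same raw-audio encoder) could be combined along the lines of the schematic below; every module and target here is a placeholder assumption, not the paper's models.

```python
# Schematic joint self-supervision: one raw-waveform encoder feeds two pretext
# heads, an audio-attribute predictor and a crude talking-face generator, and
# the two losses are summed. Placeholder modules only; not the paper's models.
import torch
import torch.nn as nn

class JointAVSelfSupervision(nn.Module):
    def __init__(self, emb_dim=256, n_attrs=4, face_pixels=64 * 64 * 3):
        super().__init__()
        # Raw-waveform encoder shared by both pretext tasks.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, emb_dim),
        )
        self.attr_head = nn.Linear(emb_dim, n_attrs)      # e.g. energy/pitch stats
        self.face_head = nn.Linear(emb_dim, face_pixels)  # toy face "generator"

    def forward(self, wav, target_attrs, target_face):
        # wav: (B, 1, L) raw audio; target_attrs: (B, n_attrs);
        # target_face: (B, 3, 64, 64) reference face frame
        z = self.audio_enc(wav)
        loss_audio = nn.functional.mse_loss(self.attr_head(z), target_attrs)
        loss_visual = nn.functional.mse_loss(
            self.face_head(z).view_as(target_face), target_face)
        return loss_audio + loss_visual   # joint audio-visual pretext loss
```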
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.