AdVerb: Visually Guided Audio Dereverberation
- URL: http://arxiv.org/abs/2308.12370v1
- Date: Wed, 23 Aug 2023 18:20:59 GMT
- Title: AdVerb: Visually Guided Audio Dereverberation
- Authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton
Ratnarajah, Utkarsh Tyagi and Dinesh Manocha
- Abstract summary: We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
- Score: 49.958724234969445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present AdVerb, a novel audio-visual dereverberation framework that uses
visual cues in addition to the reverberant sound to estimate clean audio.
Although audio-only dereverberation is a well-studied problem, our approach
incorporates the complementary visual modality to perform audio
dereverberation. Given an image of the environment where the reverberated sound
signal has been recorded, AdVerb employs a novel geometry-aware cross-modal
transformer architecture that captures scene geometry and the audio-visual
cross-modal relationship to generate a complex ideal ratio mask which, when
applied to the reverberant audio, predicts the clean sound. The effectiveness of
our method is demonstrated through extensive quantitative and qualitative
evaluations. Our approach significantly outperforms traditional audio-only and
audio-visual baselines on three downstream tasks: speech enhancement, speech
recognition, and speaker verification, with relative improvements in the range
of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly
satisfactory RT60 error scores on the AVSpeech dataset.
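As a rough illustration (not the authors' implementation), the sketch below shows how a predicted complex ideal ratio mask can be applied to the short-time Fourier transform of a reverberant signal to estimate the clean audio. The waveform, STFT parameters, and the placeholder identity mask are assumptions made purely for the example; in AdVerb the mask would come from the geometry-aware cross-modal transformer.

```python
# Minimal sketch of complex-ratio-mask dereverberation (assumed workflow, not AdVerb's code).
import numpy as np
from scipy.signal import stft, istft

fs = 16000
reverb = np.random.randn(fs)            # placeholder 1-second reverberant waveform

# Reverberant complex spectrogram.
_, _, Y = stft(reverb, fs=fs, nperseg=512)

# Hypothetical complex ideal ratio mask of the same shape; a real system
# predicts the real and imaginary parts per time-frequency bin.
cirm = np.ones_like(Y)                  # identity mask as a stand-in

S_hat = cirm * Y                        # element-wise complex multiplication
_, clean_estimate = istft(S_hat, fs=fs, nperseg=512)
```

The key point is that the mask is complex-valued, so it can adjust both the magnitude and the phase of each time-frequency bin, unlike a purely real magnitude mask.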
Related papers
- AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied field of automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)