Egocentric Audio-Visual Noise Suppression
- URL: http://arxiv.org/abs/2211.03643v2
- Date: Wed, 3 May 2023 02:34:08 GMT
- Title: Egocentric Audio-Visual Noise Suppression
- Authors: Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu and
Kaustubh Kalgaonkar
- Abstract summary: This paper studies audio-visual noise suppression for egocentric videos.
The video camera emulates the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
- Score: 11.113020254726292
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper studies audio-visual noise suppression for egocentric videos --
where the speaker is not captured in the video. Instead, potential noise
sources are visible on screen with the camera emulating the off-screen
speaker's view of the outside world. This setting is different from prior work
in audio-visual speech enhancement that relies on lip and facial visuals. In
this paper, we first demonstrate that egocentric visual information is helpful
for noise suppression. We compare object recognition and action
classification-based visual feature extractors and investigate methods to align
audio and visual representations. Then, we examine different fusion strategies
for the aligned features, and locations within the noise suppression model to
incorporate visual information. Experiments demonstrate that visual features
are most helpful when used to generate additive correction masks. Finally, in
order to ensure that the visual features are discriminative with respect to
different noise types, we introduce a multi-task learning framework that
jointly optimizes audio-visual noise suppression and video-based acoustic event
detection. This proposed multi-task framework outperforms the audio-only
baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations
reveal the improved performance of the proposed model with multiple active
distractors, over all noise types, and across different SNRs.
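The abstract outlines three ingredients: visual features from an object-recognition or action-classification backbone are aligned to the audio frames, fused into the noise suppression model as an additive correction to the audio-only mask, and regularized by a jointly trained video-based acoustic event detection head. Below is a minimal PyTorch-style sketch of that pipeline; the module names, feature dimensions, and loss weighting are illustrative assumptions, not the authors' implementation.

    # Sketch only: visual features are interpolated to the audio frame rate,
    # added as a correction to the audio-only mask logits, and a second head
    # performs video-based acoustic event detection for the multi-task loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EgocentricAVSuppressor(nn.Module):  # hypothetical name, not from the paper
        def __init__(self, n_freq=257, d_audio=256, d_visual=512, n_events=10):
            super().__init__()
            self.audio_enc = nn.GRU(n_freq, d_audio, batch_first=True)
            self.visual_proj = nn.Linear(d_visual, d_audio)  # align feature dimensions
            self.audio_mask = nn.Linear(d_audio, n_freq)     # audio-only mask logits
            self.visual_corr = nn.Linear(d_audio, n_freq)    # additive correction from visuals
            self.event_head = nn.Linear(d_audio, n_events)   # video-based acoustic event detection

        def forward(self, noisy_mag, visual_feats):
            # noisy_mag: (B, T_audio, n_freq) noisy magnitude spectrogram
            # visual_feats: (B, T_video, d_visual) from a pretrained object/action model
            a, _ = self.audio_enc(noisy_mag)
            # Temporal alignment: interpolate video features to the audio frame rate.
            v = self.visual_proj(visual_feats).transpose(1, 2)            # (B, d_audio, T_video)
            v = F.interpolate(v, size=a.size(1), mode="linear", align_corners=False)
            v = v.transpose(1, 2)                                         # (B, T_audio, d_audio)
            # Fusion: the visual branch adds a correction to the audio-only mask logits.
            mask = torch.sigmoid(self.audio_mask(a) + self.visual_corr(v))
            enhanced = mask * noisy_mag
            # Auxiliary task: clip-level acoustic event logits from the visual stream.
            event_logits = self.event_head(v.mean(dim=1))
            return enhanced, event_logits

    def multitask_loss(enhanced, clean_mag, event_logits, event_labels, alpha=0.1):
        # Joint objective: spectrogram reconstruction plus (multi-label) event detection.
        return F.l1_loss(enhanced, clean_mag) + alpha * F.binary_cross_entropy_with_logits(
            event_logits, event_labels
        )

The additive formulation lets the visual branch nudge rather than replace the audio-derived mask, which matches the abstract's observation that visual features are most helpful when used to generate additive correction masks.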
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [29.05833230733178]
We propose a Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the noisy input speech with the help of audio-visual correspondence.
The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise reduction mask by considering the obtained visual context.
The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
arXiv Detail & Related papers (2022-07-13T08:07:19Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most of the information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)