Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
- URL: http://arxiv.org/abs/2202.06406v1
- Date: Sun, 13 Feb 2022 21:06:19 GMT
- Title: Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
- Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
- Abstract summary: In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
- Score: 90.21476231683008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of audio-visual sound source localization has been well studied
under constrained scenes, where the audio recordings are clean. However, in
real-world scenarios, audio is usually contaminated by off-screen sounds and
background noise, which interfere with identifying the desired sources and
building visual-sound connections, rendering previous methods inapplicable. In
this work, we propose the Interference Eraser (IEr)
framework, which tackles the problem of audio-visual sound source localization
in the wild. The key idea is to eliminate the interference by redefining and
carving discriminative audio representations. Specifically, we observe that the
previous practice of learning only a single audio representation is
insufficient due to the additive nature of audio signals. We thus extend the
audio representation with our Audio-Instance-Identifier module, which clearly
distinguishes sounding instances when audio signals of different volumes are
unevenly mixed. Then we erase the influence of the audible but off-screen
sounds and the silent but visible objects by a Cross-modal Referrer module with
cross-modality distillation. Quantitative and qualitative evaluations
demonstrate that our proposed framework achieves superior results on sound
localization tasks, especially under real-world scenarios. Code is available at
https://github.com/alvinliu0/Visual-Sound-Localization-in-the-Wild.
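To make the "additive nature of audio signals" concrete, below is a minimal, illustrative sketch (not the authors' code; the source signals, volumes, and frequencies are invented) of why an unevenly mixed waveform defeats a single global audio representation: the mixture is one signal, yet it carries evidence of several instances at different scales.

```python
# Minimal sketch (NOT the IEr implementation): two synthetic sources mixed
# additively at uneven volumes, as described in the abstract.
import numpy as np

sr = 16000                                     # sample rate (Hz), illustrative
t = np.linspace(0.0, 1.0, sr, endpoint=False)  # one second of audio

on_screen = np.sin(2 * np.pi * 440.0 * t)      # stand-in "visible" source
off_screen = np.sin(2 * np.pi * 900.0 * t)     # stand-in "off-screen" source

# Additive, uneven mixing: the off-screen interference dominates.
mixture = 0.2 * on_screen + 0.9 * off_screen

# Any single pooled statistic of the mixture blurs the two instances
# together; separating the evidence (here, trivially, by frequency bin;
# in IEr, by learned per-instance identifiers) recovers each source's
# contribution.
spectrum = np.abs(np.fft.rfft(mixture)) / len(t)
freqs = np.fft.rfftfreq(len(t), d=1.0 / sr)
for f0 in (440.0, 900.0):
    k = np.argmin(np.abs(freqs - f0))
    print(f"{f0:6.1f} Hz magnitude: {spectrum[k]:.3f}")  # ~0.1 vs ~0.45
```

The printed magnitudes scale with the mixing volumes (0.2 and 0.9 halve to roughly 0.1 and 0.45), which is exactly the per-instance evidence that a single mixture-level embedding would collapse.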
Related papers
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z) - Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms state-of-the-art approaches in both sound source localization and cross-modal retrieval; a generic sketch of such cross-modal alignment follows this entry.
arXiv Detail & Related papers (2023-09-19T16:04:50Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation
Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z) - Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones; a minimal correspondence-heatmap sketch follows this entry.
arXiv Detail & Related papers (2021-12-22T09:34:33Z) - Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of
- Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds [33.4237979175049]
We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data.
arXiv Detail & Related papers (2020-11-02T17:36:13Z) - Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public sound localization dataset.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)