Localizing Visual Sounds the Easy Way
- URL: http://arxiv.org/abs/2203.09324v1
- Date: Thu, 17 Mar 2022 13:52:58 GMT
- Title: Localizing Visual Sounds the Easy Way
- Authors: Shentong Mo, Pedro Morgado
- Abstract summary: Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.
We propose EZ-VSL, a simple approach that does not rely on the construction of positive and/or negative regions during training.
Our framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source.
- Score: 26.828874753756523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised audio-visual source localization aims at localizing visible
sound sources in a video without relying on ground-truth localization for
training. Previous works often seek high audio-visual similarities for likely
positive (sounding) regions and low similarities for likely negative regions.
However, accurately distinguishing between sounding and non-sounding regions is
challenging without manual annotations. In this work, we propose a simple yet
effective approach for Easy Visual Sound Localization, namely EZ-VSL, without
relying on the construction of positive and/or negative regions during
training. Instead, we align audio and visual spaces by seeking audio-visual
representations that are aligned in at least one location of the associated
image, while not matching other images at any location. We also introduce a
novel object guided localization scheme at inference time for improved
precision. Our simple and effective framework achieves state-of-the-art
performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source. In
particular, we improve the CIoU of the Flickr SoundNet test set from 76.80% to
83.94%, and on the VGG-Sound Source dataset from 34.60% to 38.85%. The code is
available at https://github.com/stoneMo/EZ-VSL.
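To make the alignment objective concrete: each audio clip should match its paired image at a best-matching location (a max over spatial positions), while being pushed away from every location of the other images in a batch. The following PyTorch sketch illustrates such a multiple-instance contrastive loss; the function name, tensor shapes, and temperature are illustrative assumptions, not the authors' released EZ-VSL code (see the repository above for their implementation).

```python
# Minimal sketch of a multiple-instance contrastive loss in the spirit of
# EZ-VSL. All names and shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def ezvsl_loss(audio_emb, visual_maps, temperature=0.07):
    """audio_emb:   (B, D)       one embedding per audio clip
       visual_maps: (B, D, H, W) spatial visual feature maps

    Each audio should match *at least one* location of its paired image
    (max over locations) and no location of any other image in the batch.
    """
    a = F.normalize(audio_emb, dim=1)                  # (B, D)
    v = F.normalize(visual_maps.flatten(2), dim=1)     # (B, D, H*W)

    # Similarity of every audio with every location of every image: (B, B, H*W)
    sim = torch.einsum('ad,bdl->abl', a, v)

    # Max over spatial locations -> image-level logits (B, B)
    logits = sim.max(dim=-1).values / temperature

    # Diagonal entries are the true audio-image pairs
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```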
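The object-guided localization scheme at inference can likewise be pictured as blending the audio-visual similarity map with a generic object prior. A minimal sketch, assuming the prior is an objectness map obtained from a pre-trained image model and that the two maps are combined with a fixed weight `alpha` (both assumptions on our part):

```python
# Hypothetical sketch of object-guided localization at inference time.
import torch
import torch.nn.functional as F

def localize(audio_emb, visual_map, objectness, alpha=0.4):
    """audio_emb:  (D,)      embedding of the query audio
       visual_map: (D, H, W) spatial features of the image
       objectness: (H, W)    object prior, e.g. from a pre-trained model

    Returns an (H, W) localization map in [0, 1].
    """
    a = F.normalize(audio_emb, dim=0)
    v = F.normalize(visual_map, dim=0)
    avl = torch.einsum('d,dhw->hw', a, v)      # audio-visual similarity map

    def minmax(x):                             # rescale to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    return alpha * minmax(avl) + (1 - alpha) * minmax(objectness)
```

The blending weight and the choice of object prior are hyperparameters here; the repository linked above documents the authors' actual settings.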
Related papers
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks.
The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
As a result, our method performs sound source localization more robustly.
arXiv Detail & Related papers (2023-08-11T11:57:58Z)
- Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment [22.912401512161132]
We design a model that schedules the learning procedure of each component to associate the audio and visual modalities.
We translate the input audio to visual features, then use a pre-trained generator to produce an image.
We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches.
arXiv Detail & Related papers (2023-03-30T16:01:50Z)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy, False Negative Aware Contrastive (FNAC) learning, to mitigate the problem of false negative samples misleading training.
FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Localizing Visual Sounds the Hard Way [149.84890978170174]
We train the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound.
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
We introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset.
arXiv Detail & Related papers (2021-04-06T17:38:18Z)
- Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public localization dataset.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)