Learning from Silence and Noise for Visual Sound Source Localization
- URL: http://arxiv.org/abs/2508.21761v1
- Date: Fri, 29 Aug 2025 16:36:16 GMT
- Title: Learning from Silence and Noise for Visual Sound Source Localization
- Authors: Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro
- Abstract summary: We propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models. We present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio.
- Score: 10.906490052260189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/.
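The abstract says the training strategy incorporates silence and noise but does not say how these clips are built or how they enter the loss. As a rough illustration only, the sketch below shows one way such negative audio could be generated and mixed into a batch of waveforms. It is not the authors' implementation: the helper names `make_negative_audio` and `augment_batch`, the Gaussian noise model, and the parameters `noise_std` and `p_negative` are all assumptions made for this example.

```python
import random


def make_negative_audio(num_samples, kind, rng, noise_std=0.1):
    """Return a waveform (list of floats) of silence or Gaussian noise.

    Hypothetical helper: the noise model and noise_std are assumptions.
    """
    if kind == "silence":
        return [0.0] * num_samples
    if kind == "noise":
        return [rng.gauss(0.0, noise_std) for _ in range(num_samples)]
    raise ValueError(f"unknown kind: {kind}")


def augment_batch(waveforms, rng, p_negative=0.25):
    """Replace each waveform with negative audio with probability p_negative.

    Returns the new batch and a parallel list of booleans that marks
    which items were replaced by silence or noise.
    """
    batch, is_negative = [], []
    for wav in waveforms:
        if rng.random() < p_negative:
            kind = rng.choice(["silence", "noise"])
            batch.append(make_negative_audio(len(wav), kind, rng))
            is_negative.append(True)
        else:
            batch.append(list(wav))  # copy so the input is left untouched
            is_negative.append(False)
    return batch, is_negative


if __name__ == "__main__":
    rng = random.Random(0)
    waves = [[1.0] * 8 for _ in range(4)]
    batch, mask = augment_batch(waves, rng, p_negative=0.5)
    print(mask)
```

A training loop could use the returned mask to treat the replaced items as negative audio-visual pairs; how they are weighted in the objective is left open here because the abstract does not specify it.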
Related papers
- Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? [38.98706069359109]
We introduce AVSBench-Robust, a benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. Our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates.
arXiv Detail & Related papers (2025-02-01T07:40:29Z)
- A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio [5.728456310555323]
This paper introduces a novel test set and metrics designed to complete the current standard evaluation of Visual Sound Source localization models. We consider three types of negative audio: silence, noise and offscreen. Our analysis reveals that numerous SOTA models fail to appropriately adjust their predictions based on audio input.
arXiv Detail & Related papers (2024-10-01T19:28:45Z)
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize objects emitting the sound in visual scenes.
Recent works obtaining impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities makes it possible to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- A Closer Look at Weakly-Supervised Audio-Visual Source Localization [26.828874753756523]
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video.
We extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples.
We also propose a new approach for visual sound source localization that addresses both these problems.
arXiv Detail & Related papers (2022-08-30T14:17:46Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new AVSOL problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.