Enhancing Sound Source Localization via False Negative Elimination
- URL: http://arxiv.org/abs/2408.16448v1
- Date: Thu, 29 Aug 2024 11:24:51 GMT
- Title: Enhancing Sound Source Localization via False Negative Elimination
- Authors: Zengjie Song, Jiangshe Zhang, Yuxi Wang, Junsong Fan, Zhaoxiang Zhang
- Abstract summary: Sound source localization aims to localize sound-emitting objects in visual scenes.
Recent works obtaining impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
- Score: 58.87973081084927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound source localization aims to localize sound-emitting objects in visual scenes. Recent works that obtain impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior art can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. The resulting misalignment of audio and visual features can yield inferior performance. To address this issue, we propose a novel audio-visual learning framework that is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, and introduces a predictive coding module for feature alignment to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method that eliminates false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing a reliable visual anchor and reliable audio negatives for contrast. Unlike SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative route to the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
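The false-negative problem described above can be made concrete with a small sketch: an InfoNCE-style contrastive loss in which in-batch negatives whose audio embedding is semantically close to the positive audio are masked out of the denominator. This is a minimal illustration of the general idea, not the authors' implementation; the function name, the audio-audio similarity proxy, and the threshold `sem_thresh` are all assumptions.

```python
import torch
import torch.nn.functional as F

def masked_infonce(vis_feat, aud_feat, sem_thresh=0.7, temperature=0.07):
    """Illustrative InfoNCE loss that drops suspected false negatives.

    vis_feat, aud_feat: (B, D) embeddings of paired visual anchors and
    audio clips. In-batch negatives whose audio embedding is too similar
    to the positive audio (cosine similarity above `sem_thresh`) are
    treated as likely false negatives and removed from the denominator.
    """
    v = F.normalize(vis_feat, dim=1)
    a = F.normalize(aud_feat, dim=1)
    logits = v @ a.t() / temperature          # (B, B) visual-audio similarities

    # Audio-audio cosine similarity as a proxy for semantic closeness.
    aa_sim = a @ a.t()
    false_neg = aa_sim > sem_thresh           # suspected false negatives
    false_neg.fill_diagonal_(False)           # never mask the true positive

    # Masked entries contribute nothing to the softmax denominator.
    logits = logits.masked_fill(false_neg, float('-inf'))
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)
```

Because the diagonal (true positive) is never masked, every row retains at least one finite logit and the loss stays well defined even when many negatives are filtered out.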
Related papers
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with false negative samples.
FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- Contrastive Positive Sample Propagation along the Audio-Visual Event Line [24.007548531642716]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
It is pivotal to learn the discriminative features for each video segment.
We propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning.
arXiv Detail & Related papers (2022-11-18T01:55:45Z)
- MarginNCE: Robust Sound Localization with a Negative Margin [23.908770938403503]
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach.
We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
arXiv Detail & Related papers (2022-11-03T16:44:14Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while treating randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- FREE: Feature Refinement for Generalized Zero-Shot Learning [86.41074134041394]
Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of visual-semantic domain gap and seen-unseen bias.
Most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the cross-dataset bias between ImageNet and GZSL benchmarks.
We propose a simple yet effective GZSL method, termed feature refinement for generalized zero-shot learning (FREE) to tackle the above problem.
arXiv Detail & Related papers (2021-07-29T08:11:01Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
- Whitening for Self-Supervised Representation Learning [129.57407186848917]
We propose a new loss function for self-supervised representation learning (SSL) based on the whitening of latent-space features.
Our solution does not require asymmetric networks and is conceptually simple.
arXiv Detail & Related papers (2020-07-13T12:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.