Learning Audio-Visual Source Localization via False Negative Aware
Contrastive Learning
- URL: http://arxiv.org/abs/2303.11302v2
- Date: Sat, 25 Mar 2023 13:44:25 GMT
- Title: Learning Audio-Visual Source Localization via False Negative Aware
Contrastive Learning
- Authors: Weixuan Sun and Jiayi Zhang and Jianyuan Wang and Zheyuan Liu and
Yiran Zhong and Tianpeng Feng and Yandong Guo and Yanhao Zhang and Nick
Barnes
- Abstract summary: We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of false negative samples misleading the training.
FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench.
- Score: 39.890616126301204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised audio-visual source localization aims to locate sound-source
objects in video frames without extra annotations. Recent methods often
approach this goal with the help of contrastive learning, which assumes only
the audio and visual contents from the same video are positive samples for each
other. However, this assumption would suffer from false negative samples in
real-world training. For example, for an audio sample, treating the frames from
the same audio class as negative samples may mislead the model and therefore
harm the learned representations (e.g., the audio of a siren wailing may
reasonably correspond to the ambulances in multiple images). Based on this
observation, we propose a new learning strategy named False Negative Aware
Contrastive (FNAC) to mitigate the problem of misleading the training with such
false negative samples. Specifically, we utilize the intra-modal similarities
to identify potentially similar samples and construct corresponding adjacency
matrices to guide contrastive learning. Further, we propose to strengthen the
role of true negative samples by explicitly leveraging the visual features of
sound sources to facilitate the differentiation of authentic sounding source
regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet,
VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in
mitigating the false negative issue. The code is available at
https://github.com/OpenNLPLab/FNAC_AVL.
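The abstract describes two mechanisms: adjacency matrices built from intra-modal similarities that soften the penalty on likely false negatives, and an explicit strengthening of true negatives. Below is a minimal PyTorch sketch of the first mechanism; the function name `fnac_weighted_infonce`, the averaging of the two intra-modal similarity matrices, and the soft-target weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fnac_weighted_infonce(audio_emb, visual_emb, tau=0.07):
    """Contrastive loss that softens the penalty on likely false negatives.

    audio_emb, visual_emb: (B, D) embeddings of paired audio clips and frames.
    Intra-modal similarities form an adjacency matrix; samples that look
    alike within a modality are treated as potential false negatives and
    receive part of the target probability mass instead of being repelled.
    """
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / tau  # (B, B) cross-modal similarity logits

    with torch.no_grad():
        # Intra-modal adjacency (diagonal removed): average of audio-audio
        # and visual-visual similarities.
        adj = 0.5 * (a @ a.t() + v @ v.t())
        adj.fill_diagonal_(0.0)
        # Soft targets: the true pair keeps most of the mass; samples
        # flagged by the adjacency matrix absorb the rest.
        targets = torch.eye(len(a), device=a.device) + adj.clamp(min=0.0)
        targets = targets / targets.sum(dim=1, keepdim=True)

    loss_av = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_va = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_av + loss_va)
```

With the adjacency term removed this reduces to the standard symmetric InfoNCE objective assumed by the prior work cited in the abstract; the second mechanism (true-negative enhancement) would additionally reweight the genuinely dissimilar pairs.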
Related papers
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize the objects emitting sound in visual scenes.
Recent works obtaining impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z) - MarginNCE: Robust Sound Localization with a Negative Margin [23.908770938403503]
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach.
We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
arXiv Detail & Related papers (2022-11-03T16:44:14Z) - Self-Supervised Predictive Learning: A Negative-Free Method for Sound
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z) - Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while treating randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can yield response maps similar to those of the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use pseudo-labels, derived from the previous iteration's localization predictions, to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
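A rough sketch of one round of such an iterative scheme; the model API (`localize`, `visual_features`, `audio_embed`) and the mean-threshold pseudo-labeling rule are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def iterative_round(model, loader, optimizer, tau=0.07):
    """One round of an iterative pseudo-label scheme (hypothetical API).

    Assumes a deterministic, non-shuffled loader so that the harvested
    masks stay aligned with the batches in the second pass.
    """
    # Stage 1: freeze the model and harvest pseudo-labels, i.e. binarized
    # sounding-region masks from the previous round's localization maps.
    model.eval()
    pseudo_masks = []
    with torch.no_grad():
        for frames, audio in loader:
            heatmap = model.localize(frames, audio)        # (B, H, W)
            pseudo_masks.append(heatmap > heatmap.mean())  # crude threshold

    # Stage 2: retrain, pooling visual features only over pseudo-labeled
    # sounding regions so non-sounding regions decorrelate from the audio.
    model.train()
    for (frames, audio), mask in zip(loader, pseudo_masks):
        v_feat = model.visual_features(frames)             # (B, D, H, W)
        a_emb = F.normalize(model.audio_embed(audio), dim=1)
        m = mask.unsqueeze(1).float()                      # (B, 1, H, W)
        pooled = (v_feat * m).flatten(2).sum(-1) / m.flatten(2).sum(-1).clamp(min=1)
        v_emb = F.normalize(pooled, dim=1)
        logits = a_emb @ v_emb.t() / tau
        labels = torch.arange(len(a_emb), device=logits.device)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```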
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)