MarginNCE: Robust Sound Localization with a Negative Margin
- URL: http://arxiv.org/abs/2211.01966v1
- Date: Thu, 3 Nov 2022 16:44:14 GMT
- Title: MarginNCE: Robust Sound Localization with a Negative Margin
- Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
- Abstract summary: The goal of this work is to localize sound sources in visual scenes with a self-supervised approach.
We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
- Score: 23.908770938403503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this work is to localize sound sources in visual scenes with a
self-supervised approach. Contrastive learning in the context of sound source
localization leverages the natural correspondence between audio and visual
signals, where audio-visual pairs from the same source are treated as positives,
while randomly selected pairs serve as negatives. However, this approach
introduces noisy correspondences: for example, pairs labeled as positive whose
audio and visual signals are in fact unrelated, or negative pairs that contain
samples semantically similar to the positive one. Our key contribution in this
work is to show that using a less strict decision boundary in contrastive
learning can alleviate the effect of noisy correspondences in sound source
localization. We propose a simple yet effective approach by slightly modifying
the contrastive loss with a negative margin. Extensive experimental results
show that our approach performs on par with or better than the
state-of-the-art methods. Furthermore, we demonstrate that the introduction of
a negative margin to existing methods results in a consistent improvement in
performance.
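The paper's core change is small: an InfoNCE-style contrastive loss in which the positive-pair similarity is offset by a margin, with the margin chosen to be negative so that the decision boundary is relaxed rather than tightened. Below is a minimal PyTorch sketch of that idea; the function name, batch layout, and the margin/temperature values are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def margin_nce_loss(audio_emb: torch.Tensor,
                    visual_emb: torch.Tensor,
                    margin: float = -0.2,
                    temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss with a margin applied to the positive pairs.

    A negative margin (margin < 0) loosens the decision boundary, so the
    model is not forced to sharply separate positives from negatives that
    may be semantically similar -- the noisy-correspondence setting the
    abstract describes. The default values here are illustrative only.
    """
    # L2-normalize so the dot products below are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # (B, B) similarity matrix; entry (i, j) compares audio i with frame j.
    # The i-th audio clip and i-th frame are assumed to come from the same
    # video, so the diagonal holds the positive pairs.
    sim = audio_emb @ visual_emb.t()

    # Offset only the positive (diagonal) similarities by the margin.
    sim = sim + margin * torch.eye(sim.size(0), device=sim.device)

    # Symmetric cross-entropy over both matching directions.
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_a2v = F.cross_entropy(sim / temperature, targets)
    loss_v2a = F.cross_entropy(sim.t() / temperature, targets)
    return 0.5 * (loss_a2v + loss_v2a)
```

With margin = 0 this reduces to the usual symmetric InfoNCE loss, so the relaxation is easy to ablate by sweeping the margin below zero.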
Related papers
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize sound-emitting objects in visual scenes.
Recent works that obtain impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with false negative samples.
FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [34.58856143210749]
We present an approach to learn voice-face representations from talking face videos, without any identity labels.
Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face.
We propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists the adverse effects of false negatives and deviant positives.
arXiv Detail & Related papers (2022-04-28T07:28:56Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by treating corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can yield response maps similar to those of the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art on image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
In each iteration, the framework generates pseudo-labels and uses them to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.