Robust Audio-Visual Instance Discrimination
- URL: http://arxiv.org/abs/2103.15916v1
- Date: Mon, 29 Mar 2021 19:52:29 GMT
- Title: Robust Audio-Visual Instance Discrimination
- Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos
- Abstract summary: We present a self-supervised learning method to learn audio and video representations.
The method limits the impact of faulty positives and faulty negatives in audio-visual instance discrimination and improves transfer learning performance.
- Score: 79.74625434659443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a self-supervised learning method to learn audio and video
representations. Prior work uses the natural correspondence between audio and
video to define a standard cross-modal instance discrimination task, where a
model is trained to match representations from the two modalities. However, the
standard approach introduces two sources of training noise. First, audio-visual
correspondences often produce faulty positives since the audio and video
signals can be uninformative of each other. To limit the detrimental impact of
faulty positives, we optimize a weighted contrastive learning loss, which
down-weighs their contribution to the overall loss. Second, since
self-supervised contrastive learning relies on random sampling of negative
instances, instances that are semantically similar to the base instance can be
used as faulty negatives. To alleviate the impact of faulty negatives, we
propose to optimize an instance discrimination loss with a soft target
distribution that estimates relationships between instances. We validate our
contributions through extensive experiments on action recognition tasks and
show that they address the problems of audio-visual instance discrimination and
improve transfer learning performance.
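The two ideas in the abstract, per-instance weights that down-weigh faulty positives and a soft target distribution that softens faulty negatives, can be combined in one cross-modal loss. The following is a minimal pure-Python sketch, not the authors' implementation. The function name `robust_xid_loss`, its parameters, and the temperature value are hypothetical. The abstract does not say how the weights or the soft targets are estimated, so here the caller supplies both.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def robust_xid_loss(video, audio, weights=None, soft_targets=None,
                    temperature=0.1):
    """Weighted cross-modal instance discrimination loss (illustrative).

    video, audio : lists of embeddings; audio[i] is paired with video[i].
    weights      : per-instance weights; a low weight marks a suspected
                   faulty positive (assumed to be given by the caller).
    soft_targets : soft_targets[i][j] is the target probability that
                   instance j matches instance i; each row sums to 1.
                   Defaults to one-hot targets, i.e. the standard task.
    """
    n = len(video)
    if weights is None:
        weights = [1.0] * n
    if soft_targets is None:
        soft_targets = [[1.0 if i == j else 0.0 for j in range(n)]
                        for i in range(n)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one instance needs a positive weight")

    total = 0.0
    for i in range(n):
        # Video-to-audio and audio-to-video matching over the batch.
        logp_va = log_softmax(
            [dot(video[i], audio[j]) / temperature for j in range(n)])
        logp_av = log_softmax(
            [dot(audio[i], video[j]) / temperature for j in range(n)])
        # Cross-entropy against the (possibly soft) target distribution.
        ce = -sum(t * (a + b)
                  for t, a, b in zip(soft_targets[i], logp_va, logp_av))
        # Faulty positives contribute less through their weight.
        total += weights[i] * ce
    return total / total_weight
```

With `weights` and `soft_targets` left at their defaults, the function reduces to a plain cross-modal instance discrimination loss averaged over the batch. Passing a small weight for a pair whose audio and video are uninformative of each other shrinks that pair's share of the loss, and moving target mass from the diagonal to semantically similar instances stops the loss from treating them as pure negatives.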
Related papers
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize objects emitting the sound in visual scenes.
Recent works obtaining impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to keep false negative samples from misleading training.
FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning [0.22940141855172028]
We propose a novel formulation of contrastive learning using semantic similarity between instances.
Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances.
We show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
arXiv Detail & Related papers (2022-12-21T16:56:55Z)
- MarginNCE: Robust Sound Localization with a Negative Margin [23.908770938403503]
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach.
We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
arXiv Detail & Related papers (2022-11-03T16:44:14Z)
- Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [34.58856143210749]
We present an approach to learn voice-face representations from talking face videos, without any identity labels.
Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face.
We propose the cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists adverse effects of false negatives and deviate positives.
arXiv Detail & Related papers (2022-04-28T07:28:56Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by treating corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z)
- Incremental False Negative Detection for Contrastive Learning [95.68120675114878]
We introduce a novel incremental false negative detection for self-supervised contrastive learning.
During contrastive learning, we discuss two strategies to explicitly remove the detected false negatives.
Our proposed method outperforms other self-supervised contrastive learning frameworks on multiple benchmarks under a limited compute budget.
arXiv Detail & Related papers (2021-06-07T15:29:14Z)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.