Related papers: Robust Audio-Visual Instance Discrimination

Robust Audio-Visual Instance Discrimination

URL: http://arxiv.org/abs/2103.15916v1
Date: Mon, 29 Mar 2021 19:52:29 GMT
Title: Robust Audio-Visual Instance Discrimination
Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos
Abstract summary: We present a self-supervised learning method to learn audio and video representations. We address the problems of audio-visual instance discrimination and improve transfer learning performance.
Score: 79.74625434659443
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weighs their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty negatives, we propose to optimize an instance discrimination loss with a soft target distribution that estimates relationships between instances. We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance.

Related papers

Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with false negative samples. FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning [0.22940141855172028]
We propose a novel formulation of contrastive learning using semantic similarity between instances. Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances. We show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
arXiv Detail & Related papers (2022-12-21T16:56:55Z)
MarginNCE: Robust Sound Localization with a Negative Margin [23.908770938403503]
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. We show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization.
arXiv Detail & Related papers (2022-11-03T16:44:14Z)
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [34.58856143210749]
We present an approach to learn voice-face representations from the talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. We propose the cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists adverse effects of false negatives and deviate positives.
arXiv Detail & Related papers (2022-04-28T07:28:56Z)
Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs. We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views. We show that our approach provides consistent improvements over the state-of-the-art image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z)
Incremental False Negative Detection for Contrastive Learning [95.68120675114878]
We introduce a novel incremental false negative detection for self-supervised contrastive learning. During contrastive learning, we discuss two strategies to explicitly remove the detected false negatives. Our proposed method outperforms other self-supervised contrastive learning frameworks on multiple benchmarks within a limited compute.
arXiv Detail & Related papers (2021-06-07T15:29:14Z)
Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.