Audio-Visual Instance Discrimination with Cross-Modal Agreement
- URL: http://arxiv.org/abs/2004.12943v3
- Date: Mon, 29 Mar 2021 20:14:23 GMT
- Title: Audio-Visual Instance Discrimination with Cross-Modal Agreement
- Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra
- Abstract summary: We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
- Score: 90.95132499006498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves highly competitive performance when finetuned on action recognition tasks. Furthermore, while recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and audio feature spaces. Cross-modal agreement creates better positive and negative sets, which allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances, and achieve significant gains on downstream tasks.
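To make the abstract's two ideas concrete, the sketch below shows (1) a symmetric cross-modal InfoNCE objective, where each video must identify its own audio among all audios in the batch and vice versa, and (2) agreement-based positive-set construction, where instances are grouped as positives only if they are mutually similar in both feature spaces. This is a minimal PyTorch illustration, not the authors' released code; the function names, the temperature of 0.07, and the top-k agreement rule are assumptions standing in for the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """Cross-modal instance discrimination (illustrative sketch).

    video_emb, audio_emb: (N, D) L2-normalized embeddings of the same N clips.
    """
    # (N, N) similarity matrix; diagonal entries are the true pairs.
    logits = video_emb @ audio_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric loss: video -> audio and audio -> video discrimination.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def agreement_positive_mask(video_emb, audio_emb, k=5):
    """Cross-modal agreement (illustrative): instances count as positives
    only if they are top-k neighbors in BOTH the video and audio spaces."""
    n = video_emb.size(0)
    sim_v = video_emb @ video_emb.t()          # within-video similarities
    sim_a = audio_emb @ audio_emb.t()          # within-audio similarities
    idx_v = sim_v.topk(k + 1, dim=1).indices   # +1 keeps each instance itself
    idx_a = sim_a.topk(k + 1, dim=1).indices
    mask_v = torch.zeros(n, n, dtype=torch.bool, device=video_emb.device)
    mask_a = torch.zeros_like(mask_v)
    mask_v.scatter_(1, idx_v, True)
    mask_a.scatter_(1, idx_a, True)
    return mask_v & mask_a                     # (N, N) positive mask

# Usage with stand-in embeddings:
v = F.normalize(torch.randn(8, 128), dim=1)
a = F.normalize(torch.randn(8, 128), dim=1)
loss = cross_modal_nce(v, a)
positives = agreement_positive_mask(v, a, k=2)
```

In the paper, such a positive mask then drives an additional within-modal discrimination term that calibrates visual similarities; that weighting is omitted from this sketch.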
Related papers
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video.
arXiv Detail & Related papers (2023-04-12T04:17:45Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [34.58856143210749]
We present an approach to learn voice-face representations from talking-face videos, without any identity labels.
Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face.
We propose cross-modal prototype contrastive learning (CMPC), which retains the advantages of contrastive methods while resisting the adverse effects of false negatives and deviant positives.
arXiv Detail & Related papers (2022-04-28T07:28:56Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by treating corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can yield response maps similar to those of the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose a novel approach of Compositional Counterfactual Contrastive Learning to develop contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.