On Negative Sampling for Audio-Visual Contrastive Learning from Movies
- URL: http://arxiv.org/abs/2205.00073v1
- Date: Fri, 29 Apr 2022 20:36:13 GMT
- Title: On Negative Sampling for Audio-Visual Contrastive Learning from Movies
- Authors: Mahdi M. Kalayeh, Shervin Ardeshir, Lingyi Liu, Nagendra Kamath, Ashok
Chandrashekar
- Abstract summary: We study the efficacy of audio-visual self-supervised learning from uncurated long-form content, i.e., movies.
Our empirical findings suggest that, with certain modifications, training on uncurated long-form videos yields representations that transfer competitively with the state-of-the-art.
- Score: 12.967364755951722
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The abundance and ease of utilizing sound, along with the fact that auditory
clues reveal a plethora of information about what happens in a scene, make the
audio-visual space an intuitive choice for representation learning. In this
paper, we explore the efficacy of audio-visual self-supervised learning from
uncurated long-form content, i.e., movies. Studying its differences from
conventional short-form content, we identify a non-i.i.d. distribution of data,
driven by the nature of movies. Specifically, we find long-form content to
naturally contain a diverse set of semantic concepts (semantic diversity),
many of which, such as main characters and environments, reappear
frequently throughout the movie (reoccurring semantic concepts). In
addition, movies often contain content-exclusive artistic artifacts, such as
color palettes or thematic music, which are strong signals for uniquely
distinguishing a movie (non-semantic consistency). Capitalizing on these
observations, we comprehensively study the effect of emphasizing within-movie
negative sampling in a contrastive learning setup. Our view differs from
that of prior works, which consider within-video positive sampling, inspired by
the notion of semantic persistence over time, and operate in a short-video
regime. Our empirical findings suggest that, with certain modifications,
training on uncurated long-form videos yields representations which transfer
competitively with the state-of-the-art to a variety of action recognition and
audio classification tasks.
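To make the idea of emphasizing within-movie negatives concrete, below is a minimal PyTorch sketch, not the authors' implementation: a symmetric audio-visual NCE loss in which negatives drawn from the same movie as the anchor clip are up-weighted in the softmax denominator. The function name, the `within_movie_weight` knob, and the batch layout (clips tagged with movie IDs) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of an audio-visual contrastive loss
# that emphasizes within-movie negatives. Assumes a batch of clip embeddings
# from both modalities plus a movie ID per clip; `within_movie_weight` is a
# hypothetical knob controlling how strongly same-movie negatives count.
import torch
import torch.nn.functional as F

def av_nce_with_within_movie_negatives(video_emb, audio_emb, movie_ids,
                                       temperature=0.07, within_movie_weight=2.0):
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                # (B, B) cross-modal similarities

    same_movie = movie_ids.unsqueeze(0) == movie_ids.unsqueeze(1)
    positives = torch.eye(len(movie_ids), dtype=torch.bool, device=logits.device)

    # Up-weight negatives that come from the same movie as the anchor clip
    # (the hardest ones: recurring characters, sets, thematic music).
    # Positives are excluded from the mask, so they keep weight 1.
    neg_weight = torch.ones_like(logits)
    neg_weight[same_movie & ~positives] = within_movie_weight
    weighted = logits + neg_weight.log()            # weights exp(logits) in the softmax

    targets = torch.arange(len(movie_ids), device=logits.device)
    # Symmetric (video->audio and audio->video) cross-entropy, as in standard NCE.
    return 0.5 * (F.cross_entropy(weighted, targets) +
                  F.cross_entropy(weighted.t(), targets))

# Example: 8 clips sampled from 2 movies (4 clips each).
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
movie_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = av_nce_with_within_movie_negatives(video_emb, audio_emb, movie_ids)
```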
Related papers
- Towards Contrastive Learning in Music Video Domain [46.29203572184694]
We create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss (a minimal sketch of such a dual-encoder setup appears after this list).
For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset.
Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks.
arXiv Detail & Related papers (2023-09-01T09:08:21Z) - Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to map alternate audio tracks, which differ only in their speech, to representations similar to that of the shared video.
arXiv Detail & Related papers (2023-04-12T04:17:45Z) - Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimensions inherent in videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z) - Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts ASR transcripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z) - Learning from Untrimmed Videos: Self-Supervised Video Representation
Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging the more abundant information in untrimmed videos.
Our method, HiCo, generates stronger representations on untrimmed videos and also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z) - Watching Too Much Television is Good: Self-Supervised Audio-Visual
Representation Learning from Movies and TV Shows [6.247268652296234]
We study the efficacy of learning from movies and TV shows as a form of uncurated data for audio-visual self-supervised learning.
We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, dramatically outperforms more complex methods.
arXiv Detail & Related papers (2021-06-16T02:00:11Z) - Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study in which we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
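As a rough illustration of the dual-encoder setup with a bidirectional contrastive loss mentioned in the music-video entry above, here is a minimal PyTorch sketch; the backbone stand-ins, feature dimensions, and temperature are illustrative assumptions, not taken from that paper.

```python
# Minimal sketch of a dual-encoder architecture: separate audio and video
# towers projected into a shared embedding space and trained with a
# bidirectional (symmetric) contrastive loss. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, audio_dim=128, video_dim=2048, embed_dim=256):
        super().__init__()
        # Stand-ins for real backbones (e.g. an audio CNN and a video network).
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))
        self.video_proj = nn.Sequential(nn.Linear(video_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def bidirectional_contrastive_loss(a, v, temperature=0.1):
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # Audio-to-video and video-to-audio cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features for a batch of 16 clips.
model = DualEncoder()
a, v = model(torch.randn(16, 128), torch.randn(16, 2048))
loss = bidirectional_contrastive_loss(a, v)
```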
This list is automatically generated from the titles and abstracts of the papers in this site.