Watching Too Much Television is Good: Self-Supervised Audio-Visual
Representation Learning from Movies and TV Shows
- URL: http://arxiv.org/abs/2106.08513v1
- Date: Wed, 16 Jun 2021 02:00:11 GMT
- Title: Watching Too Much Television is Good: Self-Supervised Audio-Visual
Representation Learning from Movies and TV Shows
- Authors: Mahdi M. Kalayeh, Nagendra Kamath, Lingyi Liu and Ashok Chandrashekar
- Abstract summary: We study the efficacy of learning from Movies and TV Shows as forms of uncurated data for audio-visual self-supervised learning.
We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, dramatically outperforms more complex methods.
- Score: 6.247268652296234
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The abundance and ease of utilizing sound, along with the fact that
auditory cues reveal so much about what happens in a scene, make the
audio-visual space a perfectly intuitive choice for self-supervised
representation learning. However, the current literature suggests that training
on uncurated data yields considerably poorer representations than the curated
alternatives collected in a supervised manner, and that the gap narrows only
when the volume of data increases significantly. Furthermore, the quality of
the learned representations is known to be heavily influenced by the size and
taxonomy of the curated datasets used for self-supervised training. This raises
the question of whether we are celebrating too early on catching up with
supervised learning, when our self-supervised efforts still rely almost
exclusively on curated data. In this paper, we study the efficacy of learning
from Movies and TV Shows as forms of uncurated data for audio-visual
self-supervised learning. We demonstrate that a simple model based on
contrastive learning, trained on a collection of movies and TV shows, not only
dramatically outperforms more complex methods trained on orders-of-magnitude
larger uncurated datasets, but also performs very competitively with the
state-of-the-art that learns from large-scale curated data. We identify that
audiovisual patterns such as the appearance of the main character or prominent
scenes and mise-en-scène, which recur throughout a movie, lead to an
overabundance of easy negative instances in the contrastive learning
formulation. Capitalizing on this observation, we propose a hierarchical
sampling policy which, despite its simplicity, effectively improves
performance, particularly when learning from TV shows, which naturally exhibit
less semantic diversity.
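To make the contrastive formulation and the sampling policy above concrete, here is a minimal sketch, assuming PyTorch and an InfoNCE-style loss with within-batch negatives. The function names, the batch layout, and the within_title_frac knob are illustrative assumptions for exposition, not details taken from the paper.

# Minimal sketch (illustrative, not the authors' implementation): a symmetric
# cross-modal InfoNCE objective plus a hierarchical (title-first, then clip)
# batch-sampling policy.
import random
import torch
import torch.nn.functional as F


def audio_visual_infonce(video_emb, audio_emb, temperature=0.1):
    # video_emb, audio_emb: (B, D) L2-normalized clip embeddings; row i of
    # each tensor comes from the same clip, so the diagonal holds positives
    # and every off-diagonal pair in the batch acts as a negative.
    logits = video_emb @ audio_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2a = F.cross_entropy(logits, targets)            # video queries audio
    loss_a2v = F.cross_entropy(logits.t(), targets)        # audio queries video
    return 0.5 * (loss_v2a + loss_a2v)


def hierarchical_batch(clips_by_title, batch_size, within_title_frac=0.25):
    # clips_by_title: dict mapping a movie/show id to its list of clip ids.
    # First sample distinct titles and one clip per title, then top the batch
    # up with extra clips from the already-chosen titles so that some
    # negatives share a title's recurring characters and mise-en-scène rather
    # than all coming from unrelated content.
    n_cross = max(1, int(batch_size * (1.0 - within_title_frac)))
    titles = random.sample(list(clips_by_title),
                           k=min(n_cross, len(clips_by_title)))
    batch = [random.choice(clips_by_title[t]) for t in titles]
    while len(batch) < batch_size:
        t = random.choice(titles)
        batch.append(random.choice(clips_by_title[t]))
    return batch

The within_title_frac parameter is hypothetical; it simply exposes the lever a hierarchical sampler controls, namely the mix of within-title and cross-title negatives in each batch.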
Related papers
- Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Speech representation learning: Learning bidirectional encoders with
single-view, multi-view, and multi-task methods [7.1345443932276424]
This thesis focuses on representation learning for sequence data over time or space.
It aims to improve downstream sequence prediction tasks by using the learned representations.
arXiv Detail & Related papers (2023-07-25T20:38:55Z) - What You Say Is What You Show: Visual Narration Detection in
Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z) - Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video Transcript for ASR (TVTS), which sorts scripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z) - On Negative Sampling for Audio-Visual Contrastive Learning from Movies [12.967364755951722]
We study the efficacy of audio-visual self-supervised learning from uncurated long-form content, i.e., movies.
Our empirical findings suggest that, with certain modifications, training on uncurated long-form videos yields representations which transfer competitively with the state-of-the-art.
arXiv Detail & Related papers (2022-04-29T20:36:13Z) - The Impact of Spatiotemporal Augmentations on Self-Supervised
Audiovisual Representation Learning [2.28438857884398]
We present a contrastive framework to learn audiovisual representations from unlabeled videos.
We find that lossy temporal transformations that do not corrupt the temporal coherency of videos are the most effective.
Compared to self-supervised models pre-trained with only sampling-based temporal augmentation, models pre-trained with our temporal augmentations achieve an approximately 6.5% gain in linear-evaluation performance on the AVE dataset.
arXiv Detail & Related papers (2021-10-13T23:48:58Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance similar to that of models trained on existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
arXiv Detail & Related papers (2020-10-21T11:56:22Z) - Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and building on it we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)