The Impact of Spatiotemporal Augmentations on Self-Supervised
Audiovisual Representation Learning
- URL: http://arxiv.org/abs/2110.07082v1
- Date: Wed, 13 Oct 2021 23:48:58 GMT
- Title: The Impact of Spatiotemporal Augmentations on Self-Supervised
Audiovisual Representation Learning
- Authors: Haider Al-Tahan and Yalda Mohsenzadeh
- Abstract summary: We present a contrastive framework to learn audiovisual representations from unlabeled videos.
We find that lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective.
Compared to self-supervised models pre-trained with only sampling-based temporal augmentation, models pre-trained with our temporal augmentations achieve an approximately 6.5% gain in linear classifier performance on the AVE dataset.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive learning of auditory and visual perception has been extremely
successful when investigated individually. However, there are still major
questions on how we could integrate principles learned from both domains to
attain effective audiovisual representations. In this paper, we present a
contrastive framework to learn audiovisual representations from unlabeled
videos. The type and strength of augmentations utilized during self-supervised
pre-training play a crucial role in how effectively contrastive frameworks
work. Hence, we extensively investigate the composition of temporal
augmentations suitable for learning audiovisual representations; we find that
lossy spatio-temporal transformations that do not corrupt the temporal
coherency of videos are the most effective. Furthermore, we show that the
effectiveness of these transformations scales with higher temporal resolution
and stronger transformation intensity. Compared to self-supervised models
pre-trained with only sampling-based temporal augmentation, models pre-trained
with our temporal augmentations achieve an approximately 6.5% gain in linear
classifier performance on the AVE dataset. Lastly, we show that despite their
simplicity, our proposed transformations work well across self-supervised
learning frameworks (SimSiam, MoCoV3, etc.) and on the benchmark audiovisual
dataset (AVE).
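To make "lossy spatio-temporal transformations that preserve temporal coherency" concrete, here is a minimal sketch of one such augmentation: a single randomly sampled crop and brightness factor applied to every frame of a clip, so each frame is degraded but the motion across frames stays consistent. This is an illustrative PyTorch example, not the authors' implementation; the function name, tensor shapes, and parameter values are assumptions.

```python
import torch

def coherent_spatiotemporal_augment(clip, crop_size=112, jitter=0.4):
    """Apply one randomly sampled crop and brightness jitter to every frame
    of a clip (shape (T, C, H, W), values in [0, 1]), so the transformation
    is lossy spatially but does not corrupt temporal coherency."""
    t, c, h, w = clip.shape

    # Sample the crop location once and reuse it for all frames.
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    clip = clip[:, :, top:top + crop_size, left:left + crop_size]

    # Sample one brightness factor and apply it uniformly across frames.
    factor = 1.0 + (2 * torch.rand(1).item() - 1) * jitter
    return (clip * factor).clamp(0.0, 1.0)

# Two independently augmented views of the same 16-frame clip, as a
# contrastive objective would consume them.
clip = torch.rand(16, 3, 128, 171)
view_a = coherent_spatiotemporal_augment(clip)
view_b = coherent_spatiotemporal_augment(clip)
```

The key design choice is that the random parameters are drawn per clip rather than per frame; drawing them per frame would scramble the temporal structure the paper finds important to preserve.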
Related papers
- Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z)
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
Experimental results show that the proposed method significantly improves the learned representations compared to vanilla audio-visual contrastive learning; a minimal sketch of this co-augmentation idea appears after this list.
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation [32.68710772281511]
We present a self-supervised framework for audio-visual representation learning, to localize the sound source in videos.
Our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound.
This reveals that the proposed framework learns strong multi-modal representations that benefit sound localisation and generalize to further applications.
arXiv Detail & Related papers (2022-06-26T03:00:02Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows [6.247268652296234]
We study the efficacy of learning from Movies and TV Shows as forms of uncurated data for audio-visual self-supervised learning.
We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, dramatically outperforms more complex methods.
arXiv Detail & Related papers (2021-06-16T02:00:11Z)
- Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and, building on it, we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
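As noted in the Speed Co-Augmentation entry above, here is a sketch of the co-augmentation idea: one randomly chosen playback speed is applied to both modalities, by subsampling video frames and resampling the audio waveform by the same factor. This is an illustrative PyTorch example under assumed tensor shapes, not the implementation from that paper.

```python
import torch
import torch.nn.functional as F

def speed_co_augment(frames, waveform, speeds=(0.5, 1.0, 1.5, 2.0)):
    """Randomly pick a playback speed and apply it to both modalities.
    frames: (T, C, H, W) video clip; waveform: (channels, samples) audio."""
    speed = speeds[torch.randint(len(speeds), (1,)).item()]

    # Video: keep frame indices spaced by the chosen speed factor.
    t = frames.shape[0]
    idx = torch.arange(0, t, speed).long().clamp(max=t - 1)
    fast_frames = frames[idx]

    # Audio: stretch or compress the waveform to the matching duration.
    new_len = max(1, int(waveform.shape[-1] / speed))
    fast_wave = F.interpolate(waveform.unsqueeze(0), size=new_len,
                              mode="linear", align_corners=False).squeeze(0)
    return fast_frames, fast_wave, speed

# Example: a 32-frame clip paired with one second of mono 16 kHz audio.
frames = torch.rand(32, 3, 112, 112)
wave = torch.rand(1, 16000)
fast_frames, fast_wave, speed = speed_co_augment(frames, wave)
```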