Audio-Visual Contrastive Learning with Temporal Self-Supervision
- URL: http://arxiv.org/abs/2302.07702v1
- Date: Wed, 15 Feb 2023 15:00:55 GMT
- Title: Audio-Visual Contrastive Learning with Temporal Self-Supervision
- Authors: Simon Jenni, Alexander Black, John Collomosse
- Abstract summary: We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
- Score: 84.11385346896412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images, which capture the static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between videos and their temporally corresponding audio clips. We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments to action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, with state-of-the-art results.
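To make the temporal pretext tasks concrete, below is a minimal sketch of how playback-speed and direction labels can be generated by temporal subsampling and reversal, with linear heads on top of a clip embedding. The speed set, clip length, feature dimension, and head design are illustrative assumptions, not the authors' exact setup:

```python
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4, 8]  # assumed set of playback strides (speeds)

def make_speed_direction_example(video, speed_idx, reverse, clip_len=16):
    """Build one pretext-task example from a video tensor of shape (T, C, H, W).

    The label for the speed task is speed_idx itself; the label for the
    direction task is 1 if the clip is reversed. Assumes the video has at
    least clip_len * SPEEDS[speed_idx] frames.
    """
    stride = SPEEDS[speed_idx]
    max_start = video.shape[0] - clip_len * stride
    start = torch.randint(0, max_start + 1, (1,)).item()
    clip = video[start : start + clip_len * stride : stride]  # subsampling = faster playback
    if reverse:
        clip = torch.flip(clip, dims=[0])                     # backward playback
    return clip, speed_idx, int(reverse)

class TemporalHeads(nn.Module):
    """Linear heads predicting playback speed and direction from a clip embedding."""
    def __init__(self, feat_dim=512, num_speeds=len(SPEEDS)):
        super().__init__()
        self.speed = nn.Linear(feat_dim, num_speeds)
        self.direction = nn.Linear(feat_dim, 2)

    def forward(self, feats):                                 # feats: (B, feat_dim)
        return self.speed(feats), self.direction(feats)
```

Since the paper poses these tasks in both modalities, the same construction would apply to the audio stream as well, e.g. by subsampling and reversing spectrogram frames.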
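The contrastive objective can be read as an InfoNCE loss whose positive set is grown per sample with neighbors mined from the current, evolving feature space, while the remaining batch entries serve as negatives. The sketch below illustrates that reading for the video-to-audio direction; the nearest-neighbor mining rule, temperature, and weighting are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def audio_visual_nce(video_emb, audio_emb, temperature=0.1, extra_weight=0.5):
    """InfoNCE between temporally aligned video/audio embeddings, augmented
    with a sample-dependent extra positive mined from the evolving features.

    video_emb, audio_emb: (B, D), L2-normalized; row i of each comes from
    the same time span of the same video.
    """
    sim = video_emb @ audio_emb.t() / temperature     # (B, B) cross-modal similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    loss = F.cross_entropy(sim, labels)               # aligned clip/audio pair is the positive

    # Mine each clip's nearest same-modality neighbor (excluding itself)
    # from the current feature space and treat its audio as an extra positive;
    # all remaining entries act as negatives.
    with torch.no_grad():
        vv = video_emb @ video_emb.t()
        vv.fill_diagonal_(float("-inf"))
        neighbor = vv.argmax(dim=1)
    loss_extra = F.cross_entropy(sim, neighbor)
    return loss + extra_weight * loss_extra
```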
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846] (arXiv 2024-09-27)
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
- Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition [29.414663568089292] (arXiv 2024-07-04)
Audio-visual speech recognition aims to transcribe human speech using both audio and video modalities.
In this study, we strengthen the video features by learning three types of temporal dynamics in the video data.
We achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks in noise-dominant settings.
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283] (arXiv 2022-07-19)
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs on randomly shuffled frames.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
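The low-confidence constraint on shuffled frames described above has a simple reading: push the model's output distribution on order-destroyed clips toward uniform. A minimal sketch under that reading (the KL form and the weighting `lam` are assumptions, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def low_confidence_loss(logits_shuffled):
    """Push predictions on temporally shuffled clips toward the uniform
    distribution, i.e. penalize confident outputs when frame order is destroyed."""
    log_probs = F.log_softmax(logits_shuffled, dim=1)
    uniform = torch.full_like(log_probs, 1.0 / logits_shuffled.size(1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")

# Hypothetical usage with a clip tensor of shape (B, T, C, H, W):
# shuffled = clip[:, torch.randperm(clip.size(1))]
# total = F.cross_entropy(model(clip), labels) + lam * low_confidence_loss(model(shuffled))
```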
- AudioVisual Video Summarization [103.47766795086206] (arXiv 2021-05-17)
In video summarization, existing approaches exploit only the visual information and neglect the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915] (arXiv 2021-05-03)
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship between the audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
- Audiovisual Highlight Detection in Videos [78.26206014711552] (arXiv 2021-02-11)
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198] (arXiv 2020-11-03)
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
- Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video [23.478555947694108] (arXiv 2020-03-05)
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video.
We learn the meta-temporal visual variations in the video by leveraging how the visual appearance varies with playback speed.
We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks.
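The temporal group normalization idea lends itself to a GroupNorm-style sketch in which groups are formed along the temporal axis of 3D-CNN features. This is a generic reading of the concept, not the paper's layer-dependable scheme; the grouping and affine choices below are assumptions:

```python
import torch
import torch.nn as nn

class TemporalGroupNorm(nn.Module):
    """Normalize 3D-CNN features within groups formed along the temporal axis.

    The input (B, C, T, H, W) is split into `num_groups` chunks over T, and
    each chunk is normalized over its (C, T_chunk, H, W) extent.
    """
    def __init__(self, num_channels, num_groups=4, eps=1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        g = self.num_groups
        xg = x.view(b, c, g, t // g, h, w)      # assumes T is divisible by num_groups
        mean = xg.mean(dim=(1, 3, 4, 5), keepdim=True)
        var = xg.var(dim=(1, 3, 4, 5), keepdim=True, unbiased=False)
        xg = (xg - mean) / torch.sqrt(var + self.eps)
        x = xg.view(b, c, t, h, w)
        return x * self.weight.view(1, c, 1, 1, 1) + self.bias.view(1, c, 1, 1, 1)
```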
- Curriculum Audiovisual Learning [113.20920928789867] (arXiv 2020-01-26)
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and based on it we achieve comparable performance in sound separation without relying on external visual supervision.